RR01. Preliminaries

Types of Learning
Learning paradigms are generally divided into the following three types.

Learning?

What is learning?

Supervised Learning

Summary

In Korean, ‘supervised’ is usually translated as ‘지도’ (guidance); accordingly, it refers to learning performed with the correct answers (labels) available.

Example

Classification

Regression

LDA(Linear Discriminant Analysis) projection (dimension reduction)

translation

captioning, etc.


Unsupervised Learning

Summary

Learning performed when there are no ground-truth labels.
Most real-world data is unlabeled, so this paradigm fits such data better.
Related terms: representation learning, dimension reduction

Representation Learning

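Can be used in a context similar to unsupervised learning, but it is slightly different.
It means learning representations, and since something like supervised fine-tuning can also be classified that way,
the two are not in a strict subset relation; they overlap.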

Dimension Reduction

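Refers to methods that reduce the dimensionality of data.
That is, it is typically both unsupervised and a form of representation learning (supervised variants such as LDA also exist).

Why do dimension reduction?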

Visualization

Computer Resource

Example

Clustering

factorization

kernel estimation

PCA (Principal Component Analysis) (representation learning, dimension reduction)

AutoEncoder (representation learning)
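As a rough illustration of dimension reduction, PCA can be sketched with an SVD of the centered data. This is a minimal sketch, not a reference implementation: the function name `pca`, the toy data shape, and the choice of `k=2` are assumptions made only for this example.

```python
import numpy as np

def pca(X: np.ndarray, k: int) -> np.ndarray:
    """Project the rows of X onto the top-k principal components."""
    X_centered = X - X.mean(axis=0)  # PCA works on zero-mean features
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # toy data: 100 samples, 5 features (assumed)
Z = pca(X, k=2)                # reduced to 2 dimensions, e.g. for plotting
```

Reducing to two dimensions like this covers the visualization motivation above; a smaller `Z` also costs less to store and process.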


Reinforcement Learning

Summary

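A technique that learns a function so as to maximize a value called the reward over the long term.
The basic paradigm is to place an Agent in an Env (environment) and observe the interaction between the two.
The information exchanged is as follows: when the agent takes an action, the state transitions, the env computes a reward, and the agent learns as it keeps taking actions.

e.g., games such as Go and chess.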

Important

Because the goal is to maximize long-term reward, the agent is willing to accept short-term losses (i.e., it sacrifices short-term reward).
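One common way to make "long-term reward" concrete is the discounted return. The sketch below is only an illustration under assumptions: the discount factor `gamma` and the two toy reward sequences are not from the notes.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards where a reward t steps ahead is weighted by gamma**t."""
    total = 0.0
    for r in reversed(rewards):  # accumulate from the last step backwards
        total = r + gamma * total
    return total

# Hypothetical trajectories: one grabs a small reward now,
# the other pays a cost first and collects a large reward later.
greedy = [1.0, 0.0, 0.0, 0.0]
patient = [-1.0, 0.0, 0.0, 10.0]

greedy_return = discounted_return(greedy)
patient_return = discounted_return(patient)
```

Under these assumed numbers the patient trajectory has the higher return despite its initial loss, which is the trade-off described above.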


Comparison

First of all, all three learn a function.

After all, the goal of ML is to find a function that explains the desired data well.
→ In DL, this function is determined by the model parameters.


Canvas: DL - typical Workflow

Data preparation (DL-workflow)

Data preparation

The stage of building the dataset
: preparing the train-set, validation-set, and test-set

When data is scarce, train and validation are sometimes pooled and reused, as in K-Fold Cross-Validation.
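The K-Fold idea can be sketched as follows: the pooled train/validation data is cut into `k` folds, and each fold takes one turn as the validation set. The helper name `k_fold_indices` and the sizes used are assumptions for illustration only.

```python
import numpy as np

def k_fold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (train_idx, val_idx) pairs; each fold is used once for validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)  # shuffle before splitting
    folds = np.array_split(indices, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
```

The test-set is kept out of this loop entirely, so it remains unseen data for the final evaluation.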

Generalization

The ability of an ML model to perform well on unseen data.
→ This is a key to reaching AGI. If you are interested in the system-level vs. developer-level distinction, see
On the Measure of Intelligence


Building Models (DL-workflow)

Building Models

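The stage of deciding which model to use, among many, according to the structure and type of the data.
For deep learning this will be an NN (Neural Network), but there are a great many kinds of NNs as well.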

Example

Random Forest

Naive Bayes

Nearest Neighbor(k-NN)

Support Vector Machine(SVM)

Neural Network(NNs)


Neural Network

Neural Network(MLP: Multi-Layer-Perceptron)

Starts from the perceptron and stacks it into multiple layers.

Warning

Note 1. The layers are connected through linear / non-linear functions.
Note 2. “pattern”, “embedding”, “weights”, “feature representation”, and “feature vectors” all refer to similar things.
! Note 3 !. When counting the number of layers, the input layer is excluded.
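A minimal forward pass makes the layer-counting convention concrete. The layer sizes, the use of ReLU between layers, and the name `mlp_forward` are assumptions for this sketch.

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Alternate linear maps with a non-linearity (none after the last layer)."""
    h = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = W @ h + b                # linear part of the layer
        if i < len(weights) - 1:
            h = np.maximum(0, h)     # non-linear part (ReLU, assumed here)
    return h

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]  # input, two hidden layers, output (assumed sizes)
weights = [rng.normal(size=(o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]

y = mlp_forward(rng.normal(size=4), weights, biases)
n_layers = len(weights)  # 3: one per weight matrix, the input layer is not counted
```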

Why Do We Need Non-linearity?
Non-Linearity

Nonlinearity

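To handle the complex functions of the complex real world.

Matrix multiplication (matmul) is ultimately linear, so unless a non-linear function is placed in between, the network can never leave a single large linear system. The world surely cannot be explained by linear functions alone.
→ Repeating linear operations still yields a linear operation.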

Numerical Proof

Assume that,

$f = W_{2} \cdot max (0, W_{1} \cdot x)$
where $x \in R^{D}$ , $W_{1} \in R^{H \times D}$ , $W_{2} \in R^{C \times H}$

ReLU

$ReLU (x) = max (0, x)$
one of the non-linear activations

If the activation is removed, the following two models become indistinguishable:

$f = W_{2} \cdot W_{1} \cdot x$
$f = W_{3} \cdot x$
where $W_{3} = W_{2} \cdot W_{1} \in R^{C \times D}$

→ In the end, it is still linear.
→ So let's use a non-linear function as the activation!!
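The collapse can be checked numerically. The matrices below are small hand-picked values (an assumption for illustration, with $D = 2$, $H = 3$, $C = 2$), chosen so that one hidden pre-activation is negative.

```python
import numpy as np

W1 = np.array([[1.0, -2.0],
               [0.0,  1.0],
               [-1.0, 1.0]])       # H x D
W2 = np.array([[1.0, 0.0, 2.0],
               [-1.0, 1.0, 0.0]])  # C x H
x = np.array([1.0, 1.0])

W3 = W2 @ W1                       # C x D: the two linear layers merged into one
linear_two_layer = W2 @ (W1 @ x)
linear_one_layer = W3 @ x
with_relu = W2 @ np.maximum(0, W1 @ x)

collapses_without_activation = np.allclose(linear_two_layer, linear_one_layer)
collapses_with_relu = np.allclose(with_relu, linear_one_layer)
```

Without the activation the two-layer model equals the single matrix `W3` exactly; with ReLU in between, the output differs for this input, so the model is no longer a single linear map.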

Linearly Separable ← Think about this more: why is it good?

Important

A non-linear map can transform data so that it becomes linearly separable.
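XOR is a standard small example: no line separates its two classes in the original 2D space, but adding the product feature $x_{1} x_{2}$ makes a linear decision rule sufficient. The feature map and the hand-picked weights below are assumptions for illustration, not a learned model.

```python
import numpy as np

# XOR: not linearly separable in the original 2D space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def feature_map(X: np.ndarray) -> np.ndarray:
    """Non-linear map (x1, x2) -> (x1, x2, x1 * x2)."""
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

# A hand-picked linear classifier in the mapped space
w = np.array([1.0, 1.0, -2.0])
b = -0.5
pred = (feature_map(X) @ w + b > 0).astype(int)
```

Here `pred` matches `y` on all four points. In a neural network the hidden layers play the role of `feature_map`, except that the map is learned.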

Examples
Activation Functions

To do later: look into the GeLU family, which has recently been the most widely used in transformer-style models.

Summary

No one-size-fits-all: Choice of activation function depends on problem.

We only showed the most common ones; many more exist.

The best activation function / model is often found by trial and error in practice.

It is important to ensure a good “gradient flow” during optimization

Rule of Thumb

Use ReLU by default (with a small enough lr)

Try Leaky ReLU, Maxout, ELU for some small additional gain

Prefer Tanh over sigmoid (Tanh is often used in RNNs)
ReLU

ReLU(Rectified Linear Unit)

One of the activation functions, and among the most widely used.
$g (x) = max (x, 0)$

Does not saturate(for $x > 0$ )

Leads to fast convergence

Computationally efficient

Problems

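No learning for $x < 0$ → dead/dying ReLU

The downstream gradient is 0 (when the input is 0 or below).

Often initialized with a positive bias ( $b > 0$ ) to mitigate this.

Outputs are not zero-centered → introduces bias after the layer

ReLU outputs and the ReLU gradient are always non-negative, which biases the updates of the model weights.

Let $y = g (\sum_{i} a_{i} x_{i} + b)$ , where the inputs $x_{i}$ are non-negative outputs of the previous layer. Since $\partial L / \partial a_{i} = (\partial L / \partial g) \cdot g' \cdot x_{i}$ with $g' \geq 0$ and $x_{i} \geq 0$ , every non-zero gradient satisfies $sgn (\partial L / \partial a_{i})$ = $sgn (\partial L / \partial g)$

So all weight gradients of a neuron share the same sign.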

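Implemented as follows.
ReLU
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    # Element-wise max(0, x); also works for Python scalars.
    return np.maximum(0, x)
Variants such as Leaky ReLU exist to compensate for these drawbacks.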
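For reference, a minimal Leaky ReLU sketch: it keeps a small slope for negative inputs so the gradient never becomes exactly zero. The slope `alpha=0.01` is a commonly used default and is assumed here; it is not a value given in these notes.

```python
import numpy as np

def leaky_relu(x: np.ndarray, alpha: float = 0.01) -> np.ndarray:
    """x for x > 0, alpha * x otherwise."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x: np.ndarray, alpha: float = 0.01) -> np.ndarray:
    """Slope is 1 for x > 0 and alpha otherwise, so units cannot fully 'die'."""
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, 0.0, 3.0])
out = leaky_relu(x)
grad = leaky_relu_grad(x)
```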

How can we compute the gradient of a non-differentiable function (like ReLU)?

→ Subgradient
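At $x = 0$ any value in $[0, 1]$ is a valid subgradient of ReLU, so an implementation simply picks one. The sketch below uses 0 by default, which is a convention assumed here; the function name and the `at_zero` parameter are hypothetical.

```python
import numpy as np

def relu_subgradient(x: np.ndarray, at_zero: float = 0.0) -> np.ndarray:
    """One valid subgradient of ReLU: 0 for x < 0, 1 for x > 0, at_zero at x == 0."""
    x = np.asarray(x, dtype=float)
    grad = np.where(x > 0, 1.0, 0.0)
    return np.where(x == 0, at_zero, grad)

g_default = relu_subgradient(np.array([-1.0, 0.0, 2.0]))
g_half = relu_subgradient(np.array([-1.0, 0.0, 2.0]), at_zero=0.5)
```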

Sigmoid

Sigmoid

One of the activation functions, and among the most widely used.
$g (x) = \frac{1}{1 + exp ( - x )}$

range: $(0, 1)$

Neuroscience interpretation as saturating “firing rate” of neurons

Problems

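Saturation “kills” gradient : the gradient vanishes as the absolute value of the input grows.

In particular, the deeper the network, the more terms the chain rule multiplies together, so the layers closer to the input may fail to learn.

Outputs are not zero-centered → introduces bias after the layer

Sigmoid outputs and the sigmoid gradient are always positive, which biases the updates of the model weights.

Let $y = g (\sum_{i} a_{i} x_{i} + b)$ , where the inputs $x_{i}$ are positive outputs of the previous layer. Since $\partial L / \partial a_{i} = (\partial L / \partial g) \cdot g' \cdot x_{i}$ with $g' > 0$ and $x_{i} > 0$ , we get $sgn (\partial L / \partial a_{i})$ = $sgn (\partial L / \partial g)$

So all weight gradients of a neuron share the same sign.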
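The saturation problem can be seen from the derivative $g' (x) = g (x) (1 - g (x))$ , which peaks at 0.25. The sketch below evaluates it at a few assumed inputs and computes the best-case product over 10 stacked sigmoids (the depth is also an assumption for illustration).

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x: np.ndarray) -> np.ndarray:
    """Derivative of the sigmoid: g(x) * (1 - g(x)), at most 0.25 (at x = 0)."""
    s = sigmoid(x)
    return s * (1 - s)

grads = sigmoid_grad(np.array([0.0, 5.0, 10.0]))  # shrinks quickly as |x| grows

# Even in the best case (0.25 per layer), the activation-derivative factor
# contributed by 10 stacked sigmoids is tiny.
best_case_10_layers = 0.25 ** 10
```

The weight matrices also enter the chain-rule product, so this is only the part contributed by the activations.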

Implemented as follows.
Sigmoid
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    # Element-wise logistic function 1 / (1 + exp(-x)).
    return 1 / (1 + np.exp(-x))
tanh

tanh(hyperbolic tangent)

One of the activation functions, and among the most widely used.
$g (x) = \frac{2}{1 + exp ( - 2 x )} - 1$

range: $(- 1, 1)$

anti-symmetric

zero-centered

Problems

Saturation “kills” gradient : the gradient vanishes as the absolute value of the input grows.

In particular, the deeper the network, the more terms the chain rule multiplies together, so the layers closer to the input may fail to learn.

Implemented as follows.
tanh
import numpy as np

def tanh(x: np.ndarray) -> np.ndarray:
    # (e^x - e^-x) / (e^x + e^-x), applied element-wise.
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
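The formula $g (x) = \frac{2}{1 + exp ( - 2 x )} - 1$ and the exponential-ratio implementation are the same function. A quick numerical check, with the helper name and the sample points assumed for illustration:

```python
import numpy as np

def tanh_sigmoid_form(x: np.ndarray) -> np.ndarray:
    """tanh written as 2 / (1 + exp(-2x)) - 1."""
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

x = np.linspace(-3.0, 3.0, 7)
forms_agree = np.allclose(tanh_sigmoid_form(x), np.tanh(x))
```

In practice `np.tanh` is the safer choice: the explicit exponential ratio overflows to `inf / inf` (that is, `nan`) for very large $|x|$ .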

Training Models (DL-workflow)

Summary

In this stage, the model “learns” the function f from data.
→ This is why it is called “statistical machine learning”.

Train

Training means steering the model toward a goal.
goal = loss function, objective function

So the model has to find the parameters that minimize the loss. (Optimization)
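A minimal sketch of this loop: gradient descent on an MSE loss for a one-dimensional linear model. The toy data (true slope 3.0, intercept 0.5), the learning rate, and the step count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * x + 0.5                 # noiseless toy targets (assumed)

w, b = 0.0, 0.0                   # model parameters to be learned
lr = 0.1                          # learning rate (assumed)
losses = []
for _ in range(500):
    err = (w * x + b) - y
    losses.append(np.mean(err ** 2))   # MSE loss = the "goal"
    grad_w = 2 * np.mean(err * x)      # dL/dw
    grad_b = 2 * np.mean(err)          # dL/db
    w -= lr * grad_w                   # step against the gradient
    b -= lr * grad_b
```

After the loop `w` and `b` end up close to the values used to generate the data, and `losses` decreases over the iterations.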


Evaluating Models (DL-workflow)

When to stop "learning?"

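Using the loss value itself is not very intuitive.

A small loss does not indicate generalization performance; it only indicates how well the model explains the train-set.

The value itself varies greatly with the model architecture, dataset, optimization method, and so on, which makes consistent comparison difficult.

Below are representative evaluation metrics.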

For Regression

Important

In a regression task, the ground truth (label) is a floating-point value, and the goal is for the prediction to get close to that value itself, so the loss can also serve as an evaluation metric. Accordingly,
Mean Square Error (MSE) and Mean Absolute Error (MAE) below are both losses and evaluation metrics.

That said, the evaluation metric does not have to be the same quantity as the loss; but when the training loss and the metric coincide, a decreasing loss alone lets us infer that the evaluation value is decreasing as well.

Mean Square Error(MSE)

Root Mean Square Error(RMSE)

Mean Absolute Error(MAE)

Mean Absolute Percentage Error(MAPE)

$R^{2}$
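The regression metrics listed above can be computed with their standard definitions as in the sketch below. The function name, the dictionary keys, and the toy values are assumptions for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred) -> dict:
    """MSE, RMSE, MAE, MAPE (in %), and R^2 from their usual definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)                       # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return {
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": np.mean(np.abs(err)),
        "mape": np.mean(np.abs(err / y_true)) * 100,  # undefined if y_true contains 0
        "r2": 1 - ss_res / ss_tot,
    }

m = regression_metrics(y_true=[1.0, 2.0, 4.0], y_pred=[1.0, 3.0, 4.0])
```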

For Classification

Accuracy

Recall

Precision

F1 Score

Confusion Matrix
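For binary classification, the metrics above all follow from the four confusion-matrix counts. This is a sketch with standard definitions; the function name, the `[[TN, FP], [FN, TP]]` layout, and the toy labels are assumptions.

```python
import numpy as np

def binary_classification_metrics(y_true, y_pred) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion_matrix": np.array([[tn, fp], [fn, tp]]),
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

m = binary_classification_metrics(
    y_true=[1, 1, 1, 0, 0, 0, 0, 0],
    y_pred=[1, 1, 0, 1, 0, 0, 0, 0],
)
```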


Improving Performance (DL-workflow)

Summary

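If the evaluation in the previous step indicates poor learning, performance needs to be improved.
The typical scenarios that the evaluation can reveal are as follows.

Check

Note that transfer learning can be used against both overfitting and underfitting.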

Underfitting

Underfitting

Summary

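The model can no longer reduce the loss sufficiently.
Both training loss and validation loss stay higher than they could be. → The model's capacity is insufficient.

Capacity vs Complexity

Capacity: the size of the function space the model can theoretically represent
Complexity: the expressive power of the model that was actually learned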

Help

To solve the problem, several methods are recommended.

Add more layers/units to the model: directly addresses the problem caused by insufficient model capacity.

Tweak the learning rate : an lr that is too large can also cause underfitting, so try lowering it and re-evaluate.
If the lr is too large early on, training may not progress at all, which then looks like an underfitting curve.

Train for longer: more training may be enough.

Transfer Learning: one cause of underfitting is a poor representation, which can be mitigated to some degree by taking a pre-trained model that already has rich representations and adapting it.

Use less regularization: the regularization, i.e. the penalty, may be so tight that training stalls. In that case, relax the penalty instead.
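The learning-rate point can be illustrated on the simplest possible loss, $L (w) = w^{2}$ : with a small lr the parameter moves toward the minimum, while with an lr that is too large each step overshoots and the loss never goes down. The function name, step count, and lr values are assumptions for this sketch.

```python
def minimize_quadratic(lr: float, steps: int = 20, w0: float = 1.0) -> float:
    """Run gradient descent on L(w) = w**2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2 * w      # dL/dw
        w -= lr * grad
    return w

w_small_lr = minimize_quadratic(lr=0.1)  # each step shrinks w by a factor 0.8
w_large_lr = minimize_quadratic(lr=1.1)  # each step multiplies w by -1.2: it diverges
```

From the loss curve alone, the second case can look like a model that does not learn, which is why lowering the lr is worth trying before increasing capacity.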


Overfitting

Overfitting

Summary

Training loss keeps decreasing, but validation loss does not.
If training continues, the model is fitted to a function that works well only on the train-set,
so generalization degrades and the loss on the validation-set or unseen data grows.

Help

To solve the problem, several methods are recommended.

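Get more data: practically the best option, but there is a cost issue.
It gives the model more opportunities to learn the patterns.

Data Augmentation : more realistic than collecting more data.
It adds diversity to the train-set.

Better Data: remove low-quality data.

Transfer Learning: fine-tune a pre-trained model on a dataset prepared to suit the task.

Simplify model: the model has enough capacity that it learned overly specific patterns of the train-set, so reduce the model's complexity to secure generalization performance.

Learning Rate decay: late in training the model is making fine adjustments on small gradients, so decaying the lr keeps those late updates small and helps mitigate overfitting.

Early Stopping: stop before the model overfits.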
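Early stopping is commonly implemented with a patience counter on the validation loss. The sketch below works on a pre-recorded list of validation losses; the function name, the `patience` value, and the loss values are assumptions for illustration.

```python
def early_stopping(val_losses, patience: int = 2):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    stop_epoch = len(val_losses) - 1   # default: trained to the end
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                stop_epoch = epoch     # no improvement for `patience` epochs
                break
    return best_epoch, stop_epoch

# Validation loss falls, then starts rising again (the overfitting pattern)
best_epoch, stop_epoch = early_stopping([1.0, 0.8, 0.7, 0.72, 0.75, 0.9])
```

In a real training loop one would also keep a copy of the parameters from `best_epoch` and restore them after stopping.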


Juhyeon's Blog
