Nonlinearity

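To handle the complex functions that describe the complex real world.

A matmul is ultimately linear, so unless a non-linear function is inserted between the layers, the network can never escape being one big linear system. Surely the world cannot be explained by linear functions alone.
→ Composing linear operations still yields a linear operation.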

Numerical Proof

Assume that,

$f = W_{2} \cdot max (0, W_{1} \cdot x)$
where $x \in R^{D}$ , $W_{1} \in R^{H \times D}$ , $W_{2} \in R^{C \times H}$

ReLU

$ReLU (x) = max (0, x)$
one of the non-linear activation functions

Without the non-linearity, the following two models cannot be distinguished:

$f = W_{2} \cdot W_{1} \cdot x$
$f = W_{3} \cdot x$
where $W_{3} = W_{2} \cdot W_{1} \in R^{C \times D}$

→ In the end, it is still linear.
→ So let's use a non-linear function as the activation!
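The collapse argument above can be checked numerically. The following is a minimal sketch, assuming small hypothetical sizes `D`, `H`, `C` and random weights (none of these values come from the notes): two stacked linear layers match a single layer with $W_{3} = W_{2} \cdot W_{1}$, while the version with ReLU in between fails a basic linearity check.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C = 4, 5, 3  # hypothetical input, hidden, and output sizes

W1 = rng.standard_normal((H, D))
W2 = rng.standard_normal((C, H))
x = rng.standard_normal(D)

# Two stacked linear layers ...
two_layer = W2 @ (W1 @ x)

# ... equal one linear layer with W3 = W2 W1, which has shape (C, D).
W3 = W2 @ W1
one_layer = W3 @ x

def f_relu(v: np.ndarray) -> np.ndarray:
    # Same two layers, but with ReLU in between.
    return W2 @ np.maximum(0, W1 @ v)

print(W3.shape)                                # (3, 4)
print(np.allclose(two_layer, one_layer))       # True
# A linear map must satisfy f(-x) = -f(x); the ReLU model does not.
print(np.allclose(f_relu(-x), -f_relu(x)))
```

The last check relies on $f(x) + f(-x) = W_{2} \cdot |W_{1} \cdot x|$, which is non-zero for generic random weights.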

Linearly Separable ← Think about this some more: why is it good?

Important

A non-linear map can transform data so that it becomes linearly separable.

Examples
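As one illustration, consider two concentric circles: no straight line in the original plane separates them, because the inner class lies inside the convex hull of the outer class. The sketch below assumes hypothetical toy data (radii 1 and 3) and a hand-picked feature map $x_{1}^{2} + x_{2}^{2}$; a network would have to learn such a map rather than be given it.

```python
import numpy as np

# Hypothetical toy data: two concentric circles (radius 1 and radius 3).
theta = np.linspace(0.0, 2.0 * np.pi, 50, endpoint=False)
inner = np.stack([1.0 * np.cos(theta), 1.0 * np.sin(theta)], axis=1)  # class 0
outer = np.stack([3.0 * np.cos(theta), 3.0 * np.sin(theta)], axis=1)  # class 1

def squared_radius(points: np.ndarray) -> np.ndarray:
    # Non-linear feature map: (x1, x2) -> x1^2 + x2^2
    return np.sum(points ** 2, axis=1)

# In the new 1-D feature space, a single threshold separates the classes.
threshold = 4.0
print(np.all(squared_radius(inner) < threshold))  # True
print(np.all(squared_radius(outer) > threshold))  # True
```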

Activation Functions

TODO: look up the GELU family later; it is the most widely used in recent transformer models.
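As a starting point for that lookup, here is a minimal sketch of GELU, defined as $x \cdot \Phi (x)$ with $\Phi$ the standard normal CDF, next to its common tanh approximation. The function names `gelu_exact` and `gelu_tanh` are made up for this note and are not any library's API.

```python
import math
import numpy as np

def gelu_exact(x: float) -> float:
    # GELU(x) = x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: np.ndarray) -> np.ndarray:
    # Common tanh-based approximation of GELU.
    inner = np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + np.tanh(inner))

xs = np.linspace(-3.0, 3.0, 7)
exact = np.array([gelu_exact(float(v)) for v in xs])

# Largest gap between the exact form and the approximation on this grid.
print(np.max(np.abs(exact - gelu_tanh(xs))))
```

Unlike ReLU, GELU is smooth around 0 and lets small negative values through slightly attenuated instead of cutting them off.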

Summary

No one-size-fits-all: the choice of activation function depends on the problem.

We only showed the most common ones; many more exist.

The best activation function / model is often found by trial and error in practice

It is important to ensure a good “gradient flow” during optimization

Rule of Thumb

Use ReLU by default (with a small enough lr)

Try Leaky ReLU, Maxout, ELU for some small additional gain

Prefer tanh over sigmoid (tanh is often used in RNNs)
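For reference, the ReLU variants named in the rule of thumb can be sketched as follows. The default `alpha` values and the `(k, D)` layout of the Maxout weights are assumptions made for this sketch, not values from the notes.

```python
import numpy as np

def leaky_relu(x: np.ndarray, alpha: float = 0.01) -> np.ndarray:
    # A small slope alpha for x <= 0 keeps a non-zero gradient there.
    return np.where(x > 0, x, alpha * x)

def elu(x: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    # Saturates smoothly towards -alpha for very negative x.
    # np.minimum keeps expm1 from overflowing on large positive x.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def maxout(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> float:
    # Maximum over k affine pieces; W has shape (k, D), b has shape (k,).
    return np.max(W @ x + b)

x = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(x))
print(elu(x))

# With the pieces x and -x, maxout reproduces |x|.
print(maxout(np.array([-2.0]), np.array([[1.0], [-1.0]]), np.zeros(2)))
```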
ReLU

ReLU (Rectified Linear Unit)

One of the most widely used activation functions.
$g (x) = max (x, 0)$

Does not saturate (for $x > 0$)

Leads to fast convergence

Computationally efficient

Problems

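No learning for $x < 0$ → dead/dying ReLU

The downstream gradient is 0 whenever the input is 0 or below.

Often initialized with a positive bias ($b > 0$)

Outputs are not zero-centered → introduces bias after the layer

As with sigmoid, the ReLU output and its gradient are never negative, which biases the updates of the model weights.

Let $y = g (ax + b)$, where the inputs $x_{i}$ are outputs of the previous layer and hence non-negative. Since $\partial L / \partial a_{i} = (\partial L / \partial g) \cdot g' (ax + b) \cdot x_{i}$ and $g' \geq 0$, every non-zero $\partial L / \partial a_{i}$ satisfies $sgn (\partial L / \partial a_{i})$ = $sgn (\partial L / \partial g)$

All weight gradients of a neuron therefore share the same sign.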
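The dead ReLU problem can be made concrete with a single neuron. In this sketch the inputs, the weight `w`, and both bias values are hypothetical: once the bias is negative enough that the pre-activation is below 0 for every input, the local gradient is 0 everywhere and the neuron cannot recover.

```python
import numpy as np

def relu_grad(z: np.ndarray) -> np.ndarray:
    # Derivative of max(0, z): 1 where z > 0, else 0.
    return (z > 0).astype(float)

x = np.linspace(-1.0, 1.0, 5)   # hypothetical inputs in [-1, 1]
w = 0.5                         # hypothetical weight

dead = relu_grad(w * x - 2.0)   # bias so negative that z < 0 for every input
alive = relu_grad(w * x + 0.1)  # small positive bias keeps some inputs active

print(dead.sum())    # 0.0: no gradient reaches w or b
print(alive.sum())   # 3.0: three of the five inputs still pass gradient
```

This is the motivation behind initializing with a small positive bias: it keeps more pre-activations in the region where the gradient is 1.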

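Implemented as follows:
import numpy as np
def relu(x: np.ndarray) -> np.ndarray:
	# Element-wise max(0, x); works for scalars and arrays.
	return np.maximum(0, x)
Variants such as Leaky ReLU exist to address these drawbacks.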

How can we compute the gradient of a non-differentiable function (like ReLU)?

→ Subgradient
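For ReLU the only problematic point is $x = 0$, where any value in $[0, 1]$ is a valid subgradient. The sketch below uses a hypothetical helper `relu_subgradient` with an `at_zero` parameter; picking 0 at that point is a common convention and is assumed here as the default.

```python
import numpy as np

def relu_subgradient(x: np.ndarray, at_zero: float = 0.0) -> np.ndarray:
    # 1 for x > 0, 0 for x < 0, and a chosen value in [0, 1] at x == 0.
    return np.where(x > 0, 1.0, np.where(x < 0, 0.0, at_zero))

x = np.array([-1.0, 0.0, 2.0])
print(relu_subgradient(x))               # gradient 0 at x == 0
print(relu_subgradient(x, at_zero=0.5))  # another valid choice
```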

Sigmoid


One of the most widely used activation functions.
$g (x) = \frac{1}{1 + exp ( - x )}$

range: $(0, 1)$

Neuroscience interpretation as saturating “firing rate” of neurons

Problems

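Saturation “kills” the gradient: the gradient vanishes as the absolute value of the input grows.

The deeper the network, the more terms the chain rule multiplies together, so the layers closest to the input are the most likely to stop learning.

Outputs are not zero-centered → introduces bias after the layer

Sigmoid and its gradient are always positive, which biases the updates of the model weights.

Let $y = g (ax + b)$, where the inputs $x_{i}$ are outputs of a previous sigmoid layer and hence positive. Since $\partial L / \partial a_{i} = (\partial L / \partial g) \cdot g' (ax + b) \cdot x_{i}$ and $g' > 0$, $sgn (\partial L / \partial a_{i})$ = $sgn (\partial L / \partial g)$

All weight gradients of a neuron therefore share the same sign.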
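The saturation effect is easy to quantify: the derivative $g' (x) = g (x) (1 - g (x))$ peaks at 0.25 for $x = 0$ and decays quickly as $|x|$ grows. The sketch below assumes a hypothetical depth of 10 layers and looks only at the activation derivatives, ignoring the weight matrices that also enter the chain rule.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # g'(x) = g(x) * (1 - g(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))   # shrinks quickly as |x| grows

# Even in the best case (0.25 per layer), the activation factors alone
# scale the gradient by at most 0.25 ** 10 after 10 sigmoid layers.
print(0.25 ** 10)
```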
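The same-sign claim can be verified for a single neuron. This is a sketch under assumed values: the inputs are taken to be outputs of a previous sigmoid layer, and the weights `a`, bias `b`, and upstream gradient `upstream` are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = sigmoid(rng.standard_normal(6))   # inputs from a previous sigmoid layer: all > 0
a = rng.standard_normal(6)            # hypothetical weights
b = 0.1                               # hypothetical bias

g = sigmoid(a @ x + b)
upstream = -1.7                       # hypothetical value of dL/dg

# Chain rule: dL/da_i = dL/dg * g'(a.x + b) * x_i
grad_a = upstream * g * (1.0 - g) * x

print(np.sign(grad_a))  # every entry has the sign of `upstream`
```

Because every component of the update has the same sign, a gradient step can only move all weights of the neuron up together or down together, which leads to zig-zagging updates.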

Implemented as follows:
import numpy as np
def sigmoid(x: np.ndarray) -> np.ndarray:
	# Element-wise logistic function; works for scalars and arrays.
	return 1 / (1 + np.exp(-x))
tanh

tanh (hyperbolic tangent)

One of the most widely used activation functions.
$g (x) = \frac{2}{1 + exp ( - 2 x )} - 1$

range: $(- 1, 1)$

anti-symmetric

zero-centered

Problems

Saturation “kills” the gradient: the gradient vanishes as the absolute value of the input grows.

The deeper the network, the more terms the chain rule multiplies together, so the layers closest to the input are the most likely to stop learning.
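The properties listed for tanh can be checked numerically. This sketch uses NumPy's built-in `np.tanh` and a locally defined `sigmoid` helper; it confirms that tanh is a shifted and rescaled sigmoid, that it is anti-symmetric, and that its derivative $1 - tanh^{2} (x)$ vanishes for large $|x|$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)

# tanh is a shifted and rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True

# Anti-symmetric: tanh(-x) = -tanh(x)
print(np.allclose(np.tanh(-x), -np.tanh(x)))                  # True

# Derivative 1 - tanh(x)^2: equals 1 at x = 0, close to 0 for large |x|
tanh_grad = 1.0 - np.tanh(x) ** 2
print(tanh_grad)
```

The peak derivative of 1 (versus 0.25 for sigmoid) is one reason tanh tends to pass gradients better near the origin, although it still saturates at the tails.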

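Implemented as follows:
import numpy as np
def tanh(x: np.ndarray) -> np.ndarray:
	# Same value as (e^x - e^-x) / (e^x + e^-x), but avoids inf / inf (nan) for large |x|.
	return 2 / (1 + np.exp(-2 * x)) - 1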

Juhyeon's Blog
