ReLU

ReLU (Rectified Linear Unit)

One of the most widely used activation functions.
$g(x) = \max(x, 0)$

Does not saturate (for $x > 0$)

Leads to fast convergence

Computationally efficient
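
As a rough illustration of the non-saturation property, the sketch below compares the derivative of ReLU with that of sigmoid for growing positive inputs. The helper names `relu_grad` and `sigmoid_grad` are made up for this note, and the derivative of ReLU at exactly 0 is set to 0 here by convention.

```python
import numpy as np

def relu_grad(x):
    # Derivative of max(x, 0): 1 for x > 0, 0 otherwise (0 at x == 0 by convention)
    return (x > 0).astype(float)

def sigmoid_grad(x):
    # Derivative of the logistic sigmoid: s(x) * (1 - s(x))
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

x = np.array([1.0, 5.0, 20.0])
relu_g = relu_grad(x)        # stays at 1 no matter how large x gets
sigmoid_g = sigmoid_grad(x)  # shrinks toward 0 as x grows (saturation)

print(relu_g)
print(sigmoid_g)
```

The ReLU gradient stays constant on the positive side, while the sigmoid gradient vanishes, which is one intuition for the faster convergence noted above.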

Problems

No learning for $x < 0$ → dead/dying ReLU

The downstream gradient is 0 whenever the input is 0 or below.

Often mitigated by initializing with a small positive bias ($b > 0$)

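Outputs are not zero-centered → introduces bias after the layer

As with sigmoid, the output and its local gradient are never negative, which biases the updates of the model weights.

Let $y = g(ax + b)$ with weights $a_i$ and inputs $x_i$. Since $\partial L / \partial a_i = (\partial L / \partial g) \, g'(ax + b) \, x_i$, with $g' \geq 0$ and $x_i \geq 0$ (e.g. outputs of a previous ReLU layer), $\operatorname{sgn}(\partial L / \partial a_i) = \operatorname{sgn}(\partial L / \partial g)$ whenever the gradient is nonzero.

All weight gradients of a neuron therefore share the same sign.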
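
The same-sign effect can be checked numerically. The following is a minimal sketch for a single neuron $y = g(ax + b)$ with strictly positive inputs; the function `weight_grad` and all the numbers are illustrative assumptions, not taken from any library.

```python
import numpy as np

def weight_grad(a, x, b, upstream):
    # dL/da_i = (dL/dg) * g'(a.x + b) * x_i, with g = ReLU
    z = a @ x + b
    g_prime = float(z > 0)
    return upstream * g_prime * x

a = np.array([0.5, -0.2, 0.1])   # weights (mixed signs)
x = np.array([1.0, 2.0, 0.5])    # inputs, all positive (e.g. previous ReLU outputs)
b = 0.1                          # pre-activation is 0.25 > 0, so the unit is active

grad_neg = weight_grad(a, x, b, upstream=-2.0)  # every component negative
grad_pos = weight_grad(a, x, b, upstream=3.0)   # every component positive

print(grad_neg, grad_pos)
```

Whatever the sign of the upstream gradient, every component of the weight gradient inherits it, so a single update can only move all weights of the neuron in the same direction.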

It can be implemented as follows.

ReLU

import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
	return np.maximum(0, x)  # element-wise max(x, 0)

Variants such as Leaky ReLU exist to address these drawbacks.
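
A minimal sketch of Leaky ReLU, assuming the usual definition with a small negative-side slope. The parameter name `alpha` and its value 0.01 are assumptions chosen for illustration; the point is that the gradient no longer drops to 0 for negative inputs.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Identity for x > 0, small linear slope alpha for x <= 0
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # Gradient is 1 on the positive side and alpha (not 0) elsewhere
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 1.0])
out = leaky_relu(x)
grad = leaky_relu_grad(x)

print(out)
print(grad)
```

Because the gradient for negative inputs is `alpha` rather than 0, a unit that ends up on the negative side can still receive updates and recover.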

How can we compute gradients of a non-differentiable function (like ReLU)?

→ Subgradient
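
For a convex function $f$, a value $s$ is a subgradient at $x$ if $f(y) \geq f(x) + s (y - x)$ for all $y$. For ReLU at $x = 0$, every $s \in [0, 1]$ satisfies this, and picking 0 is a common convention. The sketch below checks the inequality numerically on a grid; the helper names and the `at_zero` parameter are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_subgradient(x, at_zero=0.0):
    # 1 for x > 0, 0 for x < 0; at x == 0 any value in [0, 1] is valid
    return np.where(x > 0, 1.0, np.where(x < 0, 0.0, at_zero))

def is_subgradient_at_zero(s, ys):
    # Checks relu(y) >= relu(0) + s * (y - 0) on the sample points ys
    return bool(np.all(relu(ys) >= s * ys))

ys = np.linspace(-3.0, 3.0, 61)
valid = [is_subgradient_at_zero(s, ys) for s in (0.0, 0.5, 1.0)]
invalid = [is_subgradient_at_zero(s, ys) for s in (-0.5, 1.5)]
sub = relu_subgradient(np.array([-1.0, 0.0, 2.0]))

print(valid, invalid, sub)
```

Slopes inside $[0, 1]$ pass the check and slopes outside fail it, which is why backpropagation can simply fix one value from that interval at the kink.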

Juhyeon's Blog
