RR07. Backpropagation

Backpropagation

Backpropagation

Model training 단계에서 할 일은 결국 최적의(optimal)한 모델의 파라미터를 찾는 것. 현재 모델의 파라미터가 우리의 target과 얼마나 거리가 있는지 나타내는 지표인 Loss를 이용해서 모델 파라미터를 업데이트 하는 과정에서 Gradient를 계산하는 단계를 backpropagation이라 한다.

How to get a optimal parameters?

그렇다면 어떻게 optimal한 parameter를 얻을 수 있을까?
optimal 하다 → target 과 model의 prediction distribution이 비슷하다.
→ Loss 가 작다.
→ Loss landscape에서 global minima point가 우리가 찾고자 하는 optimal point!
Strategy 1 : Random Walk
bestloss = float("inf") # python assigns the highest possible float value.
for num in xrange(1000):
	W = np.random.randn(10, 3073) * 0.0001 # generate random parameter
	loss = L(X_train, Y_train, W) # get the loss over the entire training set.
	if loss < bestloss: # keep track the best solution
		bestloss = loss
		bestW = W
	print 'in attempt %d the loss was %f, best %f' % (num, loss, bestloss)
scores = Wbest.dot(Xte_cols)
Yte_predict = np.argmax(scores, axis=0) # find the index with max score in each cols
np.mean(Yte_predict == Yte)
해보니, 대략 15%라고 하는데, 이 테스트에 사용 된 것은 CIFAR10
따라서 random하게 찍은 경우는 target class가 10개이니, 10%
→ random 보다는 나은 수준이나, SOTA는 99.7%
Strategy 2: Follow the Slope(Gradient Descent)

→ loss landscape에서 경사를 따라 하강하자!
Gradient Descent(GD)

Gradient

Gradient

In 1-dim:
the derivative of function:
$\frac{df ( x )}{d x} = lim_{x \to \infty} \frac{f ( x + h ) - f ( x )}{h}$

In multiple-dim:
the gradient is a vector of (partial derivatives) along each dimension

Direction of gradient

Gradient의 방향 = 값이 가장 가파르게 증가하는 방향.
따라서 gradient-descent는 neg-gradient 방향으로 가야지.

Implementation

Numerical Calculation

optimization의 target인 $W$ 의 gradient를 직접 numerically 계산하는 것은 매우 힘들다.(gradient vector의 ndim 만큼 $d W_{i}$ 만큼씩 계산해야 하니, loop를 너무 많이 돌아야 함.)

approximate

slow

easy to write

Numerical Differentiation

Newton’s difference quotient에 따르면,
$\frac{\partial f ( x )}{\partial x} = lim_{h \to 0} \frac{f ( x + h ) - f ( x )}{h}$
이지만, 아래처럼 대칭으로 계산하는게 더 낫다고 한다. Why??
$\frac{\partial f ( x )}{\partial x} = lim_{h \to 0} \frac{f ( x + h ) - f ( x - h )}{2 h}$

How to choose $h$ ?

For $h = 0$ is undefined
Choose $h$ is trade-off:

Rounding error(finite precision)

Approximation error(wrong)

Elbow: $3 ϵ$ with $ϵ$ the machine precision
ex)

$ϵ = 6 \times 1 0^{- 8}$ . for single precision(32 bits)

$ϵ = 1 \times 1 0^{- 16}$ . for double precision(64 bits)

Analytic Calculation

$L = \frac{1}{L} \sum_{i = 1}^{N} L_{i} + \sum_{k} W_{k}^{2}$
where

Data loss: $\frac{1}{L} \sum_{i = 1}^{N} L_{i}$ ,

Regularization: $\sum_{k} W_{k}^{2}$
라고 할 때, target은 $\nabla_{W} L$

tips!) Loss는 $W$ 의 함수이다.

it is,

exact

fast

error-prone

Gradient check!

항상 gradient를 analytic하게 구할 줄 알아야 하지만, implementation을 numerical 하게 한다는 점을. 그리고 대응시켜본 다는 것을.

원본 링크

Derivative

Derivative

말 그대로 미분 값. 함수의 특정 point에서 독립 변수( $x$ )의 미소 변화에 따라 종속 변수( $f (x)$ )의 변화량.
“For any $y = f (x)$ , expressed as a multiplier $α$ to a tiny increment $Δ x$ to obtain the increments $Δ y$ to the output”
$Δ y = α Δ x$

Scalar function : output = scalar

Scalar function

$f : R^{n} \to R$

$n = 1$

scalar → scalar 이니, $α = \frac{d y}{d x}$ : scalar

$n > 1$

vector → scalar 이니, $α = [α_{1}, α_{2}, \dots, α_{n}]$ : vector
정리해보면, $Δ x$ : vector, $Δ y$ :scalar, $α$ :vector
$Δ y = α_{1} Δ x_{1} + α_{2} Δ x_{2} + \dots + α_{D} Δ x_{D}$
$Δ y = \frac{\partial y}{\partial x _{1}} Δ x_{1} + \frac{\partial y}{\partial x _{2}} Δ x_{2} + \dots + \frac{\partial y}{\partial x _{D}} Δ x_{D}$

$Δ y = \nabla_{x} y Δ x$
where,
$\nabla_{x} y = [\frac{\partial y}{\partial x _{1}}, \dots, \frac{\partial y}{\partial x _{D}}]$
여기서, gradient 정의:
$grdient := \nabla_{x}^{⊤} y$

원본 링크

Hessian

Hessian

Derivative가 일계도 미분 값이라면, Hessian은 이계도 미분.
“For any $y = f (x)$ , expressed as a multiplier $α$ to a tiny increment $Δ x$ to obtain the increments $Δ y$ to the output”
$\begin{bmatrix} \frac{\partial^2 f}{\partial {x_1}^2} & \frac{\partial y}{\partial x_1 \partial {x_2}} & \cdots & \frac{\partial^2 f}{\partial {x_1} \partial {x_n}} & \\ \frac{\partial^2 f}{\partial {x_2} \partial {x_1}} & \frac{\partial y}{\partial {x_2}^2} & \cdots & \frac{\partial^2 f}{\partial {x_2} \partial {x_n}} & \\ \cdots & \cdots & \cdots & \cdots \\ \frac{\partial^2 f}{\partial {x_n} \partial {x_1}} & \frac{\partial y}{\partial {x_n} \partial {x_2}} & \cdots & \frac{\partial^2 f}{\partial {x_n}^2} & \\ \end{bmatrix} \end{equation}$$$

원본 링크

Jacobian

Jacobian

Vector function에 대한 first derivative(matrix)
$\begin{equation} \begin{bmatrix} f_1(x_1, x_2, \cdots, x_n) \\ f_2(x_1, x_2, \cdots, x_n) \\ \cdots \\ f_n(x_1, x_2, \cdots, x_n) \\ \end{bmatrix} \end{equation}$$ $$J(x_1, x_2, \cdots, x_n) := \begin{equation} \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \cdots & \cdots & \cdots & \cdots \\ \frac{\partial f_n}{\partial x_1} & \frac{\partial f_n}{\partial x_2} & \cdots & \frac{\partial f_n}{\partial x_n} \\ \end{bmatrix} \end{equation}$$$

원본 링크
원본 링크