Stochastic Gradient Descent

Gradient Descent(GD) 의 발전된 버전.
기존 GD는 한 iteration에 모든 데이터 포인트가 다 쓰였는데(epoch = iteration) 이를 줄여서 1 epoch에 여러 번 iteration이 되게 함. 즉, mini-batch 처리.

Why we use instead of GD?

일반적으로, train-set은 너무 큰데,

이를 한 번에 다 메모리에 올리는 건 버겁고,

gradient 계산에 너무 오래 걸림.

또한, subtoptimal problem을 피할 확률도 올라감.

GD는 gradient 방향이 단방향인데, SGD는 batch마다 다른 방향의 gradient 확보.

Check

1 Iteration: gradient가 계산되고, 1번 업데이트.

1 Epoch: 전체 데이터가 1번 사용됨.

즉, mini-batch SGD를 사용한다면, 1 epoch 당 int(data_size // batch-size) + 1 번 iteration 이 일어남.

batch-size: 1 batch에 들어간 data point 수.

일반적으로 2의 배수 32, 64, 128, …

Vanilla Mini-batch GD (SGD)
여기는 같은데,
$w^{0} = w^{init}$
$w^{t + 1} = w^{t} - η \nabla_{w} L (w^{t})$
Loss를 구하는 과정이 조금 다름.

$L (W) = \frac{1}{N} \sum_{i = 1}^{N} L_{i} (x_{i}, y_{i}, W) + λ R (W)$
$\nabla_{W} L (W) = \frac{1}{N} \sum_{i = 1}^{N} \nabla_{W} L_{i} (x_{i}, y_{i}, W) + λ \nabla_{W} R (W)$
while True:
	data_batch = sample_training_data(data, 256) # sample 256 examples
	weights_grad = evalutae_gradient(loss_func, data_batch, weights)
	weights += - step_size * weights_grad # perform parameter update

Algorithm

Initialize $w^{0}$ , pick the lr $η$ and mini-batch size $∣ χ_{batch} ∣$

random mini-batch 잡기. $χ \supseteq {(x_{1}, y_{1}), \dots, (x_{B}, y_{B})}$ (with $B << N$ )

모든 mini-batch에 대해 $b / in 1, 2, \dots, B$ 에 대해, do:

Forward-pass $x_{b}$ and, get the $\overset{y}{^}_{b}$

Backward-pass, to get gradient $\nabla_{w} L_{b} (\overset{y}{^}_{b}, y_{b}, w^{t})$

Update gradients: $w^{t + 1} = w^{t} - η \frac{1}{B} \sum_{b} \nabla_{w} L_{b} (w^{t})$

Validation Error가 커지면, step2로 가져 again. otherwise stop.

Remark

일반적으로, gpu 자원량보다 전체 train-set이 크기 때문에 mini-batch 처리를 사용한다.(chunking)

일반적으로, batch-size는 gpu가 허용하는 한, 크게 잡는다. (Goyal et al., 2017)

작은 batch-size는 gradient의 variance를 키운다. (noisy update)

Batch들은 random shuffle하거나, partitioning한다.

Introduces stochasticity, batch gradients approximate the true gradient

SGD gradient → GD gradient (prove)

Convergence of SGD

Definition of the Convergence of Series

A series is the sum of therms of an infinite sequence of numbers $(a_{1}, a_{2}, ...)$ :
$s_{n} = \sum_{k = 1}^{n} a_{k}$ where $n \to \infty$

A series is convergent if there exists a number $s^{*}$ such that for every small positive number $ϵ$ , there exists an integer $N$ such that for all $n \geq N$ :
$∣ s_{n} - s^{*} ∣ < ϵ$

이걸 SGD update에 적용해보면,

Problems with SGD

Gradient Jittering

그림과 같이 gradient가 한 dimension으로만 커서, iteration이 진행되면서, gradient가 큰 쪽은 진동, 작은 쪽은 느리게 진행. 즉, ideal한 방향으로 SGD가 진행되지 않는다. → Momentum 등으로 해결.

Local minima or Saddle point

Gradient Descent(GD)와 마찬가지로, local minima를 극복할 수 없음.

Example

아래와 같은 함수(여기서는 $loss = f (W)$ 에 대응되겠지?)에서 saddle point?

저기 정 가운데 gradient 딱봐도 0인 곳.(계산해보기!)

Noisy Gradient

앞에서 언급한 장점이 단점으로도 작용할 수도 있다.
gradient를 계산하는데 모든 train-set이 한 번에 반영된 게 아니라, 한 번 gradient 계산 시에 batch 단위로 고려되다보니, “아주 운이 없게” batch에 좋지 못한 데이터들만 들어간다면, gradient가 완전 튀는 경우까지 발생할 수 있겠지.

Problems of SGD

Gradient scaled equally across all dimensions.

Requires conservative learning rate to avoid divergence

However, in this case the updates become very small → slow progress

Finding a good lr is difficult

Juhyeon's Blog

탐색기

Stochastic Gradient Descent(SGD)

SGD gradient → GD gradient (prove)

Convergence of SGD

Problems with SGD

Gradient Jittering

Local minima or Saddle point

Noisy Gradient

그래프 뷰

목차

Properties

백링크