AdaGrad?
In Stochastic Gradient Descent (SGD), if the gradient magnitude differs too much across dimensions, let's even it out per dimension!
→ AdaGrad (Adaptive Gradient)
Code
import numpy as np

epsilon = 1e-7
grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx                                       # accumulate squared gradients per dimension
    x -= learning_rate * dx / (np.sqrt(grad_squared) + epsilon)   # element-wise scaled step
- added element-wise scaling of the gradient based on the historical sum of squared gradients in each dimension (see the toy run below)
- “per-parameter learning rates” or “adaptive lr”
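A quick sanity check of that per-parameter scaling, a minimal sketch I added (the toy loss and compute_gradient are hypothetical, not from the lecture): on a quadratic whose gradient is 100× larger in one dimension, the AdaGrad loop moves both coordinates at nearly the same rate.

import numpy as np

# toy quadratic loss: L(x) = 0.5 * (100 * x[0]**2 + x[1]**2)
def compute_gradient(x):
    return np.array([100.0 * x[0], 1.0 * x[1]])

x = np.array([1.0, 1.0])
learning_rate = 0.1
epsilon = 1e-7
grad_squared = np.zeros_like(x)

for _ in range(100):                      # finite loop instead of `while True`
    dx = compute_gradient(x)
    grad_squared += dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + epsilon)

print(x)  # the two coordinates follow nearly the same trajectory despite the 100x gradient gap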
Example
Q1: What happens with AdaGrad?
- The “steep” directions get damped.
- The “flat” directions get accelerated (relative to plain SGD).
- Look at the square-root term in the denominator! (numeric sketch below)
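A one-step numeric sketch of that denominator effect (the numbers are mine, chosen for illustration): with the same learning rate, the steep dimension's step shrinks from 1.0 to about 0.01, while the flat dimension's step grows from 0.0001 to about 0.01.

import numpy as np

learning_rate = 1e-2
epsilon = 1e-7
dx = np.array([100.0, 0.01])                  # one steep and one flat dimension
grad_squared = dx * dx                        # accumulated history (one step, for illustration)

sgd_step = learning_rate * dx                                      # plain SGD step: [1.0, 0.0001]
ada_step = learning_rate * dx / (np.sqrt(grad_squared) + epsilon)  # AdaGrad step:  ~[0.01, 0.01]
print(sgd_step, ada_step)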
Q2: What happens to the step size over a long time?
- It decays toward 0: grad_squared only ever grows, so the effective learning rate keeps shrinking.
→ RMSProp fixes this problem (sketch below).
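For comparison, a minimal sketch of the RMSProp update in the same pseudocode style as the AdaGrad loop above (compute_gradient, x, learning_rate are the same placeholders): the raw sum is replaced with a leaky running average controlled by decay_rate (commonly around 0.9-0.99), so old squared gradients are gradually forgotten and the step size no longer shrinks toward 0 just because training runs long.

import numpy as np

epsilon = 1e-7
decay_rate = 0.99
grad_squared = 0
while True:
    dx = compute_gradient(x)
    # leaky accumulation: old squared gradients decay instead of piling up forever
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + epsilon)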
