03_Hung-yi Lee_What to do if my network fails to train
Carpe Tu Black Whistle

Local minima & Saddle point

Points with zero gradient are called critical points.

Local minima

A local minimum: the gradient is zero, and all nearby points have larger loss.

Saddle point

A saddle point: the gradient is zero, but among the nearby points some have larger loss and some have smaller loss.

Taylor Series Approximation

$L(\theta)$ around $\theta = \theta'$ can be approximated as

$$L(\theta) \approx L(\theta') + (\theta - \theta')^T g + \frac{1}{2}(\theta - \theta')^T H (\theta - \theta')$$

Gradient $g$ is a vector: $g = \nabla L(\theta')$, with $g_i = \dfrac{\partial L(\theta')}{\partial \theta_i}$

Hessian $H$ is a matrix: $H_{ij} = \dfrac{\partial^2}{\partial \theta_i \partial \theta_j} L(\theta')$

Hessian


At a critical point, $g = 0$, so the approximation becomes

$$L(\theta) \approx L(\theta') + \frac{1}{2}(\theta - \theta')^T H (\theta - \theta')$$

Writing $v = \theta - \theta'$:

  • For all $v$: $v^T H v > 0$ → local minima ⇔ $H$ is positive definite ⇔ all eigenvalues are positive.
  • For all $v$: $v^T H v < 0$ → local maxima ⇔ $H$ is negative definite ⇔ all eigenvalues are negative.
  • Sometimes $v^T H v > 0$, sometimes $v^T H v < 0$ → saddle point ⇔ some eigenvalues are positive and some are negative.
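As a small illustration, here is a minimal numpy sketch of this eigenvalue test (the function name and tolerance are ours, not from the lecture):

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point (g = 0) from the eigenvalues of its Hessian."""
    eigvals = np.linalg.eigvalsh(H)  # H is symmetric, so eigvalsh is appropriate
    if np.all(eigvals > tol):
        return "local minimum"       # H positive definite
    if np.all(eigvals < -tol):
        return "local maximum"       # H negative definite
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"        # mixed signs
    return "degenerate (near-zero eigenvalues; higher-order terms needed)"

# One positive and one negative eigenvalue -> saddle point
H = np.array([[2.0, 0.0],
              [0.0, -1.0]])
print(classify_critical_point(H))  # saddle point
```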

Don’t be afraid of saddle points

At a saddle point, $H$ tells us the update direction: pick an eigenvector $u$ of $H$ with eigenvalue $\lambda < 0$. For $\theta = \theta' + u$, we get $u^T H u = \lambda \lVert u \rVert^2 < 0$, so $L(\theta) < L(\theta')$: moving along $u$ decreases the loss.
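A matching sketch for computing that escape direction (again illustrative, and assuming the Hessian is available explicitly):

```python
import numpy as np

def escape_direction(H, tol=1e-8):
    """Return an eigenvector of H with a negative eigenvalue; moving the
    parameters along it decreases the locally approximated loss."""
    eigvals, eigvecs = np.linalg.eigh(H)   # eigenvalues in ascending order
    if eigvals[0] >= -tol:
        return None                        # no negative-curvature direction
    return eigvecs[:, 0]                   # direction of most negative eigenvalue

H = np.array([[2.0, 0.0],
              [0.0, -1.0]])
print(escape_direction(H))  # [0., 1.] -> step theta along ±u to leave the saddle
```

In practice this is rarely done, since computing and diagonalizing the full Hessian is expensive.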

Saddle Point vs. Local Minima

Our models usually have very high dimensionality, so a genuine local minimum is rarely encountered.

Training does, however, often stall on a plateau.


Batch & Momentum


1 epoch = see all the batches once. Shuffle after each epoch.

Small Batch vs. Large Batch

Consider 20 examples (N = 20):

  • Batch size = N (full batch): update after seeing all 20 examples, i.e. 1 update per epoch.
  • Batch size = 1: update after each example, i.e. 20 updates per epoch.
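To make the bookkeeping concrete, here is a toy sketch of the epoch/update accounting (the data and function are placeholders; the gradient step itself is elided):

```python
import numpy as np

N = 20
data = np.arange(N)  # toy stand-in for 20 training examples

def train(batch_size, epochs=1, seed=0):
    rng = np.random.default_rng(seed)
    updates = 0
    for _ in range(epochs):
        order = rng.permutation(N)          # shuffle after each epoch
        for start in range(0, N, batch_size):
            batch = data[order[start:start + batch_size]]
            # ... compute the gradient on `batch`, update the parameters ...
            updates += 1
    return updates

print(train(batch_size=N))  # full batch: 1 update per epoch
print(train(batch_size=1))  # batch size 1: 20 updates per epoch
```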

  • A larger batch size does not require longer time to compute the gradient, thanks to GPU parallel computation (unless the batch size is too large).


  • A smaller batch size requires longer time for one epoch (longer time to see all the data once).


A smaller batch size gives better performance.
What’s wrong with a large batch size? Optimization fails.
“Noisy” updates are better for training.

Small batches help updates avoid getting stuck at critical points: each batch defines a slightly different loss, so a critical point of one batch’s loss is usually not a critical point of the next batch’s.

Overfitting Problem
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima (https://arxiv.org/abs/1609.04836)


The paper’s explanation: the test-set distribution differs somewhat from the training-set distribution, so at a sharp minimum the test loss can be large; small batches are less likely to be trapped in sharp minima.


Batch size is a hyperparameter you have to decide.

References on large batch-size training

Momentum

Gradient Descent + Momentum

An optimizer design that imitates a physical scenario: like a ball rolling down the error surface, the update keeps some inertia.

Movement: the movement of the last step minus the gradient at the present step.
Movement is based not only on the current gradient, but also on the previous movement.


Starting from $m^0 = 0$, the movement is

$$m^t = \lambda m^{t-1} - \eta g^{t-1}, \qquad \theta^t = \theta^{t-1} + m^t$$

so $m^t$ is a weighted sum of all the previous gradients $g^0, g^1, \ldots, g^{t-1}$:

$$m^1 = -\eta g^0, \qquad m^2 = -\lambda\eta g^0 - \eta g^1, \qquad \ldots$$
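A minimal sketch of this update rule (hyperparameter values are illustrative):

```python
import numpy as np

def gd_momentum(grad, theta0, lr=0.01, lam=0.9, steps=200):
    """Gradient descent with momentum:
    m^t = lam * m^{t-1} - lr * g^{t-1};  theta^t = theta^{t-1} + m^t."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)        # m^0 = 0
    for _ in range(steps):
        g = grad(theta)             # gradient at the present position
        m = lam * m - lr * g        # inertia from the previous movement
        theta = theta + m
    return theta

# Example: L(theta) = theta^2, so grad(theta) = 2 * theta
print(gd_momentum(lambda th: 2 * th, theta0=[3.0]))  # converges toward 0
```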

Concluding Remarks

  • Critical points have zero gradients.
  • Critical points can be either saddle points or local minima.
    • Can be determined by the Hessian matrix.
    • It is possible to escape saddle points along the direction of eigenvectors of the Hessian matrix.
    • Local minima may be rare.
  • Smaller batch size and momentum help escape critical points.

Error surface is rugged

Tips for training: Adaptive Learning Rate

Training stuck

In practice, critical points are often not the real problem; the villain is usually something else.

Training stuck ≠ small gradient

  • People believe training is stuck because the parameters are around a critical point…


Training can be difficult even without critical points


Adaptive Learning Rate

Formulation of (vanilla) gradient descent for one parameter $\theta_i$:

$$\theta_i^{t+1} \leftarrow \theta_i^t - \eta g_i^t, \qquad g_i^t = \left.\frac{\partial L}{\partial \theta_i}\right|_{\theta=\theta^t}$$

Different parameters need different learning rates, so $\eta$ is replaced by a parameter-dependent term:

$$\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t} g_i^t$$

Adagrad

$$\sigma_i^t = \sqrt{\frac{1}{t+1}\sum_{k=0}^{t}\left(g_i^k\right)^2}$$

The learning rate along one direction cannot change dynamically: $\sigma$ weighs all past gradients equally, so Adagrad adapts across parameters but reacts slowly when the gradient along a single parameter changes over time.
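A sketch of Adagrad using the lecture’s $\sigma$ (root mean square of all past gradients); `eps` and the other constants are illustrative:

```python
import numpy as np

def adagrad(grad, theta0, lr=0.1, steps=500, eps=1e-8):
    """Per-parameter learning rate lr / sigma, where sigma is the root
    mean square of all past gradients of that parameter."""
    theta = np.asarray(theta0, dtype=float)
    sum_sq = np.zeros_like(theta)          # running sum of squared gradients
    for t in range(steps):
        g = grad(theta)
        sum_sq += g ** 2
        sigma = np.sqrt(sum_sq / (t + 1))  # RMS of g^0 ... g^t
        theta -= lr / (sigma + eps) * g
    return theta

print(adagrad(lambda th: 2 * th, theta0=[3.0]))  # approaches the minimum at 0
```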

RMSProp

The original paper is hard to find: Hinton introduced RMSProp in his Coursera course on deep learning.

$$\sigma_i^t = \sqrt{\alpha\left(\sigma_i^{t-1}\right)^2 + (1-\alpha)\left(g_i^t\right)^2}, \qquad 0 < \alpha < 1$$
The recent gradient has larger influence, and the past gradients have less influence.

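A sketch of RMSProp under the same conventions ($\alpha$ and the other constants are illustrative):

```python
import numpy as np

def rmsprop(grad, theta0, lr=0.01, alpha=0.9, steps=500, eps=1e-8):
    """sigma^2 is an exponential moving average of squared gradients, so
    recent gradients dominate the step size and old ones fade out."""
    theta = np.asarray(theta0, dtype=float)
    sigma_sq = None
    for _ in range(steps):
        g = grad(theta)
        sigma_sq = g ** 2 if sigma_sq is None else alpha * sigma_sq + (1 - alpha) * g ** 2
        theta -= lr / (np.sqrt(sigma_sq) + eps) * g
    return theta

print(rmsprop(lambda th: 2 * th, theta0=[3.0]))  # settles near the minimum at 0
```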

Adam: RMSProp + Momentum

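Adam combines the momentum term $m$ (first moment) with the RMSProp-style second moment $v$, plus bias correction; see Kingma & Ba, https://arxiv.org/abs/1412.6980. A minimal sketch (the β defaults follow the paper; the `lr` in the example call is illustrative):

```python
import numpy as np

def adam(grad, theta0, lr=0.001, beta1=0.9, beta2=0.999, steps=200, eps=1e-8):
    """Adam = momentum (m) + RMSProp (v), with bias correction for the
    zero-initialized moving averages."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g       # momentum-style moving average
        v = beta2 * v + (1 - beta2) * g ** 2  # RMSProp-style moving average
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

print(adam(lambda th: 2 * th, theta0=[3.0], lr=0.1))  # converges near 0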

Learning Rate Scheduling

Learning Rate Decay

As training goes on, we get closer to the destination, so we reduce the learning rate.

Warm Up


Warm up appears in:

  1. Residual Networks https://arxiv.org/abs/1512.03385
  2. Transformer https://arxiv.org/abs/1706.03762


Please refer to RAdam https://arxiv.org/abs/1908.03265
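A toy sketch combining both ideas (the linear warm-up and exponential-decay shapes, and all constants, are illustrative; the Transformer uses its own formula):

```python
def lr_schedule(step, base_lr=1e-3, warmup_steps=1000, decay=0.999):
    """Warm up: grow the learning rate linearly at first, then decay it
    as training approaches the destination."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps   # linear warm up
    return base_lr * decay ** (step - warmup_steps)  # exponential decay

for s in (0, 500, 1000, 5000):
    print(s, lr_schedule(s))  # rises, peaks at base_lr, then shrinks
```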

Summary of Optimization

(Vanilla) Gradient Descent:

$$\theta_i^{t+1} \leftarrow \theta_i^t - \eta g_i^t$$

Various Improvements:

$$\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta^t}{\sigma_i^t} m_i^t$$

  • $m_i^t$: momentum, a weighted sum of the previous gradients (direction matters).
  • $\sigma_i^t$: root mean square of the gradients (magnitude only).
  • $\eta^t$: learning rate scheduling.