Local minima & Saddle point
Points where the gradient is zero are called critical points.
Local minima
Local minimum: the gradient is 0, and every nearby point has a larger loss than the minimum.
Saddle point
Saddle point: the gradient is 0, but among the nearby points some have a larger loss and some have a smaller loss than this point.
Taylor Series Approximation
Around a point $\theta'$, the loss can be approximated by the Taylor series

$$L(\theta) \approx L(\theta') + (\theta - \theta')^{T} g + \frac{1}{2}(\theta - \theta')^{T} H (\theta - \theta')$$

The gradient $g$ is a vector: $g = \nabla L(\theta')$, with $g_i = \frac{\partial L(\theta')}{\partial \theta_i}$.
The Hessian $H$ is a matrix of second derivatives: $H_{ij} = \frac{\partial^2}{\partial \theta_i \partial \theta_j} L(\theta')$.
At a critical point the gradient term vanishes ($g = 0$), so with $v = \theta - \theta'$:
- For all $v$: $v^{T} H v > 0$ → local minimum → $H$ is positive definite → all eigenvalues are positive.
- For all $v$: $v^{T} H v < 0$ → local maximum → $H$ is negative definite → all eigenvalues are negative.
- Sometimes $v^{T} H v > 0$ and sometimes $v^{T} H v < 0$ → saddle point → some eigenvalues are positive, and some are negative.
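For concreteness, here is a small NumPy sketch (my own illustration, not from the lecture) that classifies a critical point by the signs of the Hessian's eigenvalues, using the toy loss $L(x, y) = x^2 - y^2$, whose critical point at the origin is a saddle point:

```python
import numpy as np

# Toy loss L(x, y) = x^2 - y^2 (hypothetical example); (0, 0) is a critical point.
# Its Hessian is constant: [[2, 0], [0, -2]].
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(H)   # eigen-decomposition of the symmetric Hessian

if np.all(eigenvalues > 0):
    kind = "local minimum (H is positive definite)"
elif np.all(eigenvalues < 0):
    kind = "local maximum (H is negative definite)"
else:
    kind = "saddle point (eigenvalues of mixed sign)"

print(eigenvalues, "->", kind)
# The eigenvector paired with a negative eigenvalue is a direction along which
# the loss decreases, i.e., a direction for escaping the saddle point.
```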
Don’t be afraid of saddle points: one can escape along the eigenvector directions of $H$ whose eigenvalues are negative.
Saddle Point vs. Local Minima
In practice our models have a very high-dimensional parameter space, so genuine local minima are rarely encountered,
but training can still stall on a plateau.
Batch & Momentum
1 epoch = see all the batches once
Small Batch vs. Large Batch
Consider 20 examples (N = 20).
- Batch size = N (full batch): update once after seeing all 20 examples.
- Batch size = 1: update after each example, i.e., 20 updates in an epoch.
- A larger batch size does not require a longer time to compute the gradient for one update, thanks to parallel computation on the GPU (unless the batch size is too large).
- A smaller batch requires a longer time for one epoch (a longer time to see all the data once); the sketch below makes the updates-per-epoch difference concrete.
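To make the trade-off concrete, here is a minimal PyTorch sketch (my own illustration; the data and model are toy placeholders) that counts how many updates one epoch performs for the two batch sizes in the example above:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# 20 toy examples (N = 20), matching the example above.
X, y = torch.randn(20, 3), torch.randn(20, 1)
dataset = TensorDataset(X, y)

model = torch.nn.Linear(3, 1)
loss_fn = torch.nn.MSELoss()

for batch_size in (20, 1):                       # full batch vs. batch size 1
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    updates = 0
    for xb, yb in loader:                        # one epoch = see all the batches once
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()                         # one (noisier, for small batches) update
        updates += 1
    print(f"batch_size={batch_size}: {updates} updates in one epoch")
# batch_size=20 -> 1 update per epoch; batch_size=1 -> 20 updates per epoch.
```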
Nevertheless, a smaller batch size tends to give better performance.
What’s wrong with a large batch size? Optimization fails.
“Noisy” updates are better for training.
A small batch helps the updates avoid getting stuck at critical points.
Overfitting Problem
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
The paper’s explanation: the distribution of the test set can differ from that of the training set, so the test loss at a sharp minimum can be large, and a small batch is less likely to get trapped in a sharp minimum.
Batch size is a hyperparameter you have to decide.
Reference of Large Batch-size Training
- Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (https://arxiv.org/abs/1904.00962)
- Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes (http://arxiv.org/abs/1711.04325)
- Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well (https://arxiv.org/abs/2001.02312)
- Large Batch Training of Convolutional Networks (http://arxiv.org/abs/1708.03888)
- Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (https://arxiv.org/abs/1706.02677)
Momentum
Gradient Descent + Momentum
An optimizer design that simulates a physical scenario: the parameters move like a ball rolling down the error surface, carrying momentum.
Movement: the movement of the last step minus the gradient at present, i.e. $m^{t+1} = \lambda m^{t} - \eta g^{t}$ and $\theta^{t+1} = \theta^{t} + m^{t+1}$.
The movement is based not just on the current gradient, but also on the previous movement, which accumulates all past gradients.
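A minimal sketch of gradient descent with momentum on a toy quadratic loss (the loss, learning rate, and momentum coefficient are illustrative choices, not values from the lecture):

```python
def grad(theta):
    return 2 * theta          # gradient of the toy loss L(theta) = theta^2

theta = 5.0                   # initial parameter
m = 0.0                       # movement starts at zero
eta, lam = 0.1, 0.9           # learning rate and momentum coefficient (assumed values)

for step in range(100):
    g = grad(theta)
    m = lam * m - eta * g     # movement = previous movement minus current gradient
    theta = theta + m         # the update uses the accumulated movement, not just -eta * g

print(theta)                  # close to the minimum at theta = 0
```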
Concluding Remarks
- Critical points have zero gradients.
- Critical points can be either saddle points or local minima.
- Whether a critical point is a saddle point or a local minimum can be determined by the Hessian matrix.
- It is possible to escape saddle points along the direction of eigenvectors of the Hessian matrix.
- Local minima may be rare.
- Smaller batch size and momentum help escape critical points.
Error surface is rugged
Tips for training: Adaptive Learning Rate
Training stuck
In typical experiments, critical points are usually not the real obstacle; the real “boss” is usually something else.
- People believe training gets stuck because the parameters are around a critical point…
Training can be difficult even without critical points, e.g., when the error surface is much steeper in some directions than in others.
Adaptive Learning Rate
Formulation for one parameter: each parameter $\theta_i$ gets its own learning rate,
$$\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta}{\sigma_i^{t}}\, g_i^{t}, \qquad g_i^{t} = \left.\frac{\partial L}{\partial \theta_i}\right|_{\theta = \theta^{t}},$$
where $\sigma_i^{t}$ depends on the parameter (and possibly on the iteration $t$).
Adagrad
Limitation: the learning rate along a given direction cannot change dynamically during training, because $\sigma$ weights the entire gradient history equally; see the sketch below.
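A sketch of an Adagrad-style update for a single parameter, assuming the root-mean-square formulation used in these notes; the toy loss and learning rate are my own choices:

```python
import numpy as np

def grad(theta):
    return 2 * theta                          # gradient of the toy loss L(theta) = theta^2

theta, eta = 5.0, 1.0                         # initial parameter and base learning rate (assumed)
squared_grads = []                            # Adagrad keeps the entire gradient history

for t in range(100):
    g = grad(theta)
    squared_grads.append(g ** 2)
    sigma = np.sqrt(np.mean(squared_grads))   # root mean square of ALL past gradients
    theta -= eta / sigma * g                  # effective learning rate eta / sigma

print(theta)                                  # theta approaches the minimum at 0
# Because sigma averages the whole history equally, the effective learning rate
# along this direction cannot adapt quickly when the local steepness changes.
```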
RMSProp
There is no paper to cite for RMSProp; Hinton introduced it in his Deep Learning course on Coursera.
The recent gradient has larger influence, and the past gradients have less influence.
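A sketch of the RMSProp accumulator for one parameter: the decay factor α (an illustrative value here, along with the toy loss and learning rate) makes the most recent gradients dominate σ:

```python
import numpy as np

def grad(theta):
    return 2 * theta                       # gradient of the toy loss L(theta) = theta^2

theta, eta, alpha = 5.0, 0.1, 0.9          # alpha is the decay factor (assumed value)
sigma = None

for t in range(100):
    g = grad(theta)
    if sigma is None:
        sigma = abs(g)                     # sigma^0 = |g^0|
    else:
        # Exponential moving average: the current gradient gets weight (1 - alpha),
        # older gradients decay geometrically.
        sigma = np.sqrt(alpha * sigma ** 2 + (1 - alpha) * g ** 2)
    theta -= eta / sigma * g

print(theta)                               # theta ends up close to the minimum at 0
```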
Adam: RMSProp + Momentum
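In practice Adam is usually taken off the shelf, e.g. via torch.optim.Adam; a minimal usage sketch (the model, data, and hyperparameters below are placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)                      # placeholder model
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,               # base learning rate
                             betas=(0.9, 0.999))    # decay rates for the momentum term (m)
                                                    # and the squared-gradient term (sigma)

x, y = torch.randn(32, 10), torch.randn(32, 1)      # dummy batch
loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()                                    # one Adam update
```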
Learning Rate Scheduling
Learning Rate Decay
As training goes on, we get closer to the destination, so we reduce the learning rate.
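One simple decay rule (an illustrative choice, not the lecture's exact formula) divides the initial learning rate by a growing function of the step t:

```python
# Illustrative decay schedule: eta_t = eta_0 / sqrt(1 + t)
eta_0 = 0.1
for t in range(5):
    eta_t = eta_0 / (1 + t) ** 0.5
    print(t, round(eta_t, 4))    # 0.1, 0.0707, 0.0577, 0.05, 0.0447
```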
Warm Up
Warm up appears in:
- Residual Networks https://arxiv.org/abs/1512.03385
- Transformer https://arxiv.org/abs/1706.03762
Please refer to RAdam https://arxiv.org/abs/1908.03265
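A sketch of the warm-up schedule from the Transformer paper referenced above: the learning rate grows for warmup_steps steps and then decays roughly like 1/sqrt(step) (d_model = 512 and warmup_steps = 4000 are the paper's example values):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warm-up followed by ~1/sqrt(step) decay (Transformer-style schedule)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 10000):
    print(s, transformer_lr(s))   # rises until step 4000, then decays
```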
Summary of Optimization
(Vanilla) Gradient Descent: $\theta_i^{t+1} \leftarrow \theta_i^{t} - \eta\, g_i^{t}$
Various Improvements: $\theta_i^{t+1} \leftarrow \theta_i^{t} - \dfrac{\eta^{t}}{\sigma_i^{t}}\, m_i^{t}$
- $m_i^{t}$: momentum, a weighted sum of the previous gradients (direction matters)
- $\sigma_i^{t}$: root mean square of the gradients (magnitude only)
- $\eta^{t}$: learning rate scheduling (decay, warm up)
- Post title: 03_Hung-yi Lee_What to do if my network fails to train1
- Create time: 2022-03-27 19:15:06
- Post link: Machine-Learning/03-hung-yi-lee-what-to-do-if-my-network-fails-to-train1/
- Copyright notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.