Local minima & Saddle point
Points where the gradient is zero are called critical points.
Local minima
Local minimum: the gradient is 0, and every nearby point has a larger loss than the minimum.
Saddle point
Saddle point: the gradient is 0, but among the nearby points some have a larger loss and some have a smaller loss than this point.
Taylor Series Approximation
Around a point $\theta'$, the loss can be approximated by the Taylor series

$$L(\theta) \approx L(\theta') + (\theta - \theta')^{T} g + \frac{1}{2}(\theta - \theta')^{T} H (\theta - \theta')$$

The gradient $g$ is a vector: $g = \nabla L(\theta')$, with $g_i = \frac{\partial L(\theta')}{\partial \theta_i}$.
The Hessian $H$ is a matrix of second derivatives: $H_{ij} = \frac{\partial^2}{\partial \theta_i \partial \theta_j} L(\theta')$.
At a critical point the gradient term vanishes ($g = 0$), so with $v = \theta - \theta'$:
- For all $v$: $v^{T} H v > 0$ → local minimum → $H$ is positive definite → all eigenvalues are positive.
- For all $v$: $v^{T} H v < 0$ → local maximum → $H$ is negative definite → all eigenvalues are negative.
- Sometimes $v^{T} H v > 0$ and sometimes $v^{T} H v < 0$ → saddle point → some eigenvalues are positive, and some are negative.
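For concreteness, here is a small NumPy sketch (my own illustration, not from the lecture) that classifies a critical point by the signs of the Hessian's eigenvalues, using the toy loss $L(x, y) = x^2 - y^2$, whose critical point at the origin is a saddle point:

```python
import numpy as np

# Toy loss L(x, y) = x^2 - y^2 (hypothetical example); (0, 0) is a critical point.
# Its Hessian is constant: [[2, 0], [0, -2]].
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(H)   # eigen-decomposition of the symmetric Hessian

if np.all(eigenvalues > 0):
    kind = "local minimum (H is positive definite)"
elif np.all(eigenvalues < 0):
    kind = "local maximum (H is negative definite)"
else:
    kind = "saddle point (eigenvalues of mixed sign)"

print(eigenvalues, "->", kind)
# The eigenvector paired with a negative eigenvalue is a direction along which
# the loss decreases, i.e., a direction for escaping the saddle point.
```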
Don’t be afraid of saddle points: one can escape along the eigenvector directions of $H$ whose eigenvalues are negative.
Saddle Point vs. Local Minima
In practice our models have a very high-dimensional parameter space, so genuine local minima are rarely encountered,
but training can still stall on a plateau.
Batch & Momentum
1 epoch = see all the batches once
Small Batch vs. Large Batch
Consider 20 examples (N = 20).
- Batch size = N (full batch): update once after seeing all 20 examples.
- Batch size = 1: update after each example, i.e., 20 updates in an epoch.
- A larger batch size does not require a longer time to compute the gradient for one update, thanks to parallel computation on the GPU (unless the batch size is too large).
- A smaller batch requires a longer time for one epoch (a longer time to see all the data once); the sketch below makes the updates-per-epoch difference concrete.
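To make the trade-off concrete, here is a minimal PyTorch sketch (my own illustration; the data and model are toy placeholders) that counts how many updates one epoch performs for the two batch sizes in the example above:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# 20 toy examples (N = 20), matching the example above.
X, y = torch.randn(20, 3), torch.randn(20, 1)
dataset = TensorDataset(X, y)

model = torch.nn.Linear(3, 1)
loss_fn = torch.nn.MSELoss()

for batch_size in (20, 1):                       # full batch vs. batch size 1
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    updates = 0
    for xb, yb in loader:                        # one epoch = see all the batches once
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()                         # one (noisier, for small batches) update
        updates += 1
    print(f"batch_size={batch_size}: {updates} updates in one epoch")
# batch_size=20 -> 1 update per epoch; batch_size=1 -> 20 updates per epoch.
```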
Nevertheless, a smaller batch size tends to give better performance.
What’s wrong with a large batch size? Optimization fails.
“Noisy” updates are better for training.
A small batch helps the updates avoid getting stuck at critical points.
Overfitting Problem
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
The paper’s explanation: the distribution of the test set can differ from that of the training set, so the test loss at a sharp minimum can be large, and a small batch is less likely to get trapped in a sharp minimum.
Batch size is a hyperparameter you have to decide.
Reference of Large Batch-size Training
- Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (https://arxiv.org/abs/1904.00962)
- Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes (http://arxiv.org/abs/1711.04325)
- Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well (https://arxiv.org/abs/2001.02312)
- Large Batch Training of Convolutional Networks (http://arxiv.org/abs/1708.03888)
- Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (https://arxiv.org/abs/1706.02677)
Momentum
Gradient Descent + Momentum
An optimizer design that simulates a physical scenario: the parameters move like a ball rolling down the error surface, carrying momentum.
Movement: the movement of the last step minus the gradient at present, i.e. $m^{t+1} = \lambda m^{t} - \eta g^{t}$ and $\theta^{t+1} = \theta^{t} + m^{t+1}$.
The movement is based not just on the current gradient, but also on the previous movement, which accumulates all past gradients.
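A minimal sketch of gradient descent with momentum on a toy quadratic loss (the loss, learning rate, and momentum coefficient are illustrative choices, not values from the lecture):

```python
def grad(theta):
    return 2 * theta          # gradient of the toy loss L(theta) = theta^2

theta = 5.0                   # initial parameter
m = 0.0                       # movement starts at zero
eta, lam = 0.1, 0.9           # learning rate and momentum coefficient (assumed values)

for step in range(100):
    g = grad(theta)
    m = lam * m - eta * g     # movement = previous movement minus current gradient
    theta = theta + m         # the update uses the accumulated movement, not just -eta * g

print(theta)                  # close to the minimum at theta = 0
```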
Concluding Remarks
- Critical points have zero gradients.
- Critical points can be either saddle points or local minima.
- Whether a critical point is a saddle point or a local minimum can be determined by the Hessian matrix.
- It is possible to escape saddle points along the direction of eigenvectors of the Hessian matrix.
- Local minima may be rare.
- Smaller batch size and momentum help escape critical points.
Error surface is rugged
Tips for training: Adaptive Learning Rate
Training stuck
In typical experiments, critical points are usually not the real obstacle; the real “boss” is usually something else.
- People believe training gets stuck because the parameters are around a critical point…
Training can be difficult even without critical points, e.g., when the error surface is much steeper in some directions than in others.
Adaptive Learning Rate
Formulation for one parameter: each parameter $\theta_i$ gets its own learning rate,
$$\theta_i^{t+1} \leftarrow \theta_i^{t} - \frac{\eta}{\sigma_i^{t}}\, g_i^{t}, \qquad g_i^{t} = \left.\frac{\partial L}{\partial \theta_i}\right|_{\theta = \theta^{t}},$$
where $\sigma_i^{t}$ depends on the parameter (and possibly on the iteration $t$).
Adagrad
Limitation: the learning rate along a given direction cannot change dynamically during training, because $\sigma$ weights the entire gradient history equally; see the sketch below.
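A sketch of an Adagrad-style update for a single parameter, assuming the root-mean-square formulation used in these notes; the toy loss and learning rate are my own choices:

```python
import numpy as np

def grad(theta):
    return 2 * theta                          # gradient of the toy loss L(theta) = theta^2

theta, eta = 5.0, 1.0                         # initial parameter and base learning rate (assumed)
squared_grads = []                            # Adagrad keeps the entire gradient history

for t in range(100):
    g = grad(theta)
    squared_grads.append(g ** 2)
    sigma = np.sqrt(np.mean(squared_grads))   # root mean square of ALL past gradients
    theta -= eta / sigma * g                  # effective learning rate eta / sigma

print(theta)                                  # theta approaches the minimum at 0
# Because sigma averages the whole history equally, the effective learning rate
# along this direction cannot adapt quickly when the local steepness changes.
```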
RMSProp
There is no paper to cite for RMSProp; Hinton introduced it in his Deep Learning course on Coursera.
The recent gradient has larger influence, and the past gradients have less influence.
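A sketch of the RMSProp accumulator for one parameter: the decay factor α (an illustrative value here, along with the toy loss and learning rate) makes the most recent gradients dominate σ:

```python
import numpy as np

def grad(theta):
    return 2 * theta                       # gradient of the toy loss L(theta) = theta^2

theta, eta, alpha = 5.0, 0.1, 0.9          # alpha is the decay factor (assumed value)
sigma = None

for t in range(100):
    g = grad(theta)
    if sigma is None:
        sigma = abs(g)                     # sigma^0 = |g^0|
    else:
        # Exponential moving average: the current gradient gets weight (1 - alpha),
        # older gradients decay geometrically.
        sigma = np.sqrt(alpha * sigma ** 2 + (1 - alpha) * g ** 2)
    theta -= eta / sigma * g

print(theta)                               # theta ends up close to the minimum at 0
```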
Adam: RMSProp + Momentum
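In practice Adam is usually taken off the shelf, e.g. via torch.optim.Adam; a minimal usage sketch (the model, data, and hyperparameters below are placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)                      # placeholder model
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,               # base learning rate
                             betas=(0.9, 0.999))    # decay rates for the momentum term (m)
                                                    # and the squared-gradient term (sigma)

x, y = torch.randn(32, 10), torch.randn(32, 1)      # dummy batch
loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()                                    # one Adam update
```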
Learning Rate Scheduling
Learning Rate Decay
As training goes on, we get closer to the destination, so we reduce the learning rate.
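One simple decay rule (an illustrative choice, not the lecture's exact formula) divides the initial learning rate by a growing function of the step t:

```python
# Illustrative decay schedule: eta_t = eta_0 / sqrt(1 + t)
eta_0 = 0.1
for t in range(5):
    eta_t = eta_0 / (1 + t) ** 0.5
    print(t, round(eta_t, 4))    # 0.1, 0.0707, 0.0577, 0.05, 0.0447
```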
Warm Up
Warm up appears in:
- Residual Networks https://arxiv.org/abs/1512.03385
- Transformer https://arxiv.org/abs/1706.03762
Please refer to RAdam https://arxiv.org/abs/1908.03265
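A sketch of the warm-up schedule from the Transformer paper referenced above: the learning rate grows for warmup_steps steps and then decays roughly like 1/sqrt(step) (d_model = 512 and warmup_steps = 4000 are the paper's example values):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warm-up followed by ~1/sqrt(step) decay (Transformer-style schedule)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 10000):
    print(s, transformer_lr(s))   # rises until step 4000, then decays
```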
Summary of Optimization
(Vanilla) Gradient Descent: $\theta_i^{t+1} \leftarrow \theta_i^{t} - \eta\, g_i^{t}$
Various Improvements: $\theta_i^{t+1} \leftarrow \theta_i^{t} - \dfrac{\eta^{t}}{\sigma_i^{t}}\, m_i^{t}$
- $m_i^{t}$: momentum, a weighted sum of the previous gradients (direction matters)
- $\sigma_i^{t}$: root mean square of the gradients (magnitude only)
- $\eta^{t}$: learning rate scheduling (decay, warm up)
- Post title: 03_Hung-yi Lee_What to do if my network fails to train1
- Create time: 2022-03-27 19:15:06
- Post link: Machine-Learning/03-hung-yi-lee-what-to-do-if-my-network-fails-to-train1/
- Copyright notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.