Validation

[Figure: a Model is trained on the Training Set and then validated on the Validation Set.]

Using the Validation Set to select a model can itself be considered "training" on the Validation Set. In that view, your model is the set of candidate models being compared, and validation picks one function out of that set.

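Overfitting: drawing a bad sample of training data yields a model whose behavior does not match the underlying data distribution, so it generalizes poorly.

The more candidate functions there are to choose from (that is, the larger and more complex the model), the higher the chance that the selected model generalizes poorly.

Hopefully the number of candidate models is small.
If the number of candidates is large, you still have a high chance of overfitting.

Even with a Validation Set, overfitting is still possible if there are too many candidate models to choose from.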
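A small simulation can make this concrete. The sketch below is an illustration, not part of the lecture: it assumes every candidate model is a pure random guesser on a binary task (true accuracy 0.5), and the function name `best_validation_accuracy` and the set sizes are hypothetical. Selecting the best of many such candidates on a small Validation Set tends to report an accuracy well above 0.5, even though no candidate is actually better than chance.

```python
import random

def best_validation_accuracy(num_candidates, n_val, rng):
    """Pick the candidate with the highest validation accuracy.

    Assumption: each candidate guesses at random, so each validation
    example is correct with probability 0.5 and the true accuracy of
    every candidate is exactly 0.5.
    """
    best = 0.0
    for _ in range(num_candidates):
        correct = sum(1 for _ in range(n_val) if rng.random() < 0.5)
        best = max(best, correct / n_val)
    return best

rng = random.Random(0)  # fixed seed so the run is repeatable
for k in (1, 10, 100, 1000):
    acc = best_validation_accuracy(k, n_val=50, rng=rng)
    print(f"{k:5d} candidates: best validation accuracy = {acc:.2f}, true accuracy = 0.50")
```

The gap between the reported validation accuracy and the true accuracy tends to grow with the number of candidates, which is the overfitting described above.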

Deep learning

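The model's function set: the set of all functions the model can represent.

The larger the function set, the lower the ideal loss can be, but the gap between ideal and reality is large.
The smaller the function set, the higher the ideal loss, but the gap is small.

Deep learning: having it both ways

Deep learning can keep the function set small while still reaching a small loss.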

Why Hidden Layer?

In theory, a network with a single hidden layer can represent any function, given enough neurons.

Seide, Frank, Gang Li, and Dong Yu. “Conversational Speech Transcription Using Context-Dependent Deep Neural Networks.” Interspeech. 2011.

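This 2011 paper verified experimentally that the deeper the network, the lower the error.

Fat vs. Deep

With the same number of parameters, going deeper works better than making the network fatter.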

Yes, one hidden layer can represent any function.
However, using a deep structure is more effective.

The real strength of deep learning is, perhaps surprisingly, that it is less prone to overfitting.

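Explanation by analogy

1. Logic circuits can implement any logic function, but nobody builds them by enumerating every input case, because that wastes logic gates.
Likewise, computers are built with more ingenious structures.

2. Software design works the same way: subroutines are heavily reused, which improves development efficiency and saves program memory.

3. The paper-folding example.

Intuitive explanation

The example given in class is simple and intuitive: to produce an equally complex function, a deep model needs far fewer parameters.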
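The folding idea can be sketched in a few lines. This is an illustrative construction, not necessarily the exact example from the lecture: `fold` is a hypothetical layer with two ReLU neurons that computes a tent function on [0, 1], and stacking it `depth` times yields a sawtooth with 2^depth linear pieces. A network with a single hidden ReLU layer would need at least 2^depth - 1 neurons for the same one-dimensional function, because each neuron contributes at most one kink.

```python
def relu(x):
    return max(0.0, x)

def fold(x):
    # Two ReLU neurons: rises from 0 to 1 on [0, 0.5], falls back to 0 on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep(x, depth):
    # Stack the same two-neuron layer `depth` times.
    for _ in range(depth):
        x = fold(x)
    return x

def count_linear_pieces(f, n=1024):
    # Sample f on a uniform grid over [0, 1] and count how often the slope changes.
    # The grid is dyadic, so every kink of `deep` (depth <= 10) falls on a grid point.
    ys = [f(i / n) for i in range(n + 1)]
    slopes = [ys[i + 1] - ys[i] for i in range(n)]
    changes = sum(1 for a, b in zip(slopes, slopes[1:]) if a != b)
    return changes + 1

for depth in (1, 2, 3, 4):
    pieces = count_linear_pieces(lambda x: deep(x, depth))
    print(f"depth {depth}: {2 * depth} hidden neurons, {pieces} linear pieces")
```

The neuron count grows linearly with depth while the number of pieces doubles with every layer.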

deeper is better

Deep networks outperform shallow ones when the required functions are complex and regular.
Deep is exponentially better than shallow even when.

Spatial Transformer Layer

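CNN is not invariant to scaling and rotation.
CNN does have some translation invariance, possibly thanks to max pooling: if the subject shifts by a small distance, recognition is not much affected.

Spatial transformer layer: can rotate and scale the input image. This layer is itself a neural network module.
End-to-end learning (training): the parameters of the spatial transformer layer and the CNN can be stacked together and trained jointly.

It can also transform feature maps: the intermediate layers of a CNN can be viewed as images to be transformed.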

How to transform an image/feature map

I did not fully understand this part.

General Layer:

If we want to translate the image as above:
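One way to read this, sketched under the assumption that the general layer is a fully connected layer whose weights connect every output position (n, m) to every input position (i, j): translation is then just a particular choice of 0/1 weights. The names `general_layer` and `shift_down_weights` are hypothetical, and the shift direction (one row down) is chosen only for illustration.

```python
def general_layer(weights, image):
    # out[n][m] = sum over (i, j) of weights[n][m][i][j] * image[i][j]
    size = len(image)
    return [
        [
            sum(weights[n][m][i][j] * image[i][j]
                for i in range(size) for j in range(size))
            for m in range(size)
        ]
        for n in range(size)
    ]

def shift_down_weights(size):
    # Weight is 1 only where the input sits one row above the output position,
    # so out[n][m] = image[n - 1][m]; the top row has no source and becomes 0.
    return [
        [
            [[1.0 if (i == n - 1 and j == m) else 0.0 for j in range(size)]
             for i in range(size)]
            for m in range(size)
        ]
        for n in range(size)
    ]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
for row in general_layer(shift_down_weights(3), image):
    print(row)
```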

Image Transformation

Expansion, Compression, Translation

Rotation

affine transformation

6 parameters to describe the affine transformation
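
As a minimal sketch of the six-parameter form, assume the transformation maps a coordinate (x, y) to (a*x + b*y + e, c*x + d*y + f). The parameter names a through f and the helper `affine` are illustrative choices, not notation confirmed by these notes. Scaling, translation, and rotation are then special cases:

```python
import math

def affine(params, x, y):
    """Apply a 2D affine map given as the tuple (a, b, c, d, e, f)."""
    a, b, c, d, e, f = params
    return (a * x + b * y + e, c * x + d * y + f)

def scaling(sx, sy):
    # Expansion when the factor is > 1, compression when it is < 1.
    return (sx, 0.0, 0.0, sy, 0.0, 0.0)

def translation(tx, ty):
    return (1.0, 0.0, 0.0, 1.0, tx, ty)

def rotation(theta):
    # Rotation by theta radians about the origin
    # (counter-clockwise when the y axis points up).
    return (math.cos(theta), -math.sin(theta),
            math.sin(theta), math.cos(theta), 0.0, 0.0)

print(affine(scaling(2.0, 2.0), 1.0, 3.0))
print(affine(translation(1.0, -1.0), 1.0, 3.0))
print(affine(rotation(math.pi / 2), 1.0, 0.0))
```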

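If the affine transformation produces non-integer coordinates, the result has to be rounded to the nearest integer position.
With rounding, the parameters cannot be learned by gradient descent: a small change in the parameters usually leaves the rounded output unchanged, so the gradient is zero.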

Interpolation

Now we can use gradient descent
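
The difference can be checked numerically. The sketch below assumes bilinear interpolation, which is one common choice and not necessarily the only one meant here; the helper names are hypothetical. It compares the finite-difference slope of a rounded (nearest-pixel) read against a bilinear read on a tiny image.

```python
import math

def nearest_sample(img, x, y):
    """Read the pixel at the rounded coordinate (img is indexed img[row][col])."""
    h, w = len(img), len(img[0])
    col = min(max(int(round(x)), 0), w - 1)
    row = min(max(int(round(y)), 0), h - 1)
    return img[row][col]

def bilinear_sample(img, x, y):
    """Blend the four pixels around (x, y); needs an image of at least 2x2."""
    h, w = len(img), len(img[0])
    x0 = min(max(math.floor(x), 0), w - 2)
    y0 = min(max(math.floor(y), 0), h - 2)
    dx, dy = x - x0, y - y0
    top = img[y0][x0] * (1 - dx) + img[y0][x0 + 1] * dx
    bottom = img[y0 + 1][x0] * (1 - dx) + img[y0 + 1][x0 + 1] * dx
    return top * (1 - dy) + bottom * dy

def slope_in_x(sample, img, x, y, step=1e-3):
    """Finite-difference estimate of d(sample)/dx at (x, y)."""
    return (sample(img, x + step, y) - sample(img, x, y)) / step

img = [[0.0, 1.0],
       [2.0, 3.0]]
print("rounded read, slope in x: ", slope_in_x(nearest_sample, img, 0.3, 0.3))
print("bilinear read, slope in x:", slope_in_x(bilinear_sample, img, 0.3, 0.3))
```

The rounded read does not respond to a small shift of the sampling position, while the bilinear read does, which is what gives gradient descent a usable signal.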