Preface
I have already worked through Andrew Ng's CS229, but that course is fairly dated (2016). Its assignments are beautifully designed, yet they are written in Matlab, while the mainstream language for research and development today is Python, so it is still some distance from real-world algorithm deployment. CS229 (2016) was mostly practice writing toy models, but walking through the machine learning theory once was still worthwhile. This course serves as reinforcement and consolidation.
This course focuses on Deep Learning
Course website: ML 2022 Spring
课程Github:https://github.com/virginiakm1988/ML2022-Spring
The repository contains the code and slides for the 15 homeworks of Machine Learning, instructed by Hung-yi Lee.
Prof. Lee has uploaded the recordings of all 15 lectures, and posts new content weekly on YouTube. I will study from the 2022 course recordings. Let's go.
Intro of the Course
HW1: COVID-19 Case Prediction
HW2: Phoneme Classification
HW3: Image Classification
HW4: Speaker Classification
HW5: Machine Translation
HW6: Anime Face Generation
Lecture 7: Self-supervised Learning
Pretraining on the large numbers of unlabeled images crawled from search engines yields better downstream training results.
Pre-trained Model (a.k.a. Foundation Model) vs. Downstream Tasks
Lecture 6: GAN
Lecture 12: Reinforcement Learning (RL)
Lecture 8: Anomaly Detection
Lecture 9: Explainable AI
Lecture 10: Model Attack
Lecture 11: Domain Adaptation
Lecture 13: Network Compression
Lecture 14: Life-long Learning
Lecture 15: Meta Learning
Few-shot learning is usually achieved by meta-learning.
Machine Learning
Andrew Ng: a kind of implicit programming, in which the machine is trained rather than explicitly programmed.
Hung-yi: Machine Learning ≈ looking for a function.
Different types of Functions
Regression
The func outputs a scalar.
Classification
Given options(classes), the func outputs the correct one.
Structured Learning
create something with structure(image, document)
Pipeline
Function with Unknown Parameters
Define Loss from Training Data
Loss is a func of parameters
Example: predicting the channel's view count (using data from the days before the prediction date).
Prediction error $e$: a function of the difference between the label $\hat{y}$ and the predicted result $y$.

- If $e = |y - \hat{y}|$, $L$ is the mean absolute error (MAE).
- If $e = (y - \hat{y})^2$, $L$ is the mean squared error (MSE).
- Total loss: $L = \frac{1}{N}\sum_{n} e_n$
Optimization
The only optimization method this course covers: Gradient Descent.
- (Randomly) pick an initial value $w^0$
- Compute the gradient $\left.\frac{\partial L}{\partial w}\right|_{w=w^0}$
- Update $w$ iteratively: $w^1 \leftarrow w^0 - \eta \left.\frac{\partial L}{\partial w}\right|_{w=w^0}$

The learning rate $\eta$ is a hyperparameter: a parameter set by a human rather than learned from the data.
Gradient Descent has a known issue: it can converge to a local minimum rather than the global one.
Modifications to the model usually come from an understanding of the problem itself (domain knowledge).
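To make the three steps concrete, here is a minimal sketch in plain numpy, assuming a linear model $y = b + wx_1$, an MSE loss, and a fixed learning rate; the data and hyperparameters are made up for illustration.

```python
import numpy as np

# Hypothetical data: yesterday's views x -> today's views y (in thousands)
x = np.array([4.8, 4.9, 5.1, 5.3, 5.0, 5.4, 5.6])
y = np.array([4.9, 5.1, 5.2, 5.0, 5.3, 5.5, 5.7])

# Step 1: function with unknown parameters, y = b + w * x1
w, b = 0.0, 0.0          # initial values (could also be random)
eta = 0.01               # learning rate, a hyperparameter

for step in range(1000):
    y_hat = b + w * x                      # model prediction
    # Step 2: loss = mean squared error over the training examples
    loss = np.mean((y - y_hat) ** 2)
    # Step 3: gradient descent update
    grad_w = np.mean(2 * (y_hat - y) * x)  # dL/dw
    grad_b = np.mean(2 * (y_hat - y))      # dL/db
    w -= eta * grad_w
    b -= eta * grad_b

print(f"w={w:.3f}, b={b:.3f}, loss={loss:.4f}")
```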
Neural Network
Linear models have a severe limitation, known as model bias, so we need more sophisticated models.
Piecewise Linear Curves
Any continuous curve can be approximated by a piecewise linear curve, given sufficiently many pieces.
Sigmoid
The blue function in the slide is called a Hard Sigmoid.
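The lecture's smooth approximation replaces each Hard Sigmoid with a (soft) sigmoid; summing several of them plus a constant reproduces any piecewise linear curve:

$$y = c\,\operatorname{sigmoid}(b + wx_1) = \frac{c}{1 + e^{-(b + wx_1)}} \quad\Longrightarrow\quad y = b + \sum_i c_i\,\operatorname{sigmoid}(b_i + w_i x_1)$$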
Vectorization
Unknown parameters: the entries of $W$, $\boldsymbol{b}$, $\boldsymbol{c}^T$ and $b$ are all flattened and concatenated into one long vector $\theta$, so the model becomes $y = b + \boldsymbol{c}^{T}\sigma(\boldsymbol{b} + W\boldsymbol{x})$.
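A minimal numpy sketch of the vectorized forward pass $y = b + \boldsymbol{c}^T\sigma(\boldsymbol{b} + W\boldsymbol{x})$; the sizes (3 features, 16 sigmoids) are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 3 input features, 16 sigmoid units
rng = np.random.default_rng(0)
x  = rng.normal(size=3)        # input feature vector
W  = rng.normal(size=(16, 3))  # weights, one row per sigmoid
b1 = rng.normal(size=16)       # per-sigmoid biases (the vector b)
c  = rng.normal(size=16)       # per-sigmoid output weights (the vector c)
b0 = 0.5                       # the scalar bias b

y = b0 + c @ sigmoid(W @ x + b1)   # y = b + c^T sigma(b + W x)
print(y)
```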
Updated ML pipeline
- Function with unknown parameters: $y = b + \boldsymbol{c}^{T}\sigma(\boldsymbol{b} + W\boldsymbol{x})$
- Loss function
  - Loss is a function of the parameters: $L(\theta)$
  - Loss means how good a set of parameter values is.
- Optimization
  - (Randomly) pick initial values $\theta^0$
  - Compute the gradient $\boldsymbol{g} = \nabla L(\theta^0)$, update $\theta^1 \leftarrow \theta^0 - \eta\,\boldsymbol{g}$
  - Compute the gradient $\boldsymbol{g} = \nabla L(\theta^1)$, update $\theta^2 \leftarrow \theta^1 - \eta\,\boldsymbol{g}$
  - Repeat; in practice each gradient is computed on one batch of the training data.
1 epoch = see all the batches once.
1 update = update the parameters once.
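A minimal PyTorch sketch of the bookkeeping, assuming a placeholder linear model and random data: with 1,000 examples and a batch size of 100, one epoch performs 10 updates.

```python
import torch

# Placeholder data: 1000 examples, 8 features each
X = torch.randn(1000, 8)
y = torch.randn(1000, 1)
model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

B = 100  # batch size -> 10 updates per epoch
for epoch in range(3):                      # 1 epoch = see all the batches once
    perm = torch.randperm(len(X))           # reshuffle each epoch
    for i in range(0, len(X), B):
        idx = perm[i:i + B]
        loss = loss_fn(model(X[idx]), y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()                          # 1 update = change the parameters once
```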
Rectified Linear Unit
The Rectified Linear Unit (ReLU) computes $y = c\,\max(0,\, b + wx_1)$.
Two suitably chosen ReLUs can be combined into one Hard Sigmoid.
Sigmoid and ReLU are called activation functions.
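A quick numeric check of the two-ReLU claim, with made-up parameters: the difference of two shifted ReLUs is flat, then a ramp, then flat again, i.e. a Hard Sigmoid.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

# A Hard Sigmoid rising from 0 to 1 over x in [0, 1], built from two ReLUs
x = np.linspace(-2, 3, 11)
hard_sigmoid = relu(x) - relu(x - 1)   # slope 1 on [0, 1], flat elsewhere
print(np.round(hard_sigmoid, 2))       # 0 ... ramp ... 1
```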
MAE of the view-count prediction (the model is trained on the 2017-2020 data; 2021 is unseen):

| | linear | 10 ReLU | 100 ReLU |
| --- | --- | --- | --- |
| 2017-2020 | 0.32k | 0.32k | 0.28k |
| 2021 | 0.46k | 0.45k | 0.43k |
Multiple Layers
- Loss for multiple hidden layers
- 100 ReLU for each layer
- input features are the no. of views in the past 56 days
| | 1 layer | 2 layers | 3 layers |
| --- | --- | --- | --- |
| 2017-2020 | 0.28k | 0.18k | 0.14k |
| 2021 | 0.43k | 0.39k | 0.38k |
Deep Learning
Deep Learning can replace feature engineering.
Fully Connected Feedforward Network
Given a network structure, we define a function set.
When writing out the equations of a neural network, we usually express them as matrix operations, which makes GPU acceleration straightforward.
Hidden layers can be seen as a feature extractor that replaces manual feature engineering.
When building a neural network for multi-class classification, the output layer is usually a Softmax.
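A minimal PyTorch sketch of such a fully connected feedforward classifier; the layer sizes are placeholders, and note that in practice the softmax is usually folded into the loss (`nn.CrossEntropyLoss`) rather than placed inside the model.

```python
import torch
import torch.nn as nn

# Placeholder sizes: 784 input features, 10 classes
model = nn.Sequential(
    nn.Linear(784, 64),  # hidden layer 1: feature extraction
    nn.ReLU(),
    nn.Linear(64, 64),   # hidden layer 2
    nn.ReLU(),
    nn.Linear(64, 10),   # output layer: one logit per class
)

x = torch.randn(32, 784)                 # a batch of 32 examples
probs = torch.softmax(model(x), dim=-1)  # softmax turns logits into class probabilities
print(probs.sum(dim=-1))                 # each row sums to 1
```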
Selecting the no. of layers
Q: How many layers? How many neurons for each layer?
Q: Can the structure be automatically determined?
- E.g. Evolutionary Artificial Neural Networks
Universality Theorem
Deep is better?
Any continuous function $f: \mathbb{R}^N \to \mathbb{R}^M$ can be realized by a network with one hidden layer (given enough hidden neurons).
Backpropagation
I already worked through the backpropagation derivation back in CS229; see this post:
https://carp2i.github.io/2022/01/10/ML08/
- Forward pass: compute $\partial z/\partial w$ for every parameter $w$, where $z$ is the input to an activation function; this partial derivative is simply the value of the input connected to $w$.
- Backward pass: compute $\partial C/\partial z$ for all activation-function inputs $z$, propagating from the output layer backwards.
- Combining the two: $\dfrac{\partial C}{\partial w} = \dfrac{\partial z}{\partial w}\cdot\dfrac{\partial C}{\partial z}$.
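A small numpy sketch of both passes for a 3-4-1 network with sigmoid activations and squared-error cost $C$; the data and sizes are made up.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Made-up data and parameters: 3 inputs -> 4 hidden -> 1 output
rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

# Forward pass: keep the z's (activation inputs) and a's (activations)
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; y_hat = sigmoid(z2)
C = 0.5 * (y_hat - y) ** 2                # squared-error cost

# Backward pass: dC/dz, starting from the output layer
dz2 = (y_hat - y) * y_hat * (1 - y_hat)   # dC/dz2
dz1 = (W2.T @ dz2) * a1 * (1 - a1)        # dC/dz1

# dC/dw = (dz/dw) * (dC/dz); dz/dw is the incoming activation
dW2 = np.outer(dz2, a1)                   # dC/dW2  (dC/db2 is just dz2)
dW1 = np.outer(dz1, x)                    # dC/dW1  (dC/db1 is just dz1)
print(dW1.shape, dW2.shape)
```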
Regression
Estimating the CP of a pokemon
CP: the Combat Power
Step 1: Model
Linear model: $y = b + w \cdot x_{cp}$
Step 2: Goodness of Function
Training data: 10 Pokemons
Loss func
The loss function is a function of functions.
Input: a func, output: how bad it is
$$L(f) = L(w, b) = \sum_{n=1}^{10}\left(\hat{y}^n - \left(b + w \cdot x_{cp}^n\right)\right)^2$$

- Sum over the 10 training examples
- $b + w \cdot x_{cp}^n$ is $y$ estimated by the input function
Step 3: Best Function

$$f^* = \arg\min_f L(f), \qquad w^*, b^* = \arg\min_{w,b} L(w, b)$$
What we really care about is the error on new data (testing data).
Selecting another Model

If the function set performs badly even on the training data, you should go back to Step 1 and redesign the model, e.g. by adding higher-order terms such as $y = b + w_1 \cdot x_{cp} + w_2 \cdot (x_{cp})^2$. A more complex model always achieves lower training error, but beyond some complexity the testing error grows again: overfitting.
Redesign
If you were Professor Oak, you would have plenty of domain knowledge to guide the redesign; without it, we can only throw all the available features into the model.
Regularization
Q: Why are smooth functions preferred?

A: If some noise corrupts the input $x$ at test time, a smoother function is less affected by it and gives more stable results.

Training error: the larger $\lambda$ is, the less the training error is taken into account (the function is forced to be smoother, so the training error grows).

We prefer smooth functions, but they shouldn't be too smooth.
Select the $\lambda$ that gives the lowest error on the testing data.
When applying regularization, the bias term should not be included: it only shifts the function and does not affect its smoothness.
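The regularized loss from the lecture: the $\lambda$ term pushes the weights $w_i$ toward small values (smoother functions), and the bias $b$ is left out of the penalty:

$$L = \sum_n \left(\hat{y}^n - \left(b + \sum_i w_i x_i^n\right)\right)^2 + \lambda \sum_i (w_i)^2$$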
Classification
Pokemon Classification
- Total: sum of all stats that come after this, a general guide to how strong a pokemon is
- HP: hit points, or health, defines how much damage a pokemon can withstand before fainting
- Attack: the base modifier for normal attacks(eg. Scratch, Punch)
- Defense: the base damage resistance against normal attacks
- SP Atk: special attack, the base modifier for special attacks(e.g. fire blast, bubble beam)
- SP Def: the base damage resistance against special attacks
- Speed: determines which pokemon attacks first each round
How to do Classification
- Training data for Classification
Classification as Regression?
Take binary classification as an example.
Training: Class 1 means the target is 1; Class 2 means the target is -1
Testing: output closer to 1 → class 1; closer to -1 → class 2.
Using regression for binary classification penalizes results that are "too correct" (outputs far beyond the target, e.g. much larger than 1), dragging the decision boundary toward those examples.
Ideal Alternatives
Function (Model): $f(x) = \begin{cases} \text{class 1} & g(x) > 0 \\ \text{class 2} & \text{otherwise} \end{cases}$

Loss function: $L(f) = \sum_n \delta\left(f(x^n) \neq \hat{y}^n\right)$, the number of times $f$ gets incorrect results on the training data.

Find the best function: $f^* = \arg\min_f L(f)$. This loss is not differentiable, so gradient descent cannot be used directly.

- Example solutions: perceptron, SVM (the classic way)
Gaussian Distribution
The attribute values of Pokemons are assumed to follow a normal (Gaussian) distribution.
Input: a vector $x$; output: the probability density of sampling $x$.
The shape of the function is determined by the mean $\mu$ and the covariance matrix $\Sigma$:

$$f_{\mu,\Sigma}(x) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
Assume the points are sampled from a Gaussian distribution
Find the Gaussian distribution behind them
Maximum Likelihood
The Gaussian with any mean $\mu$ and covariance $\Sigma$ could have generated these points, just with different likelihoods. The likelihood of a Gaussian with mean $\mu$ and covariance $\Sigma$ generating the 79 samples is

$$L(\mu, \Sigma) = \prod_{n=1}^{79} f_{\mu,\Sigma}(x^n)$$

We assume the samples $x^1, \dots, x^{79}$ are drawn independently from this Gaussian, and pick the parameters that maximize the likelihood:

$$\mu^*, \Sigma^* = \arg\max_{\mu,\Sigma} L(\mu, \Sigma)$$
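The maximization has the familiar closed-form solution, the sample mean and sample covariance:

$$\mu^* = \frac{1}{79}\sum_{n=1}^{79} x^n, \qquad \Sigma^* = \frac{1}{79}\sum_{n=1}^{79} (x^n - \mu^*)(x^n - \mu^*)^T$$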
After doing all this, we find that even using all 7 features, the prediction results are still poor.
Modifying Model
In practice you rarely see each class-conditional Gaussian with its own mean and its own covariance. The covariance matrix has a number of parameters quadratic in the number of features, so larger feature vectors give the model many parameters and make it easier to overfit. The common modification is to let the classes share a single covariance matrix while keeping separate means.
- Maximum likelihood: find the $\mu^1$, $\mu^2$, $\Sigma$ maximizing the likelihood $L(\mu^1, \mu^2, \Sigma)$. The solution keeps the per-class sample means, and the shared covariance is the weighted average $\Sigma = \frac{N_1}{N_1+N_2}\Sigma^1 + \frac{N_2}{N_1+N_2}\Sigma^2$.
With the shared covariance matrix, the decision boundary becomes linear, which is why this model is also regarded as a linear model.
Recall the three steps: model, goodness of function, find the best function.
If you assume all the dimensions are independent, then you are using a Naive Bayes Classifier.
Posterior Probability
After a series of algebraic manipulations, the posterior simplifies to:

$$P(C_1 \mid x) = \frac{1}{1 + e^{-z}} = \sigma(z), \qquad z = \boldsymbol{w} \cdot x + b$$

In the generative model, we estimate $N_1$, $N_2$, $\mu^1$, $\mu^2$, $\Sigma$, and from them compute $\boldsymbol{w}$ and $b$.
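For reference, carrying the simplification through (as in the lecture) yields the explicit forms:

$$\boldsymbol{w}^T = (\mu^1 - \mu^2)^T \Sigma^{-1}, \qquad b = -\frac{1}{2}(\mu^1)^T \Sigma^{-1} \mu^1 + \frac{1}{2}(\mu^2)^T \Sigma^{-1} \mu^2 + \ln\frac{N_1}{N_2}$$

This raises the question: if all we need in the end are $\boldsymbol{w}$ and $b$, why not find them directly? That is the motivation for logistic regression.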