08_Backpropagation
Carpe Tu Black Whistle

In this section, Andrew sidesteps the explicit computation of the partial derivatives and only gives a fairly intuitive mathematical explanation.
Recommended Zhihu article: 吴恩达机器学习:神经网络 | 反向传播算法 (Andrew Ng's Machine Learning: Neural Networks | The Backpropagation Algorithm),
which derives the backpropagation algorithm in full detail.

Cost Function

Definition

  • $L$ = total number of layers in the network
  • $s_l$ = number of units (not counting the bias unit) in layer $l$
  • $K$ = number of output units (classes)

Cost Function of Neural Networks

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\big((h_\Theta(x^{(i)}))_k\big) + \big(1 - y_k^{(i)}\big) \log\big(1 - (h_\Theta(x^{(i)}))_k\big) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \big(\Theta_{j,i}^{(l)}\big)^2$$
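
As a concrete illustration, here is a minimal NumPy sketch of this cost function. The names `nn_cost`, `H`, `Y`, `Thetas`, and `lam` are assumptions of this sketch, not notation from the course.

```python
# A minimal NumPy sketch of the regularized cost function above.
# Assumed inputs (hypothetical names, not from the course):
#   H      - (m, K) matrix of hypothesis outputs h_Theta(x^(i))
#   Y      - (m, K) one-hot label matrix
#   Thetas - list of weight matrices; column 0 is the bias column
#   lam    - regularization parameter lambda
import numpy as np

def nn_cost(H, Y, Thetas, lam):
    m = Y.shape[0]
    # cross-entropy term, summed over all m examples and K classes
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # regularization term: every weight except the bias columns (j = 0)
    reg = sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return J + lam / (2 * m) * reg
```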

Backpropagation Algorithm

Function Building

Objective

To fit the network we need to compute two quantities:

  • the cost function $J(\Theta)$, which we want to minimize: $\min_\Theta J(\Theta)$
  • the partial derivatives $\dfrac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta)$ for every weight in the network

Algorithm

Formula (1): $\delta^{(L)} = a^{(L)} - y$

Formula (2): $\delta^{(l)} = \big((\Theta^{(l)})^T \delta^{(l+1)}\big) \,.\!*\, g'\big(z^{(l)}\big)$

Formula (3): $\dfrac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}} = a_j^{(l)} \delta_i^{(l+1)}$ (ignoring regularization)

(Figure from the Zhihu article linked above, included to aid understanding; a correction applies to its last line.)

Given training set $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$

  • Set $\Delta_{i,j}^{(l)} := 0$ for all $(l, i, j)$ (having a matrix full of zeros)
    For training example $t = 1$ to $m$:
  1. Set $a^{(1)} := x^{(t)}$

  2. Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \dots, L$

  3. Using $y^{(t)}$, compute $\delta^{(L)} = a^{(L)} - y^{(t)}$

    Where $L$ is our total number of layers and $a^{(L)}$ is the vector of outputs of the activation units for the last layer. So our "error values" for the last layer are simply the differences between our network's actual outputs in the last layer and the correct outputs in $y$.

  4. Compute $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$ using $\delta^{(l)} = \big((\Theta^{(l)})^T \delta^{(l+1)}\big) \,.\!*\, g'\big(z^{(l)}\big)$

    The delta values of layer $l$ are calculated by multiplying the delta values in the next layer with the theta matrix of layer $l$. We then element-wise multiply that with a function called $g'$, or g-prime, which is the derivative of the activation function $g$ evaluated with the input values given by $z^{(l)}$.

The g-prime derivative term can also be written out as:

$$g'\big(z^{(l)}\big) = a^{(l)} \,.\!*\, \big(1 - a^{(l)}\big)$$

This is the derivative of the sigmoid activation $g$ with respect to $z^{(l)}$.

  5. $\Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$, or with vectorization, $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$
    Update the $D$ matrix:
  • $D_{i,j}^{(l)} := \frac{1}{m}\big(\Delta_{i,j}^{(l)} + \lambda \Theta_{i,j}^{(l)}\big)$, if $j \neq 0$
  • $D_{i,j}^{(l)} := \frac{1}{m}\Delta_{i,j}^{(l)}$, if $j = 0$

The capital-delta matrix $D$ is used as an "accumulator" to add up our values as we go along and eventually compute our partial derivatives. Thus we get:

$$\frac{\partial}{\partial \Theta_{i,j}^{(l)}} J(\Theta) = D_{i,j}^{(l)}$$
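
Putting the loop together: below is a minimal NumPy sketch of the algorithm above for a 3-layer network (one hidden layer) with sigmoid activations. The function name `backprop` and the weight matrices `Theta1`, `Theta2` (bias columns included) are assumptions of this sketch, not code from the course.

```python
# A minimal sketch of the backpropagation loop described above, assuming a
# 3-layer network (input -> hidden -> output) with sigmoid activations and
# hypothetical weight matrices Theta1, Theta2 whose column 0 is the bias.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(Theta1, Theta2, X, Y, lam):
    """Return (J, D1, D2): cost and partial-derivative matrices."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)  # capital-delta accumulators, all zeros
    Delta2 = np.zeros_like(Theta2)
    J = 0.0
    for t in range(m):
        # Step 1: set a1 to the t-th training example (with bias unit).
        a1 = np.concatenate(([1.0], X[t]))
        # Step 2: forward propagation to compute a2, a3.
        z2 = Theta1 @ a1
        g2 = sigmoid(z2)
        a2 = np.concatenate(([1.0], g2))
        z3 = Theta2 @ a2
        a3 = sigmoid(z3)  # = h_Theta(x^(t))
        y = Y[t]
        # accumulate the (unregularized) cross-entropy cost
        J += -(y @ np.log(a3) + (1 - y) @ np.log(1 - a3)) / m
        # Step 3: output-layer error, formula (1).
        d3 = a3 - y
        # Step 4: hidden-layer error, formula (2); g'(z2) = g2 .* (1 - g2).
        d2 = (Theta2[:, 1:].T @ d3) * (g2 * (1 - g2))
        # Step 5: update the accumulators, Delta += delta * a^T.
        Delta2 += np.outer(d3, a2)
        Delta1 += np.outer(d2, a1)
    # D matrices: regularize every column except the bias column (j = 0).
    D1 = Delta1 / m
    D2 = Delta2 / m
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    # add the regularization term to the cost as well
    J += (lam / (2 * m)) * (np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2))
    return J, D1, D2
```

The returned `D1`, `D2` are exactly the $D^{(l)}$ matrices above, and can be handed directly to a gradient-based optimizer.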

Mathematical Intuition

We introduce an intermediate variable, the "error" of node $j$ in layer $l$:

$$\delta_j^{(l)} = \frac{\partial J}{\partial z_j^{(l)}}$$

Output-Layer Error

  1. Chain Rule: $\delta_j^{(L)} = \dfrac{\partial J}{\partial z_j^{(L)}} = \dfrac{\partial J}{\partial a_j^{(L)}} \cdot \dfrac{\partial a_j^{(L)}}{\partial z_j^{(L)}}$
  2. If and only if $g$ is the sigmoid and $J$ is the cross-entropy cost above, this reduces to $\delta^{(L)} = a^{(L)} - y$, as worked out below.

    The formula above is derived through partial differentiation via the Chain Rule; it is not a simple subtraction of two vectors.
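
Worked out for a single output unit $j$, assuming the sigmoid activation and the cross-entropy cost defined above:

$$\begin{aligned}
\frac{\partial J}{\partial a_j^{(L)}} &= -\frac{y_j}{a_j^{(L)}} + \frac{1 - y_j}{1 - a_j^{(L)}}, \\
\frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} &= g'\big(z_j^{(L)}\big) = a_j^{(L)}\big(1 - a_j^{(L)}\big), \\
\delta_j^{(L)} &= \frac{\partial J}{\partial a_j^{(L)}} \cdot \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} = a_j^{(L)} - y_j .
\end{aligned}$$

The sigmoid factor $a_j^{(L)}\big(1 - a_j^{(L)}\big)$ cancels the denominators, which is exactly why the simple form $a^{(L)} - y$ appears only for this activation/cost pairing.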

Hidden-Layer Error

  1. Chain Rule: $\delta_j^{(l)} = \dfrac{\partial J}{\partial z_j^{(l)}} = \displaystyle\sum_i \dfrac{\partial J}{\partial z_i^{(l+1)}} \cdot \dfrac{\partial z_i^{(l+1)}}{\partial z_j^{(l)}}$
  2. The node computation of the neural network: $z_i^{(l+1)} = \displaystyle\sum_j \Theta_{i,j}^{(l)}\, g\big(z_j^{(l)}\big)$

    Taking the partial derivative gives: $\dfrac{\partial z_i^{(l+1)}}{\partial z_j^{(l)}} = \Theta_{i,j}^{(l)}\, g'\big(z_j^{(l)}\big)$

Chain Rule

Combining the two results above gives the hidden-layer error recurrence, worked out below.
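
Substituting the node derivative into the chain-rule sum and stacking over all units $j$ (with $.\!*$ denoting element-wise multiplication):

$$\begin{aligned}
\delta_j^{(l)} &= \sum_i \frac{\partial J}{\partial z_i^{(l+1)}} \cdot \frac{\partial z_i^{(l+1)}}{\partial z_j^{(l)}} = \sum_i \delta_i^{(l+1)}\, \Theta_{i,j}^{(l)}\, g'\big(z_j^{(l)}\big), \\
\delta^{(l)} &= \Big(\big(\Theta^{(l)}\big)^T \delta^{(l+1)}\Big) \,.\!*\, g'\big(z^{(l)}\big),
\end{aligned}$$

which is exactly formula (2) above.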