Backpropagation


Preliminaries

Suppose the samples are \(\left\{ \left( \pmb{x}_1 , \pmb{y}_1 \right), \left( \pmb{x}_2 , \pmb{y}_2 \right), \dots , \left( \pmb{x}_n , \pmb{y}_n \right) \right\}\). Let \(\pmb a^i_j\) be the output of the \(j\)-th neuron in layer \(i\), \(\pmb z^i_j\) the input to the activation function of the \(j\)-th neuron in layer \(i\), \(\pmb w^i_j\) the vector of weights of the \(j\)-th neuron in layer \(i\), and \(b^i_j\) the bias of the \(j\)-th neuron in layer \(i\). The network has \(l+1\) layers (layer \(0\) is the input layer, layer \(l\) is the output layer, and the rest are hidden layers), and \(l_i\) denotes the number of neurons in layer \(i\).
\(h\left( x \right) = \frac{1}{1+e^{-x}}\) , \(\pmb h(\pmb x) = \pmb h\left( \begin{bmatrix} x_1 \\ x_2\\ \vdots \\ x_n \end{bmatrix}\right) = \begin{bmatrix} h(x_1) \\ h(x_2)\\ \vdots \\ h(x_n) \end{bmatrix}\)

\(\pmb A^0 = \begin{bmatrix} \pmb 1\\ \pmb{x}_1\\ \pmb{x}_2\\ \vdots \\ \pmb{x}_n \end{bmatrix}\), \(\pmb A^i = \begin{bmatrix} \pmb 1\\ \pmb{a}^i_1 \\ \pmb{a}^i_2\\ \vdots \\ \pmb{a}^i_{l_i} \end{bmatrix}\),
\(\pmb{z}^{i}_{j} = \left( \pmb A^{i-1}\right)^T \pmb w^{i}_{j} + b^i_j \pmb 1\), \(\pmb a^i_j = \pmb h(\pmb z ^i_j)\)
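As a sanity check, here is a minimal NumPy sketch of this forward pass for a single sample, folding the per-neuron dot products into one matrix product \(\pmb z^i = \pmb W^i \pmb a^{i-1} + \pmb b^i\); the names `forward`, `Ws`, and `bs` are illustrative, not from the text:

```python
import numpy as np

def h(x):
    """Sigmoid, applied elementwise, matching h above."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, Ws, bs):
    """Forward pass for one sample x.

    Ws[i-1] is W^i with shape (l_i, l_{i-1}); bs[i-1] is b^i.
    Returns the pre-activations z^1..z^l and activations a^0..a^l.
    """
    a, zs, activations = x, [], [x]
    for W, b in zip(Ws, bs):
        z = W @ a + b      # z^i = W^i a^{i-1} + b^i
        a = h(z)           # a^i = h(z^i)
        zs.append(z)
        activations.append(a)
    return zs, activations
```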


Loss Function

\[J(\pmb{W}, \pmb{b})=\frac{1}{n} \sum_{k=1}^{n} \mathcal{L}\left(\pmb{y}_{k}, \hat{\pmb{y}}_{k}\right) \]

Since the expression above is a sum and differentiation distributes over sums, we can compute the loss of each sample separately and add the results at the end. So take out a single sample, denote it \(\left( \pmb{x}, \pmb{y} \right)\) with prediction \(\hat{\pmb y}\) (at this point \(\pmb A\) is \(l\times 1\times p\)), and run backpropagation on it.

Backpropagation

Let us first try the simplest case: the derivative of the loss with respect to the parameters of the \(i\)-th neuron in layer \(l\), \(\frac{\partial J}{\partial \pmb w^l_i}\). Since \(\pmb{A}^{l}_{i} = \pmb h\left(\left( \pmb A^{l-1}\right)^T \pmb w^{l}_{i} + b^l_i \pmb 1\right)\), we have \(\frac{\partial J}{\partial \pmb w^{l}_i} = \frac{\partial \pmb z^{l}_i}{\partial \pmb w^l_i}\frac{\partial \pmb A^{l}_i}{\partial \pmb z^{l}_i}\frac{\partial J}{\partial \pmb A^{l}_i}\), which is easy to obtain. But what if we want the derivative of the loss with respect to the parameters of the \(j\)-th neuron in layer \(l-1\)? Then \(\frac{\partial J}{\partial \pmb w^{l-1}_j} = \frac{\partial \pmb z^{l-1}_j}{\partial \pmb w^{l-1}_j} \frac{\partial \pmb A^{l-1}_j}{\partial \pmb z^{l-1}_j} \frac{\partial \pmb z^{l}_i}{\partial \pmb A^{l-1}_j} \frac{\partial \pmb A^{l}_i}{\partial \pmb z^{l}_i} \frac{\partial J}{\partial \pmb A^{l}_i}\) (strictly, summed over all neurons \(i\) of layer \(l\)), and the trailing factors have to be recomputed from scratch. This is why the backpropagation algorithm is used.
(figure: computation graph of the network)
As the figure shows, the reverse direction of the arrows is the direction of backpropagation. To save time, we can keep a separate variable that stores, for each layer, the derivative of the loss with respect to \(z\): let \(\delta^i = \frac{\partial J}{\partial \pmb z^i}\). Multiplying by \(\frac{\partial \pmb z ^i}{ \partial \pmb W}\) or \(\frac{\partial \pmb z ^i}{ \partial \pmb b}\) then gives the final result.
Suppose we have already computed \(\delta^l = \frac{\partial J}{\partial \pmb z^l}\). Then \(\delta^{l-1} = \frac{\partial \pmb a^{l-1}}{\partial \pmb z^{l-1}}\frac{\partial \pmb z^l}{\partial \pmb a^{l-1}}\frac{\partial J}{\partial \pmb z^l} = \frac{\partial \pmb a^{l-1}}{\partial \pmb z^{l-1}}\frac{\partial \pmb z^l}{\partial \pmb a^{l-1}}\delta^{l}\), which gives us a backward recurrence:

\[{\Large \begin{aligned} \delta^{i} &= \frac{\partial \mathcal{L}(\pmb{y}, \hat{\pmb{y}})}{\partial \pmb{z}^{i}} \\ &=\frac{\partial \pmb{a}^{i}}{\partial \pmb{z}^{i}} \cdot \frac{\partial \pmb{z}^{i+1}}{\partial \pmb{a}^{i}} \cdot \frac{\partial \mathcal{L}(\pmb{y}, \hat{\pmb{y}})}{\partial \pmb{z}^{i+1}} \\ &=\operatorname{diag}\left(\pmb h^{\prime}\left(\pmb{z}^{i}\right)\right) \cdot\left(\pmb{W}^{i+1}\right)^T \cdot \delta^{i+1} \\ &=\pmb h^{\prime}\left(\pmb{z}^{i}\right) \odot\left(\left(\pmb{W}^{i+1}\right)^T \delta^{i+1}\right) \end{aligned}} \]
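A hedged sketch of this recurrence in NumPy, reusing `Ws` and the `zs` returned by `forward` above (`backward_deltas` and `h_prime` are illustrative names):

```python
def h_prime(z):
    """Sigmoid derivative: h'(z) = h(z) * (1 - h(z))."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def backward_deltas(zs, Ws, delta_out):
    """Backward recurrence: delta^i = h'(z^i) * ((W^{i+1})^T delta^{i+1}).

    delta_out is delta^l at the output layer; returns [delta^1, ..., delta^l].
    """
    deltas = [delta_out]
    for i in range(len(Ws) - 1, 0, -1):
        deltas.insert(0, h_prime(zs[i - 1]) * (Ws[i].T @ deltas[0]))
    return deltas
```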

Here \(\pmb a^l_j = \hat{y}_j\), so
\(\frac{\partial J}{\partial a^l_j} = \frac{\partial \mathcal{L}(\pmb{y}, \hat{\pmb{y}})}{\partial a^l_j} = \frac{\partial \mathcal{L}(\pmb{y}, \hat{\pmb{y}})}{\partial \hat{y}_j}\), and this last partial derivative can be computed as soon as the loss function is known.
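For instance (an illustrative example, not from the original text), with the squared-error loss we get

\[\mathcal{L}(\pmb{y}, \hat{\pmb{y}}) = \frac{1}{2}\left\lVert \hat{\pmb{y}} - \pmb{y} \right\rVert^2 ,\qquad \frac{\partial \mathcal{L}(\pmb{y}, \hat{\pmb{y}})}{\partial \hat{y}_j} = \hat{y}_j - y_j ,\qquad \delta^l = \left(\hat{\pmb{y}} - \pmb{y}\right) \odot \pmb h^{\prime}\!\left(\pmb z^l\right)\]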
To obtain the partial derivative with respect to the parameters of a given layer, we only need to take the \(\delta\) at the branching point in the graph and multiply it by the partial derivative of \(z\) with respect to that parameter.

\[{\LARGE \begin{array}{rcl} \frac{\partial \mathcal{L}(\pmb{y}, \hat{\pmb{y}})}{\partial w_{ij}^{l}}&=&a_{j}^{l-1}\, \delta^{l}_{i}\\ \frac{\partial \mathcal{L}(\pmb{y}, \hat{\pmb{y}})}{\partial b^{l}_{i}}&=&\delta^{l}_{i} \end{array}}\]
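In code, these two formulas are one outer product and one copy per layer; a sketch that consumes the `deltas` and `activations` from the earlier snippets (the name `layer_grads` is illustrative):

```python
def layer_grads(deltas, activations):
    """dL/dW^i = delta^i (a^{i-1})^T and dL/db^i = delta^i, per layer."""
    grads_W = [np.outer(d, a_prev)                 # (l_i, l_{i-1}) matrix
               for d, a_prev in zip(deltas, activations[:-1])]
    grads_b = [d.copy() for d in deltas]
    return grads_W, grads_b
```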

So for each sample we first run a forward pass to compute the loss, then a backward pass to compute the partial derivative with respect to each parameter \(w^i_{jk}\). As noted above, the loss is a sum and differentiation distributes over it, so differentiating first and accumulating afterwards yields the same final derivative. Let \(\Delta^i_{jk}\) denote the accumulated derivative of the data term of the loss; initialize it to zero and keep accumulating:

\[\Delta_{jk}^{i}:=\Delta_{jk}^{i}+a_{k}^{i} \delta_{j}^{i+1} \]

Adding the partial derivative of the regularization term then gives the final derivative.
In summary,

\[\begin{array}{ll} \frac{\partial J}{\partial w^i_{jk}}=\frac{1}{n} \Delta_{jk}^{i}+\lambda w_{jk}^{i} &\text { if } k \neq 0 \\ \frac{\partial J}{\partial b^i_{j}}=\frac{1}{n} \Delta_{j0}^{i} &\text { if } k=0 \end{array}\]
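Putting the pieces together, a hedged end-to-end sketch of the accumulate-then-regularize procedure, reusing `forward`, `h_prime`, `backward_deltas`, and `layer_grads` from the earlier snippets and assuming the squared-error example for the output delta:

```python
def full_gradient(samples, Ws, bs, lam):
    """Average per-sample gradients over n samples, then add L2 regularization."""
    acc_W = [np.zeros_like(W) for W in Ws]   # the accumulators Delta
    acc_b = [np.zeros_like(b) for b in bs]
    for x, y in samples:
        zs, activations = forward(x, Ws, bs)
        delta_out = (activations[-1] - y) * h_prime(zs[-1])  # squared-error delta^l
        gW, gb = layer_grads(backward_deltas(zs, Ws, delta_out), activations)
        for i in range(len(Ws)):             # Delta := Delta + a^i delta^{i+1}
            acc_W[i] += gW[i]
            acc_b[i] += gb[i]
    n = len(samples)
    grads_W = [d / n + lam * W for d, W in zip(acc_W, Ws)]  # weights: add lambda*w
    grads_b = [d / n for d in acc_b]                        # biases: no penalty
    return grads_W, grads_b
```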
