[Recurrent Neural Networks 2] Long Short-Term Memory (LSTM) Networks Explained - Guide - slgkaifa

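1 Background of LSTM

Let us first look at the background that led to long short-term memory networks. Recall the formula for backpropagating the error term through time, derived in the previous post of this series, 【循环神经网络1】一文搞定RNN入门与详解 (CSDN blog):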

$\delta_k^\top = \delta_t^\top \prod_{i=k}^{t-1} \text{diag}[f'(\mathbf{z}_i)] \mathbf{W}$

We can use the following inequality to obtain an upper bound on the norm of $\delta_k^\top$ (the norm can be regarded as a measure of the magnitude of the entries of $\delta_k^\top$):

$\|\delta_k^\top\| \leq \|\delta_t^\top\| \prod_{i=k}^{t-1} \|\text{diag}[f'(\mathbf{z}_i)]\| \|\mathbf{W}\| \leq \|\delta_t^\top\| (\beta_f \beta_\mathbf{W})^{t-k}$

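We can see that when the error term $\delta$ is propagated from time t back to time k, the upper bound on its norm is an exponential function of $\beta_f \beta_\mathbf{W}$. Here $\beta_f$ and $\beta_\mathbf{W}$ are upper bounds on the norms of the diagonal matrix $\text{diag}[f'(\mathbf{z}_i)]$ and of the matrix $\mathbf{W}$, respectively. Clearly, unless the product $\beta_f \beta_\mathbf{W}$ is close to 1, when t-k is large (that is, when the error is propagated across many time steps) the bound becomes either extremely small (when $\beta_f \beta_\mathbf{W}$ is less than 1) or extremely large (when $\beta_f \beta_\mathbf{W}$ is greater than 1). The former corresponds to vanishing gradients and the latter to exploding gradients. Researchers have devised many tricks (such as careful weight initialization) to keep $\beta_f \beta_\mathbf{W}$ as close to 1 as possible, but these can hardly withstand the power of an exponential.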
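To get a feel for how quickly this bound moves away from 1, here is a small sketch that simply evaluates $(\beta_f \beta_\mathbf{W})^{t-k}$ for a few step counts. The three values of the product (0.9, 1.0, 1.1) and the helper name `bound_factor` are made up for illustration and do not come from any real network.

```python
# Illustrative only: behaviour of the factor (beta_f * beta_W) ** (t - k)
# as the number of time steps t - k grows. The beta values are made up.
def bound_factor(beta: float, steps: int) -> float:
    """Return beta ** steps, the factor that multiplies ||delta_t|| in the bound."""
    return beta ** steps


for beta in (0.9, 1.0, 1.1):
    factors = [bound_factor(beta, n) for n in (1, 10, 100)]
    print(beta, ["%.3g" % f for f in factors])
```

Even a product only 10% away from 1 changes the bound by several orders of magnitude after 100 steps, which is the effect the paragraph above describes.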

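What does a vanishing gradient actually mean? In 【循环神经网络1】一文搞定RNN入门与详解 (CSDN blog) we showed that the final gradient of the weight matrix $\mathbf{W}$ is the sum of its gradients at each time step, i.e.: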

$\nabla_{\mathbf{W}} E = \sum_{k=1}^{t} \nabla_{\mathbf{W}_k} E$

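Suppose that from time t-3 backward the gradient has already shrunk to almost 0. Then, going further back in time, the gradients obtained (almost zero) contribute nothing to the final gradient. In other words, whatever the network state h was before time t-3, it has no effect on the update of the weight matrix $\mathbf{W}$ during training, so the network effectively ignores the states before time t-3. This is why the vanilla RNN cannot handle long-range dependencies.

Once the cause of a problem has been found, it can be solved. From pinpointing the problem to solving it took researchers about seven or eight years. Eventually Hochreiter and Schmidhuber invented the long short-term memory network, which addressed this difficulty in one stroke.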

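2 Structure and Concepts of LSTM

The idea behind the long short-term memory network is actually fairly simple. The hidden layer of a vanilla RNN has only one state, $h$, which is very sensitive to short-term input. So what if we add another state, $c$, and let it store the long-term state? As shown in the figure below:

The newly added state $c$ is called the cell state. Unrolling the figure above along the time dimension gives:

The figure above is only a schematic. We can see that at time $t$ the LSTM has three inputs: the current network input $\mathbf{x}_{t}$, the previous LSTM output $\mathbf{h}_{t-1}$, and the previous cell state $\mathbf{c}_{t-1}$. It has two outputs: the current LSTM output $\mathbf{h}_{t}$ and the current cell state $\mathbf{c}_{t}$. Note that $\mathbf{x},\mathbf{h},\mathbf{c}$ are all vectors.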

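The key to LSTM is how to control the long-term state $c$. The idea here is to use three control switches:

The first switch controls whether to keep preserving the long-term state $c$;
The second switch controls whether to write the current (instantaneous) state into the long-term state $c$;
The third switch controls whether to use the long-term state $c$ as the current LSTM output.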

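3 Forward Computation of LSTM

3.1 Implementing the Switches (Gates)

How are the switches described above realized in the algorithm? This is where the concept of a gate comes in. A gate is in fact a fully connected layer: its input is a vector and its output is a vector of real numbers between 0 and 1. Let $\mathbf{W}$ be the gate's weight matrix and $\mathbf{b}$ its bias; then the gate can be written as: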

$g(\mathbf{x}) = \sigma(\mathbf{Wx} + \mathbf{b})$

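A gate is applied by multiplying its output vector elementwise with the vector we want to control. Because the gate's output is a vector of real numbers between 0 and 1, when an output entry is 0, whatever it multiplies becomes 0, which means nothing passes through; when an output entry is 1, whatever it multiplies is left unchanged, which means everything passes through. Since the range of $\sigma$ (the sigmoid function) is the open interval (0,1), a gate is in practice always partly open and partly closed.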
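As a minimal sketch of the gate $g(\mathbf{x}) = \sigma(\mathbf{Wx} + \mathbf{b})$ and of how it scales another vector, consider the NumPy snippet below. The sizes (4 inputs, 3 gate units), the random weights, and the vector `signal` are all hypothetical values chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gate(W, b, x):
    """g(x) = sigmoid(W x + b): one fully connected layer with sigmoid output."""
    return sigmoid(W @ x + b)


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))            # hypothetical sizes: 4 inputs, 3 gate units
b = np.zeros(3)
x = rng.normal(size=4)
signal = np.array([2.0, -1.0, 0.5])    # the vector being controlled

g = gate(W, b, x)
gated = g * signal                     # elementwise product: each entry scaled by its gate value
print(g, gated)
```

Each entry of `gated` has the same sign as the corresponding entry of `signal` but a smaller magnitude, which is the "partly open" behaviour described above.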

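3.2 The Three Gates

LSTM uses two gates to control the contents of the cell state $c$:
- One is the forget gate, which determines how much of the previous cell state $\mathbf{c}_{t-1}$ is kept in the current cell state $\mathbf{c}_t$
- The other is the input gate, which determines how much of the current network input $\mathbf{x}_t$ is stored into the cell state $\mathbf{c}_t$
LSTM uses the output gate to control how much of the cell state $\mathbf{c}_t$ is emitted as the current LSTM output $\mathbf{h}_t$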

3.2.1 Forget Gate

$\mathbf{f}_t = \sigma(\mathbf{W}_f \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f) \quad (1)$

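$\mathbf{W}_f$ is the weight matrix of the forget gate
$[\mathbf{h}_{t-1}, \mathbf{x}_t]$ denotes the concatenation of the two vectors into one longer vector
$\mathbf{b}_f$ is the bias of the forget gate
$\sigma$ is the sigmoid function

If the input has dimension $d_x$, the hidden layer has dimension $d_h$, and the cell state has dimension $d_c$ (usually $d_c = d_h$), then the forget gate's weight matrix $\mathbf{W}_f$ has dimension $d_c \times (d_h + d_x)$. In fact, $\mathbf{W}_f$ is the concatenation of two matrices: $\mathbf{W}_{fh}$, which corresponds to the input $\mathbf{h}_{t-1}$ and has dimension $d_c \times d_h$, and $\mathbf{W}_{fx}$, which corresponds to the input $\mathbf{x}_t$ and has dimension $d_c \times d_x$. Written as a block matrix product: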

$\begin{bmatrix} \mathbf{W}_f \end{bmatrix} \begin{bmatrix} \mathbf{h}_{t-1} \\ \mathbf{x}_t \end{bmatrix} = \begin{bmatrix} \mathbf{W}_{fh} & \mathbf{W}_{fx} \end{bmatrix} \begin{bmatrix} \mathbf{h}_{t-1} \\ \mathbf{x}_t \end{bmatrix} = \mathbf{W}_{fh} \mathbf{h}_{t-1} + \mathbf{W}_{fx} \mathbf{x}_t$
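The block identity above can be checked numerically. The sketch below uses hypothetical sizes ($d_x = 3$, $d_h = d_c = 2$) and random matrices; it only confirms that multiplying the concatenated matrix by the concatenated vector equals the sum of the two separate products.

```python
import numpy as np

d_x, d_h = 3, 2                          # hypothetical input and hidden sizes (d_c = d_h)
rng = np.random.default_rng(1)
W_fh = rng.normal(size=(d_h, d_h))       # acts on h_{t-1}
W_fx = rng.normal(size=(d_h, d_x))       # acts on x_t
W_f = np.hstack([W_fh, W_fx])            # shape d_c x (d_h + d_x)

h_prev = rng.normal(size=d_h)
x_t = rng.normal(size=d_x)
concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]

lhs = W_f @ concat
rhs = W_fh @ h_prev + W_fx @ x_t
print(np.allclose(lhs, rhs))
```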

The forget gate is illustrated below:

3.2.2 Input Gate

$\mathbf{i}_t = \sigma(\mathbf{W}_i \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i) \quad (2)$

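In the formula above, $\mathbf{W}_i$ is the weight matrix of the input gate and $\mathbf{b}_i$ is its bias. The figure below shows the computation of the input gate:

Next, we compute the candidate cell state $\mathbf{\tilde{c}}_t$ that describes the current input; it is computed from the previous output and the current input: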

$\mathbf{\tilde{c}}_t = \tanh(\mathbf{W}_c \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c) \quad (3)$

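The figure below shows the computation of $\mathbf{\tilde{c}}_t$:

Now we compute the cell state $\mathbf{c}_t$ at the current time. It is obtained by multiplying the previous cell state $\mathbf{c}_{t-1}$ elementwise by the forget gate $\mathbf{f}_t$, multiplying the candidate cell state $\mathbf{\tilde{c}}_t$ elementwise by the input gate $\mathbf{i}_t$, and adding the two products: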

$\mathbf{c}_t = \mathbf{f}_t \circ \mathbf{c}_{t-1} + \mathbf{i}_t \circ \mathbf{\tilde{c}}_t \quad (4)$

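The symbol $\circ$ denotes elementwise multiplication. The figure below shows the computation of $\mathbf{c}_t$:

In this way, we combine the LSTM's current memory $\mathbf{\tilde{c}}_t$ with its long-term memory $\mathbf{c}_{t-1}$ to form the new cell state $\mathbf{c}_t$. Thanks to the forget gate, it can retain information from long ago; thanks to the input gate, it can keep currently irrelevant content out of memory.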
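A tiny worked example of Equation 4 may make the roles of the two gates concrete. The gate values below are idealised (exactly 0, 1, or 0.5), which a real sigmoid gate only approaches; all numbers are invented for illustration.

```python
import numpy as np

c_prev = np.array([1.0, -2.0, 0.5])     # c_{t-1}, the long-term memory
c_tilde = np.array([0.8, 0.3, -0.9])    # candidate state, values in (-1, 1)
f_t = np.array([1.0, 0.0, 0.5])         # idealised forget gate values
i_t = np.array([0.0, 1.0, 0.5])         # idealised input gate values

c_t = f_t * c_prev + i_t * c_tilde      # Eq. (4)
print(c_t)
```

The first entry keeps the old memory untouched, the second discards it and takes the new candidate, and the third blends the two half and half.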

3.2.3 Output Gate

The output gate controls the influence of long-term memory on the current output:

$\mathbf{o}_t = \sigma(\mathbf{W}_o \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o) \quad (5)$

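The figure below shows the computation of the output gate:

The final output of the LSTM is determined jointly by the output gate and the cell state: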

$\mathbf{h}_t = \mathbf{o}_t \circ \tanh(\mathbf{c}_t) \quad (6)$

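The figure below shows the computation of the final LSTM output:

Equations 1 through 6 are all the formulas of the LSTM forward computation.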
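Putting Equations 1 to 6 together, a single forward step could be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `lstm_step` and `init_params`, the parameter layout (a dict mapping each of 'f', 'i', 'c', 'o' to a weight/bias pair), the small random initialization, and the toy sequence are all choices made here for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step following Eqs. (1)-(6).

    params maps each of 'f', 'i', 'c', 'o' to a (W, b) pair, where W has
    shape d_h x (d_h + d_x) and acts on the concatenation [h_{t-1}, x_t].
    """
    hx = np.concatenate([h_prev, x_t])
    W_f, b_f = params["f"]
    W_i, b_i = params["i"]
    W_c, b_c = params["c"]
    W_o, b_o = params["o"]
    f_t = sigmoid(W_f @ hx + b_f)          # Eq. (1) forget gate
    i_t = sigmoid(W_i @ hx + b_i)          # Eq. (2) input gate
    c_tilde = np.tanh(W_c @ hx + b_c)      # Eq. (3) candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde     # Eq. (4) new cell state
    o_t = sigmoid(W_o @ hx + b_o)          # Eq. (5) output gate
    h_t = o_t * np.tanh(c_t)               # Eq. (6) output
    return h_t, c_t


def init_params(d_x, d_h, seed=0):
    """Hypothetical initialization: small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    return {k: (rng.normal(scale=0.1, size=(d_h, d_h + d_x)), np.zeros(d_h))
            for k in "fico"}


d_x, d_h = 3, 4
params = init_params(d_x, d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
xs = np.random.default_rng(1).normal(size=(5, d_x))   # a toy sequence of 5 inputs
for x_t in xs:
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

Note that `h` and `c` are threaded from one step to the next, which matches the three-inputs, two-outputs picture from Section 2.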

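4 Training LSTM

Warning: heavy math ahead!

4.1 Framework of the LSTM Training Algorithm

LSTM is still trained with the backpropagation algorithm, which consists of three main steps:

First, compute the output of every neuron in the forward direction. For LSTM these are the five vectors $\mathbf{f}_t, \mathbf{i}_t, \mathbf{c}_t, \mathbf{o}_t, \mathbf{h}_t$. The computation was described in the previous section.
Second, compute the error term $\delta$ of every neuron in the backward direction. Backpropagation of the LSTM error term also goes in two directions:
- one is backpropagation through time, i.e. starting from the current time, compute the error term at each earlier time step
- the other is propagating the error term to the previous layer
Third, compute the gradient of each weight from the corresponding error terms.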

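4.2 Formulas and Notation

In the derivation that follows, we take the sigmoid function as the activation of the gates and the tanh function as the activation of the output: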

$\sigma(z) = y = \frac{1}{1 + e^{-z}} \\ \sigma'(z) = y(1 - y) \\ \tanh(z) = y = \frac{e^z - e^{-z}}{e^z + e^{-z}} \\ \tanh'(z) = 1 - y^2$
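The two derivative identities above, $\sigma'(z) = y(1-y)$ and $\tanh'(z) = 1 - y^2$, can be sanity-checked with central finite differences. The grid of test points and the step size `eps` below are arbitrary choices for this sketch.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


z = np.linspace(-3.0, 3.0, 13)
eps = 1e-6

# Central finite differences approximate the true derivatives.
num_dsigmoid = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
num_dtanh = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)

# Closed forms expressed through the function value y, as in the text.
y_s = sigmoid(z)
y_t = np.tanh(z)
ana_dsigmoid = y_s * (1 - y_s)
ana_dtanh = 1 - y_t ** 2

print(np.max(np.abs(num_dsigmoid - ana_dsigmoid)),
      np.max(np.abs(num_dtanh - ana_dtanh)))
```

Expressing the derivative through the output value $y$ is convenient in backpropagation, because $y$ is already available from the forward pass.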

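Activation functions are explained in detail in the following article:

【前馈神经网络详解与实例】2——激活函数_神经网络激活函数例题-CSDN博客 https://blog.csdn.net/colus_SEU/article/details/150534855?spm=1001.2014.3001.5501

LSTM has 8 groups of parameters to learn: the forget gate's weight matrix $\mathbf{W}_f$ and bias $\mathbf{b}_f$, the input gate's weight matrix $\mathbf{W}_i$ and bias $\mathbf{b}_i$, the output gate's weight matrix $\mathbf{W}_o$ and bias $\mathbf{b}_o$, and the weight matrix $\mathbf{W}_c$ and bias $\mathbf{b}_c$ used to compute the cell state. Because the two halves of each weight matrix use different formulas in backpropagation, in the derivation below the weight matrices $\mathbf{W}_f,\mathbf{W}_i,\mathbf{W}_c,\mathbf{W}_o$ are each written as two separate matrices: $\mathbf{W}_{fh},\mathbf{W}_{fx},\mathbf{W}_{ih},\mathbf{W}_{ix},\mathbf{W}_{oh},\mathbf{W}_{ox},\mathbf{W}_{ch},\mathbf{W}_{cx}$.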

Let us explain the elementwise multiplication symbol $\circ$.

When $\circ$ acts on two vectors, the operation is:

$\mathbf{a} \circ \mathbf{b} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \cdots \\ a_n \end{bmatrix} \circ \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ \cdots \\ b_n \end{bmatrix} = \begin{bmatrix} a_1b_1 \\ a_2b_2 \\ a_3b_3 \\ \cdots \\ a_nb_n \end{bmatrix}$

When $\circ$ acts on a vector and a matrix, the operation is:

$\mathbf{a} \circ \mathbf{X} = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \cdots \\ a_n \end{bmatrix} \circ \begin{bmatrix} x_{11} & x_{12} & x_{13} & \cdots & x_{1n} \\ x_{21} & x_{22} & x_{23} & \cdots & x_{2n} \\ x_{31} & x_{32} & x_{33} & \cdots & x_{3n} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & x_{n2} & x_{n3} & \cdots & x_{nn} \end{bmatrix}$

$= \begin{bmatrix} a_1x_{11} & a_1x_{12} & a_1x_{13} & \cdots & a_1x_{1n} \\ a_2x_{21} & a_2x_{22} & a_2x_{23} & \cdots & a_2x_{2n} \\ a_3x_{31} & a_3x_{32} & a_3x_{33} & \cdots & a_3x_{3n} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ a_nx_{n1} & a_nx_{n2} & a_nx_{n3} & \cdots & a_nx_{nn} \end{bmatrix}$

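When $\circ$ acts on two matrices, the entries at corresponding positions are multiplied.

Elementwise multiplication can simplify matrix and vector operations in some cases. For example, multiplying a matrix on the left by a diagonal matrix is equivalent to multiplying that matrix elementwise by the vector formed from the diagonal: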

$\text{diag}[\mathbf{a}] \mathbf{X} = \mathbf{a} \circ \mathbf{X}$

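Multiplying a row vector on the right by a diagonal matrix is equivalent to multiplying that row vector elementwise by the vector formed from the diagonal of the matrix (the result is read as a row vector):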

$\mathbf{a}^T \text{diag}[\mathbf{b}] = \mathbf{a} \circ \mathbf{b}$

These two facts will be used repeatedly in the derivation that follows.
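Both identities are easy to confirm numerically. In the sketch below the vectors and the matrix are arbitrary example values, and the vector-matrix form of $\circ$ is expressed with NumPy broadcasting (`a[:, None] * X`), which scales row $k$ of $\mathbf{X}$ by $a_k$.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
X = np.arange(9.0).reshape(3, 3)

# diag[a] X scales row k of X by a_k; with broadcasting this is a[:, None] * X.
left = np.diag(a) @ X
left_elementwise = a[:, None] * X

# a^T diag[b] equals the elementwise product of a and b (read as a row vector).
right = a @ np.diag(b)
right_elementwise = a * b

print(np.allclose(left, left_elementwise), np.allclose(right, right_elementwise))
```

In practice this is why the diagonal Jacobians in the derivation never need to be materialised as full matrices.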

At time t, the output of the LSTM is $\mathbf{h}_t$. We define the error term $\delta_t$ at time t as:

$\delta_t \overset{def}{=} \frac{\partial E}{\partial \mathbf{h}_t}$

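Note that here we take the error term to be the derivative of the loss function with respect to the output $\mathbf{h}_t$, rather than with respect to a weighted input $\mathbf{z}_t$. This is because LSTM has four weighted inputs, corresponding to $\mathbf{f}_t,\mathbf{i}_t,\mathbf{c}_t,\mathbf{o}_t$, and we want to pass a single error term to the previous layer rather than four. We still need to define these four weighted inputs and their corresponding error terms: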

$\mathbf{z}_{f,t} = \mathbf{W}_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f = \mathbf{W}_{fh} \mathbf{h}_{t-1} + \mathbf{W}_{fx} \mathbf{x}_t + \mathbf{b}_f$

$\mathbf{z}_{i,t} = \mathbf{W}_i [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i = \mathbf{W}_{ih} \mathbf{h}_{t-1} + \mathbf{W}_{ix} \mathbf{x}_t + \mathbf{b}_i$

$\mathbf{z}_{\tilde{c},t} = \mathbf{W}_c [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c = \mathbf{W}_{ch} \mathbf{h}_{t-1} + \mathbf{W}_{cx} \mathbf{x}_t + \mathbf{b}_c$

$\mathbf{z}_{o,t} = \mathbf{W}_o [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o = \mathbf{W}_{oh} \mathbf{h}_{t-1} + \mathbf{W}_{ox} \mathbf{x}_t + \mathbf{b}_o$

$\delta_{f,t} \overset{\text{def}}{=} \frac{\partial E}{\partial \mathbf{z}_{f,t}}$

$\delta_{i,t} \overset{\text{def}}{=} \frac{\partial E}{\partial \mathbf{z}_{i,t}}$

$\delta_{\tilde{c},t} \overset{\text{def}}{=} \frac{\partial E}{\partial \mathbf{z}_{\tilde{c},t}}$

$\delta_{o,t} \overset{\text{def}}{=} \frac{\partial E}{\partial \mathbf{z}_{o,t}}$

4.3 Backpropagating the Error Term Through Time

Backpropagating the error term through time means computing the error term $\delta_{t-1}$ at time t-1.

$\delta_{t-1}^\top = \frac{\partial E}{\partial \mathbf{h}_{t-1}} = \frac{\partial E}{\partial \mathbf{h}_t} \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} = \delta_t^\top \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}}$

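We know that $\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}}$ is a Jacobian matrix. If the hidden layer $h$ has dimension $N$, it is an $N \times N$ matrix, which is why the error terms in the formula above are written as transposed (row) vectors. Here $\mathbf{h}_t$ is computed as follows, i.e. Equations 6 and 4 above: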

$\mathbf{h}_t = \mathbf{o}_t \circ \tanh(\mathbf{c}_t) \\ \mathbf{c}_t = \mathbf{f}_t \circ \mathbf{c}_{t-1} + \mathbf{i}_t \circ \tilde{\mathbf{c}}_t$

Since $\mathbf{o}_t,\mathbf{f}_t,\mathbf{i}_t,\tilde{\mathbf{c}}_t$ are all functions of $\mathbf{h}_{t-1}$, the total derivative formula gives:

$\delta_t^\top \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}} = \delta_t^\top \frac{\partial \mathbf{h}_t}{\partial \mathbf{o}_t} \frac{\partial \mathbf{o}_t}{\partial \mathbf{z}_{o,t}} \frac{\partial \mathbf{z}_{o,t}}{\partial \mathbf{h}_{t-1}} + \delta_t^\top \frac{\partial \mathbf{h}_t}{\partial \mathbf{c}_t} \frac{\partial \mathbf{c}_t}{\partial \mathbf{f}_t} \frac{\partial \mathbf{f}_t}{\partial \mathbf{z}_{f,t}} \frac{\partial \mathbf{z}_{f,t}}{\partial \mathbf{h}_{t-1}} + \delta_t^\top \frac{\partial \mathbf{h}_t}{\partial \mathbf{c}_t} \frac{\partial \mathbf{c}_t}{\partial \mathbf{i}_t} \frac{\partial \mathbf{i}_t}{\partial \mathbf{z}_{i,t}} \frac{\partial \mathbf{z}_{i,t}}{\partial \mathbf{h}_{t-1}} + \delta_t^\top \frac{\partial \mathbf{h}_t}{\partial \mathbf{c}_t} \frac{\partial \mathbf{c}_t}{\partial \tilde{\mathbf{c}}_t} \frac{\partial \tilde{\mathbf{c}}_t}{\partial \mathbf{z}_{\tilde{\mathbf{c}},t}} \frac{\partial \mathbf{z}_{\tilde{\mathbf{c}},t}}{\partial \mathbf{h}_{t-1}} \\ = \delta_{o,t}^\top \frac{\partial {\mathbf{z}_{o,t}}}{\partial \mathbf{h}_{t-1}} + \delta_{f,t}^\top \frac{\partial \mathbf{z}_{f,t}}{\partial \mathbf{h}_{t-1}} + \delta_{i,t}^\top \frac{\partial \mathbf{z}_{i,t}}{\partial \mathbf{h}_{t-1}} + \delta_{\tilde{c},t}^\top \frac{\partial {\mathbf{z}}_{\tilde{c},t}}{\partial \mathbf{h}_{t-1}} \quad (7)$

Next we work out every partial derivative in Equation 7. From Equation 6 we obtain:

$\frac{\partial \mathbf{h}_t}{\partial \mathbf{o}_t} = \text{diag}[\tanh(\mathbf{c}_t)]$

$\frac{\partial \mathbf{h}_t}{\partial \mathbf{c}_t} = \text{diag}[\mathbf{o}_t \circ (1 - \tanh(\mathbf{c}_t)^2)]$

From Equation 4 we obtain:

$\frac{\partial \mathbf{c}_t}{\partial \mathbf{f}_t} = \text{diag}[\mathbf{c}_{t-1}]$

$\frac{\partial \mathbf{c}_t}{\partial \tilde{\mathbf{c}}_t} = \text{diag}[\mathbf{i}_t]$

$\frac{\partial \mathbf{c}_t}{\partial \mathbf{i}_t} = \text{diag}[\tilde{\mathbf{c}}_t]$

Since:

$\mathbf{o}_t = \sigma(\mathbf{z}_{o,t}) \\ \mathbf{z}_{o,t} = \mathbf{W}_{oh} \mathbf{h}_{t-1} + \mathbf{W}_{ox} \mathbf{x}_t + \mathbf{b}_o$

$\mathbf{f}_t = \sigma(\mathbf{z}_{f,t}) \\ \mathbf{z}_{f,t} = \mathbf{W}_{fh} \mathbf{h}_{t-1} + \mathbf{W}_{fx} \mathbf{x}_t + \mathbf{b}_f$

$\mathbf{i}_t = \sigma(\mathbf{z}_{i,t}) \\ \mathbf{z}_{i,t} = \mathbf{W}_{ih} \mathbf{h}_{t-1} + \mathbf{W}_{ix} \mathbf{x}_t + \mathbf{b}_i$

$\mathbf{\tilde{c}}_t = \tanh(\mathbf{z}_{\tilde{c},t}) \\ \mathbf{z}_{\tilde{c},t} = \mathbf{W}_{ch} \mathbf{h}_{t-1} + \mathbf{W}_{cx} \mathbf{x}_t + \mathbf{b}_c$

it is easy to derive:

$\frac{\partial \mathbf{o}_t}{\partial \mathbf{z}_{o,t}} = \text{diag}[\mathbf{o}_t \circ (1 - \mathbf{o}_t)] \\ \frac{\partial \mathbf{z}_{o,t}}{\partial \mathbf{h}_{t-1}} = \mathbf{W}_{oh}$

$\frac{\partial \mathbf{f}_t}{\partial \mathbf{z}_{f,t}} = \text{diag}[\mathbf{f}_t \circ (1 - \mathbf{f}_t)] \\ \frac{\partial \mathbf{z}_{f,t}}{\partial \mathbf{h}_{t-1}} = \mathbf{W}_{fh}$

$\frac{\partial \mathbf{i}_t}{\partial \mathbf{z}_{i,t}} = \text{diag}[\mathbf{i}_t \circ (1 - \mathbf{i}_t)] \\ \frac{\partial \mathbf{z}_{i,t}}{\partial \mathbf{h}_{t-1}} = \mathbf{W}_{ih}$

$\frac{\partial \mathbf{\tilde{c}}_t}{\partial \mathbf{z}_{\tilde{c},t}} = \text{diag}[1 - \mathbf{\tilde{c}}_t^2] \\ \frac{\partial \mathbf{z}_{\tilde{c},t}}{\partial \mathbf{h}_{t-1}} = \mathbf{W}_{ch}$

Substituting these partial derivatives into Equation 7, we get:

$\delta_{t-1}^\top = \delta_{o,t}^\top \mathbf{W}_{oh} + \delta_{f,t}^\top \mathbf{W}_{fh} + \delta_{i,t}^\top \mathbf{W}_{ih} + \delta_{\tilde{c},t}^\top \mathbf{W}_{ch} \quad (8)$

From the definitions of $\delta_{o,t},\delta_{f,t},\delta_{i,t},\delta_{\tilde{c},t}$, we have:

$\delta_{o,t}^\top = \frac{\partial{E}}{\partial \mathbf{h}_{t}} \frac{\partial{\mathbf{h}_{t}}}{\partial \mathbf{o}_t} \frac{\partial{\mathbf{o}_t}}{\partial \mathbf{z}_{o,t}} = \delta_t^\top \circ \tanh(\mathbf{c}_t) \circ \mathbf{o}_t \circ (1 - \mathbf{o}_t) \quad (9)$

$\delta_{f,t}^\top = \frac{\partial{E}}{\partial \mathbf{h}_{t}} \frac{\partial{\mathbf{h}_{t}}}{\partial \mathbf{c}_t} \frac{\partial{\mathbf{c}_{t}}}{\partial \mathbf{f}_t} \frac{\partial{\mathbf{f}_t}}{\partial \mathbf{z}_{f,t}} =\delta_t^\top \circ \mathbf{o}_t \circ (1 - \tanh(\mathbf{c}_t)^2) \circ \mathbf{c}_{t-1} \circ \mathbf{f}_t \circ (1 - \mathbf{f}_t) \quad (10)$

$\delta_{i,t}^\top = \frac{\partial{E}}{\partial \mathbf{h}_{t}} \frac{\partial{\mathbf{h}_{t}}}{\partial \mathbf{c}_t} \frac{\partial{\mathbf{c}_{t}}}{\partial \mathbf{i}_t} \frac{\partial{\mathbf{i}_t}}{\partial \mathbf{z}_{i,t}} =\delta_t^\top \circ \mathbf{o}_t \circ (1 - \tanh(\mathbf{c}_t)^2) \circ \mathbf{\tilde{c}}_t \circ \mathbf{i}_t \circ (1 - \mathbf{i}_t) \quad (11)$

$\delta_{\tilde{c},t}^\top = \frac{\partial{E}}{\partial \mathbf{h}_{t}} \frac{\partial{\mathbf{h}_{t}}}{\partial \mathbf{c}_t} \frac{\partial{\mathbf{c}_{t}}}{\partial \mathbf{\tilde{c}}_t} \frac{\partial{\mathbf{\tilde{c}}_t}}{\partial \mathbf{z}_{\tilde{c},t}} =\delta_t^\top \circ \mathbf{o}_t \circ (1 - \tanh(\mathbf{c}_t)^2) \circ \mathbf{i}_t \circ (1 - \mathbf{\tilde{c}}_t^2) \quad (12)$

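Equations 8 through 12 propagate the error back through time by one step. Applying them repeatedly gives the error term at any earlier time k: starting from $\delta_t$, for $j = t, t-1, \ldots, k+1$, compute $\delta_{o,j},\delta_{f,j},\delta_{i,j},\delta_{\tilde{c},j}$ from $\delta_j$ with Equations 9 to 12, and then

$\delta_{j-1}^\top = \delta_{o,j}^\top \mathbf{W}_{oh} + \delta_{f,j}^\top \mathbf{W}_{fh} + \delta_{i,j}^\top \mathbf{W}_{ih} + \delta_{\tilde{c},j}^\top \mathbf{W}_{ch} \quad (13)$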
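The one-step formulas (Equations 8 to 12) can be checked against a numerical gradient. The sketch below makes several assumptions purely for the test: a toy loss $E = \mathbf{r}^\top \mathbf{h}_t$ (so that $\delta_t = \mathbf{r}$), random parameters stored in a dict `P` with hypothetical key names, and $\mathbf{c}_{t-1}$ treated as a fixed input, exactly as in the derivation above. The candidate state $\tilde{\mathbf{c}}_t$ is called `g` in the code.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(h_prev, x_t, c_prev, P):
    """Eqs. (1)-(6) with split weights; returns all intermediate values."""
    f = sigmoid(P["W_fh"] @ h_prev + P["W_fx"] @ x_t + P["b_f"])
    i = sigmoid(P["W_ih"] @ h_prev + P["W_ix"] @ x_t + P["b_i"])
    g = np.tanh(P["W_ch"] @ h_prev + P["W_cx"] @ x_t + P["b_c"])   # c-tilde
    o = sigmoid(P["W_oh"] @ h_prev + P["W_ox"] @ x_t + P["b_o"])
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return dict(f=f, i=i, g=g, o=o, c=c, h=h)


def gate_deltas(delta_t, s, c_prev):
    """Eqs. (9)-(12): error terms of the four weighted inputs."""
    tanh_c = np.tanh(s["c"])
    d_c = delta_t * s["o"] * (1 - tanh_c ** 2)       # delta_t^T dh_t/dc_t
    return {
        "o": delta_t * tanh_c * s["o"] * (1 - s["o"]),
        "f": d_c * c_prev * s["f"] * (1 - s["f"]),
        "i": d_c * s["g"] * s["i"] * (1 - s["i"]),
        "c": d_c * s["i"] * (1 - s["g"] ** 2),
    }


def delta_prev(d, P):
    """Eq. (8): delta_{t-1}^T is the sum over gates of delta_gate^T W_gate,h."""
    return sum(d[k] @ P[f"W_{k}h"] for k in "ofic")


rng = np.random.default_rng(0)
n, m = 4, 3                                          # hypothetical hidden and input sizes
P = {}
for k in "fico":
    P[f"W_{k}h"] = rng.normal(scale=0.5, size=(n, n))
    P[f"W_{k}x"] = rng.normal(scale=0.5, size=(n, m))
    P[f"b_{k}"] = rng.normal(scale=0.1, size=n)
h_prev, c_prev, x_t = rng.normal(size=n), rng.normal(size=n), rng.normal(size=m)
r = rng.normal(size=n)        # toy loss E = r . h_t, so delta_t = dE/dh_t = r

s = forward(h_prev, x_t, c_prev, P)
analytic = delta_prev(gate_deltas(r, s, c_prev), P)

# Central finite differences of E with respect to each entry of h_{t-1}.
eps = 1e-6
numeric = np.zeros(n)
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    e_plus = r @ forward(h_prev + e, x_t, c_prev, P)["h"]
    e_minus = r @ forward(h_prev - e, x_t, c_prev, P)["h"]
    numeric[j] = (e_plus - e_minus) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

To go back more than one step, the same two functions would be called again with `analytic` as the new `delta_t` and the stored intermediates of the earlier time step.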

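4.4 Passing the Error Term to the Previous Layer

Suppose the current layer is layer $l$. We define the error term of layer $l-1$ as the derivative of the error function with respect to the weighted input of layer $l-1$, i.e.: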

$\delta_t^{l-1} \overset{\text{def}}{=} \frac{\partial E}{\partial \mathbf{z}_t^{l-1}}$

The LSTM input $\mathbf{x}_t^l$ is computed by:

$\mathbf{x}_t^l = f^{l-1}(\mathbf{z}_t^{l-1})$

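In the formula above, $f^{l-1}$ denotes the activation function of layer $l-1$.

Because $\mathbf{z}_{f,t}^l,\mathbf{z}_{i,t}^l,\mathbf{z}_{c,t}^l,\mathbf{z}_{o,t}^l$ are all functions of $\mathbf{x}_t$, and $\mathbf{x}_t$ is in turn a function of $\mathbf{z}_t^{l-1}$, computing the derivative of $E$ with respect to $\mathbf{z}_t^{l-1}$ requires the total derivative formula: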

$\frac{\partial E}{\partial \mathbf{z}_t^{l-1}} = \frac{\partial E}{\partial \mathbf{z}_{f,t}^l} \frac{\partial \mathbf{z}_{f,t}^l}{\partial \mathbf{x}_t^l} \frac{\partial \mathbf{x}_t^l}{\partial \mathbf{z}_t^{l-1}} + \frac{\partial E}{\partial \mathbf{z}_{i,t}^l} \frac{\partial \mathbf{z}_{i,t}^l}{\partial \mathbf{x}_t^l} \frac{\partial \mathbf{x}_t^l}{\partial \mathbf{z}_t^{l-1}} + \frac{\partial E}{\partial \mathbf{z}_{c,t}^l} \frac{\partial \mathbf{z}_{c,t}^l}{\partial \mathbf{x}_t^l} \frac{\partial \mathbf{x}_t^l}{\partial \mathbf{z}_t^{l-1}} + \frac{\partial E}{\partial \mathbf{z}_{o,t}^l} \frac{\partial \mathbf{z}_{o,t}^l}{\partial \mathbf{x}_t^l} \frac{\partial \mathbf{x}_t^l}{\partial \mathbf{z}_t^{l-1}} \\ = \delta_{f,t}^T \mathbf{W}_{fx} \circ f'(\mathbf{z}_t^{l-1}) + \delta_{i,t}^T \mathbf{W}_{ix} \circ f'(\mathbf{z}_t^{l-1}) + \delta_{\tilde{c},t}^T \mathbf{W}_{cx} \circ f'(\mathbf{z}_t^{l-1}) + \delta_{o,t}^T \mathbf{W}_{ox} \circ f'(\mathbf{z}_t^{l-1}) \\ = (\delta_{f,t}^T \mathbf{W}_{fx} + \delta_{i,t}^T \mathbf{W}_{ix} + \delta_{\tilde{c},t}^T \mathbf{W}_{cx} + \delta_{o,t}^T \mathbf{W}_{ox}) \circ f'(\mathbf{z}_t^{l-1}) \quad (14)$

Equation 14 is the formula for passing the error to the previous layer.
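As a rough sketch of Equation 14, the function below sums the four back-projected gate errors and then multiplies elementwise by $f'(\mathbf{z}_t^{l-1})$. The helper names `delta_lower` and `tanh_prime`, the choice of tanh as the lower layer's activation, and the random stand-in values for the gate error terms are all assumptions made for this illustration.

```python
import numpy as np


def delta_lower(deltas, W_x, z_lower, f_prime):
    """Eq. (14): (sum over gates of delta_gate^T W_gate,x) times f'(z^{l-1}) elementwise."""
    back = sum(deltas[k] @ W_x[k] for k in deltas)
    return back * f_prime(z_lower)


def tanh_prime(z):
    return 1 - np.tanh(z) ** 2          # assuming the lower layer uses tanh


rng = np.random.default_rng(0)
n, m = 4, 3                             # hypothetical hidden and input sizes
deltas = {k: rng.normal(size=n) for k in "fico"}       # stand-ins for Eqs. (9)-(12)
W_x = {k: rng.normal(size=(n, m)) for k in "fico"}     # W_fx, W_ix, W_cx, W_ox
z_lower = rng.normal(size=m)

d_lower = delta_lower(deltas, W_x, z_lower, tanh_prime)
print(d_lower)
```

The result has the dimension of the LSTM input, since it is the error term for the layer that produced $\mathbf{x}_t^l$.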

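4.5 Computing the Weight Gradients

For the weights $\mathbf{W}_{fh},\mathbf{W}_{ih},\mathbf{W}_{ch},\mathbf{W}_{oh}$, we know that the gradient is the sum of the gradients at each time step (for the proof, see Section 4.3 of 【循环神经网络1】一文搞定RNN入门与详解, CSDN blog). We first compute their gradients at time $t$ and then their final gradients.

Having already obtained the error terms $\delta_{o,t},\delta_{f,t},\delta_{i,t},\delta_{\tilde{c},t}$, we can easily compute the gradients of $\mathbf{W}_{oh},\mathbf{W}_{ih},\mathbf{W}_{fh},\mathbf{W}_{ch}$ at time $t$: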

$\frac{\partial E}{\partial \mathbf{W}_{oh,t}} = \frac{\partial E}{\partial \mathbf{z}_{o,t}} \frac{\partial \mathbf{z}_{o,t}}{\partial \mathbf{W}_{oh,t}} = \delta_{o,t} \mathbf{h}_{t-1}^\top$

$\frac{\partial E}{\partial \mathbf{W}_{fh,t}} = \frac{\partial E}{\partial \mathbf{z}_{f,t}} \frac{\partial \mathbf{z}_{f,t}}{\partial \mathbf{W}_{fh,t}} = \delta_{f,t} \mathbf{h}_{t-1}^\top$

$\frac{\partial E}{\partial \mathbf{W}_{ih,t}} = \frac{\partial E}{\partial \mathbf{z}_{i,t}} \frac{\partial \mathbf{z}_{i,t}}{\partial \mathbf{W}_{ih,t}} = \delta_{i,t} \mathbf{h}_{t-1}^\top$

$\frac{\partial E}{\partial \mathbf{W}_{ch,t}} = \frac{\partial E}{\partial \mathbf{z}_{\tilde{c},t}} \frac{\partial \mathbf{z}_{\tilde{c},t}}{\partial \mathbf{W}_{ch,t}} = \delta_{\tilde{c},t} \mathbf{h}_{t-1}^\top$

Adding up the gradients at all time steps gives the final gradient:

$\frac{\partial E}{\partial \mathbf{W}_{oh}} = \sum_{j=1}^{t} \delta_{o,j} \mathbf{h}_{j-1}^\top$

$\frac{\partial E}{\partial \mathbf{W}_{fh}} = \sum_{j=1}^{t} \delta_{f,j} \mathbf{h}_{j-1}^\top$

$\frac{\partial E}{\partial \mathbf{W}_{ih}} = \sum_{j=1}^{t} \delta_{i,j} \mathbf{h}_{j-1}^\top$

$\frac{\partial E}{\partial \mathbf{W}_{ch}} = \sum_{j=1}^{t} \delta_{\tilde{c},j} \mathbf{h}_{j-1}^\top$

The gradients of the biases $\mathbf{b}_f,\mathbf{b}_i,\mathbf{b}_c,\mathbf{b}_o$ are likewise obtained by summing the gradients at all time steps. The bias gradients at each time step are:

$\frac{\partial E}{\partial \mathbf{b}_{o,t}} = \frac{\partial E}{\partial \mathbf{z}_{o,t}} \frac{\partial \mathbf{z}_{o,t}}{\partial \mathbf{b}_{o,t}} = \delta_{o,t}$

$\frac{\partial E}{\partial \mathbf{b}_{f,t}} = \frac{\partial E}{\partial \mathbf{z}_{f,t}} \frac{\partial \mathbf{z}_{f,t}}{\partial \mathbf{b}_{f,t}} = \delta_{f,t}$

$\frac{\partial E}{\partial \mathbf{b}_{i,t}} = \frac{\partial E}{\partial \mathbf{z}_{i,t}} \frac{\partial \mathbf{z}_{i,t}}{\partial \mathbf{b}_{i,t}} = \delta_{i,t}$

$\frac{\partial E}{\partial \mathbf{b}_{c,t}} = \frac{\partial E}{\partial \mathbf{z}_{\tilde{c},t}} \frac{\partial \mathbf{z}_{\tilde{c},t}}{\partial \mathbf{b}_{c,t}} = \delta_{\tilde{c},t}$

The final bias gradients, obtained by adding up the bias gradients at all time steps, are:

$\frac{\partial E}{\partial \mathbf{b}_o} = \sum_{j=1}^{t} \delta_{o,j}$

$\frac{\partial E}{\partial \mathbf{b}_i} = \sum_{j=1}^{t} \delta_{i,j}$

$\frac{\partial E}{\partial \mathbf{b}_f} = \sum_{j=1}^{t} \delta_{f,j}$

$\frac{\partial E}{\partial \mathbf{b}_c} = \sum_{j=1}^{t} \delta_{\tilde{c},j}$

For the weights $\mathbf{W}_{fx},\mathbf{W}_{ix},\mathbf{W}_{cx},\mathbf{W}_{ox}$, the gradient contributed at time $t$ follows directly from the corresponding error terms:

$\frac{\partial E}{\partial \mathbf{W}_{ox}} = \frac{\partial E}{\partial \mathbf{z}_{o,t}} \frac{\partial \mathbf{z}_{o,t}}{\partial \mathbf{W}_{ox}} = \delta_{o,t} \mathbf{x}_{t}^T$

$\frac{\partial E}{\partial \mathbf{W}_{fx}} = \frac{\partial E}{\partial \mathbf{z}_{f,t}} \frac{\partial \mathbf{z}_{f,t}}{\partial \mathbf{W}_{fx}} = \delta_{f,t} \mathbf{x}_{t}^T$

$\frac{\partial E}{\partial \mathbf{W}_{ix}} = \frac{\partial E}{\partial \mathbf{z}_{i,t}} \frac{\partial \mathbf{z}_{i,t}}{\partial \mathbf{W}_{ix}} = \delta_{i,t} \mathbf{x}_{t}^T$

$\frac{\partial E}{\partial \mathbf{W}_{cx}} = \frac{\partial E}{\partial \mathbf{z}_{\tilde{c},t}} \frac{\partial \mathbf{z}_{\tilde{c},t}}{\partial \mathbf{W}_{cx}} = \delta_{\tilde{c},t} \mathbf{x}_{t}^T$
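The gradient formulas in this section are all outer products of an error term with an input vector. The sketch below accumulates them for the output gate only; the sequence length, the sizes, and the random arrays standing in for $\delta_{o,j}$, $\mathbf{h}_j$ and $\mathbf{x}_j$ are hypothetical, and the other three gates would follow the same pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, m = 5, 4, 3                      # hypothetical sequence length and sizes
delta_o = rng.normal(size=(T, n))      # delta_o[j-1] stands in for delta_{o,j}
h = rng.normal(size=(T + 1, n))        # h[j] stands in for h_j, with h[0] = h_0
x = rng.normal(size=(T, m))            # x[j-1] stands in for x_j

grad_W_oh = np.zeros((n, n))
grad_b_o = np.zeros(n)
for j in range(1, T + 1):
    grad_W_oh += np.outer(delta_o[j - 1], h[j - 1])   # delta_{o,j} h_{j-1}^T
    grad_b_o += delta_o[j - 1]                        # delta_{o,j}

# Gradient term for the input weights at the last time step t = T.
grad_W_ox_t = np.outer(delta_o[T - 1], x[T - 1])      # delta_{o,t} x_t^T

print(grad_W_oh.shape, grad_b_o.shape, grad_W_ox_t.shape)
```

The loop over `j` can also be written as a single matrix product, `delta_o.T @ h[:-1]`, which gives the same sum of outer products.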

posted on 2025-09-29 11:55 slgkaifa