# IMPLEMENTING A GRU/LSTM RNN WITH PYTHON AND THEANO - Study Notes

catalogue

0. Introduction
1. LSTM NETWORKS
2. Variants of LSTM
3. GRUs (Gated Recurrent Units)
4. IMPLEMENTING GRUs

0. Introduction

In this post we’ll learn about LSTM (Long Short-Term Memory) networks and GRUs (Gated Recurrent Units). LSTMs were first proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber, and are among the most widely used models in Deep Learning for NLP today. GRUs, first used in 2014, are a simpler variant of LSTMs that share many of the same properties. Let’s start by looking at LSTMs, and then we’ll see how GRUs are different.

0x1: The Problem of Long-Term Dependencies

One of the key appeals of RNNs is the idea that they can connect previous information to the present task, for example using earlier segments of a video to inform the understanding of the current segment. If RNNs could do this reliably, they would be extremely useful. But can they really? The answer is that it depends on many factors.

http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/

1. LSTM NETWORKS

0x1: How the LSTM architecture avoids the vanishing-gradient problem on long-term dependencies

In the previous post we looked at how the vanishing gradient problem prevents standard RNNs from learning long-term dependencies.

LSTMs were designed to combat vanishing gradients through a gating mechanism.  To understand what this means, let’s look at how a LSTM calculates a hidden state $s_t$ (I’m using $\circ$ to mean elementwise multiplication):

\begin{aligned}
i &= \sigma(x_t U^i + s_{t-1} W^i) \\
f &= \sigma(x_t U^f + s_{t-1} W^f) \\
o &= \sigma(x_t U^o + s_{t-1} W^o) \\
g &= \tanh(x_t U^g + s_{t-1} W^g) \\
c_t &= c_{t-1} \circ f + g \circ i \\
s_t &= \tanh(c_t) \circ o
\end{aligned}

These equations look quite complicated, but actually it’s not that hard. First, notice that a LSTM layer is  just another way to compute a hidden state. Previously, we computed the hidden state as $s_t = \tanh(Ux_t + Ws_{t-1})$. The inputs to this unit were $x_t$, the current input at step $t$, and $s_{t-1}$, the previous hidden state.  The output was a new hidden state $s_t$. A LSTM unit does the exact same thing, just in a different way!
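
To make the update rule concrete, here is a minimal NumPy sketch of a single LSTM step following the equations above. The dictionary-based parameter layout, the shapes, and the random initialization are illustrative assumptions, not the tutorial's actual code:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, s_prev, c_prev, U, W):
    # One input matrix U and one recurrent matrix W per gate: i, f, o, g
    i = sigmoid(x_t @ U["i"] + s_prev @ W["i"])  # input gate
    f = sigmoid(x_t @ U["f"] + s_prev @ W["f"])  # forget gate
    o = sigmoid(x_t @ U["o"] + s_prev @ W["o"])  # output gate
    g = np.tanh(x_t @ U["g"] + s_prev @ W["g"])  # candidate memory
    c_t = c_prev * f + g * i  # forget some old memory, add some new
    s_t = np.tanh(c_t) * o    # expose a filtered view of the memory
    return s_t, c_t

# Toy usage with random parameters
rng = np.random.default_rng(0)
dim_x, dim_h = 4, 3
U = {k: rng.normal(size=(dim_x, dim_h)) for k in "ifog"}
W = {k: rng.normal(size=(dim_h, dim_h)) for k in "ifog"}
s_t, c_t = lstm_step(rng.normal(size=dim_x), np.zeros(dim_h), np.zeros(dim_h), U, W)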

LSTMs are deliberately designed to avoid the long-term dependency problem. Remembering information over long periods is, in practice, their default behavior, not something they must struggle to learn. All RNNs have the form of a chain of repeating neural-network modules; in a standard RNN, this repeating module has a very simple structure, such as a single tanh layer. As we saw, the difficulty the cost function has in teaching neurons near the front of the network is, at its core, the vanishing gradient problem: because the partial derivatives are composed by the chain rule, the error signal from the cost C is diluted at every step. LSTMs share this chain structure, but the repeating module is different: instead of a single neural-network layer there are four, interacting in a very particular way.

0x2: The Core Idea Behind LSTMs

LSTMs remove and add information to the cell state through carefully designed structures called "gates" (a gate is a sub-structure inside an LSTM cell). A gate is a way of letting information through selectively: it consists of a sigmoid neural-network layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between 0 and 1, describing how much of each component should be let through: 0 means "let nothing through", while 1 means "let everything through". An LSTM has three gates to protect and control the cell state.
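
Before looking at the individual gates, here is a toy illustration of this selective pass-through (the numbers are made up):

import numpy as np

gate = 1.0 / (1.0 + np.exp(-np.array([-4.0, 0.0, 4.0])))  # sigmoid: ~[0.02, 0.5, 0.98]
signal = np.array([10.0, 10.0, 10.0])
print(gate * signal)  # ~[0.2, 5.0, 9.8]: mostly blocked, half through, mostly through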

1. Forget gate

2. Input gate layer and tanh layer

1. First, a sigmoid layer called the "input gate layer" decides which values we will update
2. Then, a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, to be added to the state

3. Output gate

1. First, we run a sigmoid layer to decide which parts of the cell state to output
2. Then, we put the cell state $C_t$ through tanh (squashing the values to between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to (selective output)

Intuitively, plain RNNs could be considered a special case of LSTMs. If you fix the input gate to all 1’s, the forget gate to all 0’s (you always forget the previous memory) and the output gate to all 1’s (you expose the whole memory), you almost get a standard RNN. There’s just an additional $\tanh$ that squashes the output a bit. The gating mechanism is what allows LSTMs to explicitly model long-term dependencies. By learning the parameters for its gates, the network learns how its memory should behave.
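
Spelling that out with the equations above: fixing $i = 1$, $f = 0$, $o = 1$ collapses the update to

\begin{aligned}
c_t &= g = \tanh(x_t U^g + s_{t-1} W^g) \\
s_t &= \tanh(c_t) = \tanh(\tanh(x_t U^g + s_{t-1} W^g))
\end{aligned}

which is exactly the vanilla update $s_t = \tanh(Ux_t + Ws_{t-1})$ wrapped in one extra $\tanh$.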

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

2. Variants of LSTM

0x1: peephole connection
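
The idea, due to Gers & Schmidhuber, is to let the gate layers also look at the cell state. As a sketch in the notation used above (the elementwise peephole weight vector $p^f$ is one common formulation; details vary between papers):

$f = \sigma(x_t U^f + s_{t-1} W^f + p^f \circ c_{t-1})$

and analogously for the input and output gates.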

0x2: Coupled forget and input gates

1. We only forget when we are about to input something in its place (see the equation sketch below)
2. We only input new values to those states for which we have just forgotten the older information
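
As an equation sketch in the notation of the LSTM section, coupling replaces the separate input gate with $1 - f$, so forgetting and writing become two sides of one decision:

$c_t = f \circ c_{t-1} + (1 - f) \circ g$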

https://arxiv.org/pdf/1503.04069.pdf

3. GRUs (Gated Recurrent Units)

The GRU is in essence yet another LSTM variant, but because it is so widely used it gets its own section here. It merges the forget and input gates into a single "update gate", merges the cell state with the hidden state, and makes a few other changes. The resulting model is simpler than a standard LSTM and has become a very popular variant.

The idea behind a GRU layer is quite similar to that of a LSTM layer, as are the equations.

A GRU has two gates:

1. a reset gate r
2. an update gate z 
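
Following the original post, a GRU computes its hidden state as:

\begin{aligned}
z &= \sigma(x_t U^z + s_{t-1} W^z) \\
r &= \sigma(x_t U^r + s_{t-1} W^r) \\
h &= \tanh(x_t U^h + (s_{t-1} \circ r) W^h) \\
s_t &= (1 - z) \circ h + z \circ s_{t-1}
\end{aligned}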

Intuitively, the reset gate determines how to combine the new input with the previous memory, and the update gate defines how much of the previous memory to keep around. If we set the reset gate to all 1’s and the update gate to all 0’s we again arrive at our plain RNN model (the biggest difference between LSTM-family models and the traditional RNN is precisely this use of gates for selective forgetting and remembering). The basic idea of using a gating mechanism to learn long-term dependencies is the same as in a LSTM, but there are a few key differences:

• A GRU has two gates, an LSTM has three gates.
• GRUs don’t possess an internal memory ($c_t$) that is different from the exposed hidden state. They don’t have the output gate that is present in LSTMs.
• The input and forget gates are coupled by an update gate $z$ and the reset gate $r$ is applied directly to the previous hidden state. Thus, the responsibility of the reset gate in a LSTM is really split up into both $r$ and $z$.
• We don’t apply a second nonlinearity when computing the output.
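
Reusing the sigmoid helper and dictionary parameter layout from the hypothetical LSTM sketch earlier, a GRU step in NumPy would look like this (again an illustrative sketch, not the tutorial's Theano code):

def gru_step(x_t, s_prev, U, W):
    z = sigmoid(x_t @ U["z"] + s_prev @ W["z"])        # update gate
    r = sigmoid(x_t @ U["r"] + s_prev @ W["r"])        # reset gate
    h = np.tanh(x_t @ U["h"] + (s_prev * r) @ W["h"])  # candidate state
    return (1 - z) * h + z * s_prev  # no separate memory, no output gate, no second tanh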

http://www.jianshu.com/p/9dc9f41f0b29

4. IMPLEMENTING GRUs

Remember that a GRU (LSTM) layer is just another way of computing the hidden state. So all we really need to do is change the hidden state computation in our forward propagation function.

def forward_prop_step(x_t, s_t1_prev):
    # This is how we calculated the hidden state in a simple RNN. No longer!
    # s_t = T.tanh(U[:,x_t] + W.dot(s_t1_prev))

    # Get the word vector
    x_e = E[:, x_t]

    # GRU Layer
    # hard_sigmoid is a cheap, piecewise-linear approximation of the sigmoid
    z_t1 = T.nnet.hard_sigmoid(U[0].dot(x_e) + W[0].dot(s_t1_prev) + b[0])  # update gate
    r_t1 = T.nnet.hard_sigmoid(U[1].dot(x_e) + W[1].dot(s_t1_prev) + b[1])  # reset gate
    c_t1 = T.tanh(U[2].dot(x_e) + W[2].dot(s_t1_prev * r_t1) + b[2])        # candidate state
    s_t1 = (T.ones_like(z_t1) - z_t1) * c_t1 + z_t1 * s_t1_prev             # new hidden state

    # Final output calculation
    # Theano's softmax returns a matrix with one row, we only need the row
    o_t = T.nnet.softmax(V.dot(s_t1) + c)[0]

    return [o_t, s_t1]

0x1: code

train.py

#! /usr/bin/env python

import sys
import os
import time
import numpy as np
from utils import *
from datetime import datetime
from gru_theano import GRUTheano

LEARNING_RATE = float(os.environ.get("LEARNING_RATE", "0.001"))
VOCABULARY_SIZE = int(os.environ.get("VOCABULARY_SIZE", "2000"))
EMBEDDING_DIM = int(os.environ.get("EMBEDDING_DIM", "48"))
HIDDEN_DIM = int(os.environ.get("HIDDEN_DIM", "128"))
NEPOCH = int(os.environ.get("NEPOCH", "20"))
MODEL_OUTPUT_FILE = os.environ.get("MODEL_OUTPUT_FILE")
# Path to the training data
INPUT_DATA_FILE = os.environ.get("INPUT_DATA_FILE", "./data/reddit-comments-2015-08.csv")
PRINT_EVERY = int(os.environ.get("PRINT_EVERY", "25000"))

if not MODEL_OUTPUT_FILE:
    ts = datetime.now().strftime("%Y-%m-%d-%H-%M")
    MODEL_OUTPUT_FILE = "GRU-%s-%s-%s-%s.dat" % (ts, VOCABULARY_SIZE, EMBEDDING_DIM, HIDDEN_DIM)

x_train, y_train, word_to_index, index_to_word = load_data(INPUT_DATA_FILE, VOCABULARY_SIZE)

# Build model
model = GRUTheano(VOCABULARY_SIZE, hidden_dim=HIDDEN_DIM, bptt_truncate=-1)

# Print SGD step time
t1 = time.time()
model.sgd_step(x_train[10], y_train[10], LEARNING_RATE)
t2 = time.time()
print "SGD Step time: %f milliseconds" % ((t2 - t1) * 1000.)
sys.stdout.flush()

# We do this every few examples to understand what's going on
def sgd_callback(model, num_examples_seen):
    dt = datetime.now().isoformat()
    loss = model.calculate_loss(x_train[:10000], y_train[:10000])
    print("\n%s (%d)" % (dt, num_examples_seen))
    print("--------------------------------------------------")
    print("Loss: %f" % loss)
    generate_sentences(model, 10, index_to_word, word_to_index)
    save_model_parameters_theano(model, MODEL_OUTPUT_FILE)
    print("\n")
    sys.stdout.flush()

for epoch in range(NEPOCH):
    train_with_sgd(model, x_train, y_train, learning_rate=LEARNING_RATE, nepoch=1, decay=0.9,
                   callback_every=PRINT_EVERY, callback=sgd_callback)

gru_theano.py

# -*- coding: utf-8 -*-

import numpy as np
import theano as theano
import theano.tensor as T
import time
import operator

class GRUTheano:
    def __init__(self, word_dim, hidden_dim=128, bptt_truncate=-1):
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate
        # Initialize the network parameters
        E = np.random.uniform(-np.sqrt(1. / word_dim), np.sqrt(1. / word_dim), (hidden_dim, word_dim))
        U = np.random.uniform(-np.sqrt(1. / hidden_dim), np.sqrt(1. / hidden_dim), (6, hidden_dim, hidden_dim))
        W = np.random.uniform(-np.sqrt(1. / hidden_dim), np.sqrt(1. / hidden_dim), (6, hidden_dim, hidden_dim))
        V = np.random.uniform(-np.sqrt(1. / hidden_dim), np.sqrt(1. / hidden_dim), (word_dim, hidden_dim))
        b = np.zeros((6, hidden_dim))
        c = np.zeros(word_dim)
        # Theano: create shared variables
        self.E = theano.shared(name='E', value=E.astype(theano.config.floatX))
        self.U = theano.shared(name='U', value=U.astype(theano.config.floatX))
        self.W = theano.shared(name='W', value=W.astype(theano.config.floatX))
        self.V = theano.shared(name='V', value=V.astype(theano.config.floatX))
        self.b = theano.shared(name='b', value=b.astype(theano.config.floatX))
        self.c = theano.shared(name='c', value=c.astype(theano.config.floatX))
        # SGD / rmsprop: initialize parameters
        self.mE = theano.shared(name='mE', value=np.zeros(E.shape).astype(theano.config.floatX))
        self.mU = theano.shared(name='mU', value=np.zeros(U.shape).astype(theano.config.floatX))
        self.mV = theano.shared(name='mV', value=np.zeros(V.shape).astype(theano.config.floatX))
        self.mW = theano.shared(name='mW', value=np.zeros(W.shape).astype(theano.config.floatX))
        self.mb = theano.shared(name='mb', value=np.zeros(b.shape).astype(theano.config.floatX))
        self.mc = theano.shared(name='mc', value=np.zeros(c.shape).astype(theano.config.floatX))
        # We store the Theano graph here
        self.theano = {}
        self.__theano_build__()

    def __theano_build__(self):
        E, V, U, W, b, c = self.E, self.V, self.U, self.W, self.b, self.c

        x = T.ivector('x')
        y = T.ivector('y')

        def forward_prop_step(x_t, s_t1_prev, s_t2_prev):
            # This is how we calculated the hidden state in a simple RNN. No longer!
            # s_t = T.tanh(U[:,x_t] + W.dot(s_t1_prev))

            # Word embedding layer
            x_e = E[:, x_t]

            # GRU Layer 1
            z_t1 = T.nnet.hard_sigmoid(U[0].dot(x_e) + W[0].dot(s_t1_prev) + b[0])
            r_t1 = T.nnet.hard_sigmoid(U[1].dot(x_e) + W[1].dot(s_t1_prev) + b[1])
            c_t1 = T.tanh(U[2].dot(x_e) + W[2].dot(s_t1_prev * r_t1) + b[2])
            s_t1 = (T.ones_like(z_t1) - z_t1) * c_t1 + z_t1 * s_t1_prev

            # GRU Layer 2
            z_t2 = T.nnet.hard_sigmoid(U[3].dot(s_t1) + W[3].dot(s_t2_prev) + b[3])
            r_t2 = T.nnet.hard_sigmoid(U[4].dot(s_t1) + W[4].dot(s_t2_prev) + b[4])
            c_t2 = T.tanh(U[5].dot(s_t1) + W[5].dot(s_t2_prev * r_t2) + b[5])
            s_t2 = (T.ones_like(z_t2) - z_t2) * c_t2 + z_t2 * s_t2_prev

            # Final output calculation
            # Theano's softmax returns a matrix with one row, we only need the row
            o_t = T.nnet.softmax(V.dot(s_t2) + c)[0]

            return [o_t, s_t1, s_t2]

        [o, s, s2], updates = theano.scan(
            forward_prop_step,
            sequences=x,
            outputs_info=[None,
                          dict(initial=T.zeros(self.hidden_dim)),
                          dict(initial=T.zeros(self.hidden_dim))])

        prediction = T.argmax(o, axis=1)
        o_error = T.sum(T.nnet.categorical_crossentropy(o, y))

        # Total cost (could add regularization here)
        cost = o_error

        # Gradients
        dE = T.grad(cost, E)
        dU = T.grad(cost, U)
        dW = T.grad(cost, W)
        db = T.grad(cost, b)
        dV = T.grad(cost, V)
        dc = T.grad(cost, c)

        # Assign functions
        self.predict = theano.function([x], o)
        self.predict_class = theano.function([x], prediction)
        self.ce_error = theano.function([x, y], cost)
        self.bptt = theano.function([x, y], [dE, dU, dW, db, dV, dc])

        # SGD parameters
        learning_rate = T.scalar('learning_rate')
        decay = T.scalar('decay')

        # rmsprop cache updates
        mE = decay * self.mE + (1 - decay) * dE ** 2
        mU = decay * self.mU + (1 - decay) * dU ** 2
        mW = decay * self.mW + (1 - decay) * dW ** 2
        mV = decay * self.mV + (1 - decay) * dV ** 2
        mb = decay * self.mb + (1 - decay) * db ** 2
        mc = decay * self.mc + (1 - decay) * dc ** 2

        self.sgd_step = theano.function(
            [x, y, learning_rate, theano.Param(decay, default=0.9)],
            [],
            updates=[(E, E - learning_rate * dE / T.sqrt(mE + 1e-6)),
                     (U, U - learning_rate * dU / T.sqrt(mU + 1e-6)),
                     (W, W - learning_rate * dW / T.sqrt(mW + 1e-6)),
                     (V, V - learning_rate * dV / T.sqrt(mV + 1e-6)),
                     (b, b - learning_rate * db / T.sqrt(mb + 1e-6)),
                     (c, c - learning_rate * dc / T.sqrt(mc + 1e-6)),
                     (self.mE, mE),
                     (self.mU, mU),
                     (self.mW, mW),
                     (self.mV, mV),
                     (self.mb, mb),
                     (self.mc, mc)
                     ])

    def calculate_total_loss(self, X, Y):
        return np.sum([self.ce_error(x, y) for x, y in zip(X, Y)])

    def calculate_loss(self, X, Y):
        # Divide the total loss by the number of words for a per-word loss
        num_words = np.sum([len(y) for y in Y])
        return self.calculate_total_loss(X, Y) / float(num_words)
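
A hypothetical usage sketch (the word indices are made up; x must be a sequence of int32 indices into a vocabulary of size word_dim):

x = np.array([0, 51, 27], dtype='int32')
model = GRUTheano(word_dim=2000, hidden_dim=128)
print(model.predict(x).shape)  # (3, 2000): next-word distribution at each step
print(model.predict_class(x))  # most likely next-word index at each step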

http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/
https://github.com/dennybritz/rnn-tutorial-gru-lstm/blob/master/gru_theano.py