RECURRENT NEURAL NETWORKS(RNN) TUTORIAL - 学习笔记

catalogue

0. 引言
1. TRAINING DATA AND PREPROCESSING
2. BUILDING THE RNN
3. TRAINING OUR NETWORK WITH THEANO AND GENERATING TEXT
4. RNN Extension

0. 引言

0x1: WHAT ARE RNNS

The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other. But for many tasks that’s a very bad idea. If you want to predict the next word in a sentence you better know which words came before it.
RNNs are called recurrent because they perform the same task for every element of a sequence, with the output being depended on the previous computations.
Another way to think about RNNs is that they have a "memory"(对上一次输入样本的memory) which captures information about what has been calculated so far.
In theory RNNs can make use of information in arbitrarily long sequences(RNN能处理任意长度的输入序列), but in practice they are limited to looking back only a few steps(RNN对之前的输入样本单元的记忆能力有限) (more on this later). Here is what a typical RNN looks like:

The above diagram shows a RNN being unrolled (or unfolded) into a full network. By unrolling we simply mean that we write out the network for the complete sequence. For example, if the sequence we care about is a sentence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word(RNN使用序列的方式处理输入样本). The formulas that govern the computation happening in a RNN are as follows:

• $x_t$ is the input at time step $t$. For example, $x_1$ could be a one-hot vector corresponding to the second word of a sentence.
• $s_t$ is the hidden state at time step $t$. It’s the “memory” of the network. $s_t$ is calculated based on the previous hidden state and the input at the current step: $s_t=f(Ux_t + Ws_{t-1})$. The function $f$ usually is a nonlinearity such as tanh or ReLU.  $s_{-1}$, which is required to calculate the first hidden state, is typically initialized to all zeroes.
• $o_t$ is the output at step $t$. For example, if we wanted to predict the next word in a sentence it would be a vector of probabilities across our vocabulary. $o_t = \mathrm{softmax}(Vs_t)$.
1. 第t个时间步的隐藏状态是h_t。它是同一时间步的输入x_t的函数
2. 由一个权重矩阵W(和我们在前馈网络中使用的一样)修正
3. 加上前一时间步的隐藏状态h_t-1乘以它自己的隐藏状态－隐藏状态矩阵的U(或称过渡矩阵，与马尔可夫链近似)
4. 权重矩阵是决定赋予当前输入及过去隐藏状态多少重要性的筛选器。它们所产生的误差将会通过反向传播返回，用于调整权重，直到误差不能再降低为止

• You can think of the hidden state $s_t$ as the memory of the network. $s_t$ captures information about what happened in all the previous time steps. The output at step $o_t$ is calculated solely based on the memory at time $t$. As briefly mentioned above, it’s a bit more complicated  in practice because $s_t$ typically can’t capture information from too many time steps ago.
• Unlike a traditional deep neural network, which uses different parameters at each layer, a RNN shares the same parameters ($U, V, W$ above) across all steps. This reflects the fact that we are performing the same task at each step(RNN把序列中的每一个元素样本都平等看待，当作同样的任务在处理), just with different inputs. This greatly reduces the total number of parameters we need to learn.
• The above diagram has outputs at each time step(RNN中每一步都可以作出next step预测), but depending on the task this may not be necessary. For example, when predicting the sentiment of a sentence we may only care about the final output, not the sentiment after each word. Similarly, we may not need inputs at each time step. The main feature of an RNN is its hidden state, which captures some information about a sequence.

序列本身即带有信息，而递归网络能利用这种信息完成前馈网络无法完成的任务

RNN处理过程序列化这个特点给它带来了一个CNN所不具备的优点

RNN的重要特性是可以处理不定长的输入，得到一定的输出。当你的输入可长可短， 比如训练翻译模型的时候， 你的句子长度都不固定，你是无法像一个训练固定像素的图像那样用CNN搞定的。而利用RNN的循环特性可以轻松搞定

0x2: WHAT CAN RNNS DO?

RNNs have shown great success in many NLP tasks. At this point I should mention that the most commonly used type of RNNs are LSTMs, which are much better at capturing long-term dependencies than vanilla RNNs are

1. LANGUAGE MODELING AND GENERATING TEXT

Given a sequence of words we want to predict the probability of each word given the previous words. Language Models allow us to measure how likely a sentence is, which is an important input for Machine Translation

2. MACHINE TRANSLATION

Machine Translation is similar to language modeling in that our input is a sequence of words in our source language (e.g. German). We want to output a sequence of words in our target language (e.g. English). A key difference is that our output only starts after we have seen the complete input, because the first word of our translated sentences may require information captured from the complete input sequence.

3. SPEECH RECOGNITION

Given an input sequence of acoustic signals from a sound wave, we can predict a sequence of phonetic segments together with their probabilities.

4. GENERATING IMAGE DESCRIPTIONS

Together with convolutional Neural Networks, RNNs have been used as part of a model to generate descriptions for unlabeled images. It’s quite amazing how well this seems to work. The combined model even aligns the generated words with features found in the images.

0x3: RNN的图灵完备性

"循环"两个字，已经点出了RNN的核心特征， 即系统的输出会保留在网络里和系统下一刻的输入一起共同决定下一刻的输出。这就把动力学的本质体现了出来， 循环正对应动力学系统的反馈概念，可以刻画复杂的历史依赖。另一个角度看也符合著名的图灵机原理。 即此刻的状态包含上一刻的历史，又是下一刻变化的依据。 这其实包含了可编程神经网络的核心概念，即， 当你有一个未知的过程，但你可以测量到输入和输出， 你假设当这个过程通过RNN的时候，它是可以自己学会这样的输入输出规律的， 而且因此具有预测能力。 在这点上说， RNN是图灵完备的

1. 图1即CNN的架构
2. 图2是把单一输入转化为序列输出，例如把图像转化成一行文字
3. 图三是把序列输入转化为单个输出，比如情感测试，测量一段话正面或负面的情绪
4. 图四是把序列转化为序列，最典型的是机器翻译
5. 图5是无时差(注意输入和输出的"时差")的序列到序列转化， 比如给一个录像中的每一帧贴标签(每一个中间状态都输出一个output)

0x3: LANGUAGE MODELING

RNN这种针对序列进行训练和序列预测的算法模型，最适合处理NLP这类的"泛文本序列数据"(因为广义上讲，语音也是文本序列)

Our goal is to build a Language Model using a Recurrent Neural Network. Here’s what that means. Let’s say we have sentence of m words. A language model allows us to predict the probability of observing the sentence (in a given dataset) as:

In words, the probability of a sentence is the product of probabilities of each word given the words that came before it. So, the probability of the sentence “He went to buy some chocolate” would be the probability of

# He went to buy some chocolate
1. "chocolate":  ->  "He went to buy some"
2. "some": ->  "He went to buy"
3. "buy": -> "He went to"
4. "to": -> "He went"
5. "went": -> "He"
6. "He": -> "initial word"

Note that in the above equation the probability of each word is conditioned on all previous words. In practice, many models have a hard time representing such long-term dependencies due to computational or memory constraints. They are typically limited to looking at only a few of the previous words. RNNs can, in theory, capture such long-term dependencies, but in practice it’s a bit more complex.

https://www.zhihu.com/question/36824148
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ 

1. TRAINING DATA AND PREPROCESSING

To train our language model we need text to learn from. Fortunately we don’t need any labels to train a language model, just raw text. I downloaded 15,000 longish reddit comments from a dataset available on Google’s BigQuery. Text generated by our model will sound like reddit commenters (hopefully)! But as with most Machine Learning projects we first need to do some pre-processing to get our data into the right format.

https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2015_08

0x1: TOKENIZE TEXT

We have raw text, but we want to make predictions on a per-word basis. This means we must tokenize our comments into sentences, and sentences into words. We could just split each of the comments by spaces, but that wouldn’t handle punctuation properly. The sentence “He left!” should be 3 tokens: “He”, “left”, “!”. We’ll use NLTK’s word_tokenize and sent_tokenize methods, which do most of the hard work for us.
0x2: REMOVE INFREQUENT WORDS

Most words in our text will only appear one or two times. It’s a good idea to remove these infrequent words. Having a huge vocabulary(词汇表) will make our model slow to train, and because we don’t have a lot of contextual examples for such words we wouldn’t be able to learn how to use them correctly anyway. That’s quite similar to how humans learn. To really understand how to appropriately use a word you need to have seen it in different contexts.(一个太过孤立且没有足够上下文支撑的词是不能从中学到任何用法的)

In our code we limit our vocabulary to the vocabulary_size most common words (which I set to 8000, but feel free to change it). We replace all words not included in our vocabulary by UNKNOWN_TOKEN.
For example, if we don’t include the word “nonlinearities” in our vocabulary, the sentence “nonlineraties are important in neural networks” becomes “UNKNOWN_TOKEN are important in Neural Networks”. The word UNKNOWN_TOKEN will become part of our vocabulary and we will predict it just like any other word.
When we generate new text we can replace UNKNOWN_TOKEN again, for example by taking a randomly sampled word not in our vocabulary, or we could just generate sentences until we get one that doesn’t contain an unknown token.

0x3: PREPEND SPECIAL START AND END TOKENS

We also want to learn which words tend start and end a sentence. To do this we prepend a special SENTENCE_START token, and append a special SENTENCE_END token to each sentence. This allows us to ask

Given that the first token is SENTENCE_START, what is the likely next word (the actual first word of the sentence)?

0x4: BUILD TRAINING DATA MATRICES

The input to our Recurrent Neural Networks are vectors, not strings. So we create a mapping between words and indices, index_to_word, and word_to_index.
For example,  the word “friendly” may be at index 2001. A training example x may look like [0, 179, 341, 416], where 0 corresponds to SENTENCE_START. The corresponding label y would be [179, 341, 416, 1].
Remember that our goal is to predict the next word(逐字预测), so y is just the x vector shifted by one position with the last element being the SENTENCE_END token. In other words, the correct prediction for word 179 above would be 341, the actual next word.

Here’s an actual training example from our text:

x:
[0, 51, 27, 16, 10, 856, 53, 25, 34, 69]

y:
[51, 27, 16, 10, 856, 53, 25, 34, 69, 1]

http://www.nltk.org/

2. BUILDING THE RNN

Training a RNN is similar to training a traditional Neural Network. We also use the backpropagation algorithm, but with a little twist. Because the parameters are shared by all time steps in the network, the gradient at each output depends not only on the calculations of the current time step, but also the previous time steps. For example, in order to calculate the gradient at $t=4$ we would need to backpropagate 3 steps and sum up the gradients. This is called Backpropagation Through Time (BPTT)

Let’s get concrete and see what the RNN for our language model looks like。But there’s one more thing: Because of how matrix multiplication works we can’t simply use a word index (like 36) as an input. Instead, we represent each word as a one-hot vector of size vocabulary_size(用整个vocabulary_size进行one-hot编码，这种方法有些浪费存储). For example, the word with index 36 would be the vector of all 0’s and a 1 at position 36

1. So, each x_t will become a vector
2. and x will be a matrix, with each row representing a word. 

We’ll perform this transformation in our Neural Network code instead of doing it in the pre-processing. The output of our network o has a similar format. Each o_t is a vector of vocabulary_size elements, and each element represents the probability of that word being the next word in the sentence.
Let’s recap the equations for the RNN from the first part of the tutorial:

Let’s assume we pick a vocabulary size $C = 8000$ and a hidden layer size $H = 100$. You can think of the hidden layer size as the “memory” of our network. Making it bigger allows us to learn more complex patterns, but also results in additional computation. Then we have

This is valuable information. Remember that $U,V$ and $W$ are the parameters of our network we want to learn from data. Thus, we need to learn a total of $2HC + H^2$ parameters. In the case of $C=8000$ and $H=100$ that’s 1,610,000. The dimensions also tell us the bottleneck of our model. Note that because $x_t$ is a one-hot vector, multiplying it with $U$ is essentially the same as selecting a column of U, so we don’t need to perform the full multiplication. Then, the biggest matrix multiplication in our network is $Vs_t$. That’s why we want to keep our vocabulary size small if possible

0x1: INITIALIZATION(超参数初始化)

We start by declaring a RNN class an initializing our parameters. I’m calling this class RNNNumpy because we will implement a Theano version later. Initializing the parameters $U,V$ and $W$ is a bit tricky. We can’t just initialize them to 0’s because that would result in symmetric calculations in all our layers. We must initialize them randomly. Because proper initialization seems to have an impact on training results there has been lot of research in this area. It turns out that the best initialization depends on the activation function ($\tanh$ in our case) and one recommended  approach is to initialize the weights randomly in the interval from $\left[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right]$ where $n$ is the number of incoming connections from the previous layer.

class RNNNumpy:

def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
self.word_dim = word_dim
self.hidden_dim = hidden_dim
self.bptt_truncate = bptt_truncate
# Randomly initialize the network parameters
self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, word_dim))
self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
self.W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))

Above, word_dim is the size of our vocabulary, and hidden_dim is the size of our hidden layer (we can pick it).

0x2: FORWARD PROPAGATION(前向反馈)

Next, let’s implement the forward propagation (predicting word probabilities) defined by our equations above:

def forward_propagation(self, x):
# The total number of time steps
T = len(x)
# During forward propagation we save all hidden states in s because need them later.
# We add one additional element for the initial hidden, which we set to 0
s = np.zeros((T + 1, self.hidden_dim))
s[-1] = np.zeros(self.hidden_dim)
# The outputs at each time step. Again, we save them for later.
o = np.zeros((T, self.word_dim))
# For each time step...
for t in np.arange(T):
# Note that we are indxing U by x[t]. This is the same as multiplying U with a one-hot vector.
s[t] = np.tanh(self.U[:,x[t]] + self.W.dot(s[t-1]))
o[t] = softmax(self.V.dot(s[t]))
return [o, s]

RNNNumpy.forward_propagation = forward_propagation

We not only return the calculated outputs, but also the hidden states. We will use them later to calculate the gradients(网络对当前序列的最终预测输出如何目标值不一样，则需要根据代价函数反向传播来调整整条链路上的所有神经元的超参数), and by returning them here we avoid duplicate computation. Each $o_t$ is a vector of probabilities representing the words in our vocabulary, but sometimes, for example when evaluating our model, all we want is the next word with the highest probability(我们需要网络在每一个神经元预测输出最大概率的下一个字符). We call this function predict:

def predict(self, x):
# Perform forward propagation and return index of the highest score
o, s = self.forward_propagation(x)
return np.argmax(o, axis=1) # 以2维数组的第一维角度计算最大值

RNNNumpy.predict = predict

Let’s try our newly implemented methods and see an example output:

np.random.seed(10)
model = RNNNumpy(vocabulary_size)
o, s = model.forward_propagation(X_train[10])
print o.shape
print o
(45, 8000)
[[ 0.00012408  0.0001244   0.00012603 ...,  0.00012515  0.00012488
0.00012508]
[ 0.00012536  0.00012582  0.00012436 ...,  0.00012482  0.00012456
0.00012451]
[ 0.00012387  0.0001252   0.00012474 ...,  0.00012559  0.00012588
0.00012551]
...,
[ 0.00012414  0.00012455  0.0001252  ...,  0.00012487  0.00012494
0.0001263 ]
[ 0.0001252   0.00012393  0.00012509 ...,  0.00012407  0.00012578
0.00012502]
[ 0.00012472  0.0001253   0.00012487 ...,  0.00012463  0.00012536
0.00012665]]

1. 输入的字符串长度为45，即有45个字符，则RNN序列需要处理45次
2. RNN神经网络把每一步预测的结果都输出出来，整体打包为了一个list，即(45, 8000)
3. 取其中一个元素看，它代表了一次一个神经元根据当前序列index预测出的next word的概率数组，它包含vocabulary_size = 8000个，代表了该神经元对vocabulary_size(8000)个one-hot的字符都进行了概率预测  

For each word in the sentence (45 above), our model made 8000 predictions representing probabilities of the next word. Note that because we initialized $U,V,W$ to random values these predictions are completely random right now. The following gives the indices of the highest probability predictions for each word:

predictions = model.predict(X_train[10])
print predictions.shape
print predictions
(45,)
[1284 5221 7653 7430 1013 3562 7366 4860 2212 6601 7299 4556 2481 238 2539
21 6548 261 1780 2005 1810 5376 4146 477 7051 4832 4991 897 3485 21
7291 2007 6006 760 4864 2182 6569 2800 2752 6821 4437 7021 7875 6912 3575]

0x3: CALCULATING THE LOSS(计算代价函数)

To train our network we need a way to measure the errors it makes. We call this the loss function $L$, and our goal is find the parameters $U,V$ and $W$ that minimize the loss function for our training data. A common choice for the loss function is the cross-entropy loss. If we have $N$ training examples (words in our text) and $C$ classes (the size of our vocabulary) then the loss with respect to our predictions $o$ and the true labels $y$ is given by:

The further away $y$ (the correct words) and $o$ (our predictions), the greater the loss will be. We implement the function calculate_loss:

def calculate_total_loss(self, x, y):
L = 0
# For each sentence...
for i in np.arange(len(y)):
o, s = self.forward_propagation(x[i])
# We only care about our prediction of the "correct" words
correct_word_predictions = o[np.arange(len(y[i])), y[i]]
# Add to the loss based on how off we were
L += -1 * np.sum(np.log(correct_word_predictions))
return L

def calculate_loss(self, x, y):
# Divide the total loss by the number of training examples
N = np.sum((len(y_i) for y_i in y))
return self.calculate_total_loss(x,y)/N

RNNNumpy.calculate_total_loss = calculate_total_loss
RNNNumpy.calculate_loss = calculate_loss

Let’s take a step back and think about what the loss should be for random predictions. That will give us a baseline and make sure our implementation is correct(考虑一个基线情况，平均每个字符的随机预测成功率是1/C，N个字符预测正确). We have $C$ words in our vocabulary, so each word should be (on average) predicted with probability $1/C$, which would yield a loss of $L = -\frac{1}{N} N \log\frac{1}{C} = \log C$: 也就是说，在随机初始化超参数的前提下，且不作任何优化调整，代价函数C最差的情况也就是这样的

# Limit to 1000 examples to save time
print "Expected Loss for random predictions: %f" % np.log(vocabulary_size)
print "Actual loss: %f" % model.calculate_loss(X_train[:1000], y_train[:1000])

Expected Loss for random predictions: 8.987197
Actual loss: 8.987440

Keep in mind that evaluating the loss on the full dataset is an expensive operation and can take hours if you have a lot of data!

0x4: TRAINING THE RNN WITH SGD AND BACKPROPAGATION THROUGH TIME (BPTT)

Let’s quickly recap the basic equations of our RNN.

We also defined our loss, or error, to be the cross entropy loss, given by:

Here, $y_t$ is the correct word at time step $t$, and $\hat{y}_t$ is our prediction. We typically treat the full sequence (sentence) as one training example, so the total error is just the sum of the errors at each time step (word).

Remember that our goal is to calculate the gradients of the error with respect to our parameters $U, V$ and $W$ and then learn good parameters using Stochastic Gradient Descent.

1. V 超参数

V是一种被称之过度矩阵的变量，只和当前t有关，并不存在时间序列依赖

In the above, $z_3 =Vs_3$, and $\otimes$ is the outer product of two vectors. The point I’m trying to get across is that $\frac{\partial E_3}{\partial V}$ only depends on the values at the current time step, $\hat{y}_3, y_3, s_3$. If you have these, calculating the gradient for $V$ a simple matrix multiplication.

2. W/U 超参数

But the story is different for $\frac{\partial E_3}{\partial W}$ (and for $U$). To see why, we write out the chain rule, just as above:

\begin{aligned} \frac{\partial E_3}{\partial W} &= \frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial s_3}\frac{\partial s_3}{\partial W}\\ \end{aligned}

Now, note that $s_3 = \tanh(Ux_t + Ws_2)$ depends on $s_2$, which depends on $W$ and $s_1$, and so on. So if we take the derivative with respect to $W$ we can’t simply treat $s_2$ as a constant! We need to apply the chain rule again and what we really have is this:

\begin{aligned} \frac{\partial E_3}{\partial W} &= \sum\limits_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial s_3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k}{\partial W}\\ \end{aligned}

We sum up the contributions of each time step to the gradient. In other words, because $W$ is used in every step up to the output we care about, we need to backpropagate gradients from $t=3$ through the network all the way to $t=0$:

Note that this is exactly the same as the standard backpropagation algorithm that we use in deep Feedforward Neural Networks. The key difference is that we sum up the gradients for W at each time step. In a traditional NN we don’t share parameters across layers, so we don’t need to sum anything

def bptt(self, x, y):
T = len(y)
# Perform forward propagation
o, s = self.forward_propagation(x)
# We accumulate the gradients in these variables
dLdU = np.zeros(self.U.shape)
dLdV = np.zeros(self.V.shape)
dLdW = np.zeros(self.W.shape)
delta_o = o
delta_o[np.arange(len(y)), y] -= 1.
# For each output backwards... 每一步t都要计算一个代价函数的偏导数
for t in np.arange(T)[::-1]:
dLdV += np.outer(delta_o[t], s[t].T)
# Initial delta calculation: dL/dz
delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
# Backpropagation through time (for at most self.bptt_truncate steps)
for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:
# print "Backpropagation step t=%d bptt step=%d " % (t, bptt_step)
dLdW += np.outer(delta_t, s[bptt_step-1])
dLdU[:,x[bptt_step]] += delta_t
# Update delta for next step dL/dz at t-1
delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
# 最后返回总的W和U，可以看到，W和U想级数和的形式，即1+2+3+...+N的形式，越到后面的神经元，需要反向的t的次数越多
return [dLdU, dLdV, dLdW]

Just like with Backpropagation you could define a delta vector that you pass backwards, e.g.: $\delta_2^{(3)} = \frac{\partial E_3}{\partial z_2} =\frac{\partial E_3}{\partial s_3}\frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial z_2}$ with $z_2 = Ux_2+ Ws_1$. Then the same equations will apply.

RNNs are hard to train: Sequences (sentences) can be quite long, perhaps 20 words or more, and thus you need to back-propagate through many layers. In practice many people truncate the backpropagation to a few steps.

Whenever you implement backpropagation it is good idea to also implement gradient checking, which is a way of verifying that your implementation is correct. The idea behind gradient checking is that derivative of a parameter is equal to the slope at the point, which we can approximate by slightly changing the parameter and then dividing by the change:

We then compare the gradient we calculated using backpropagation to the gradient we estimated with the method above. If there’s no large difference we are good. The approximation needs to calculate the total loss for every parameter, so that gradient checking is very expensive (remember, we had more than a million parameters in the example above). So it’s a good idea to perform it on a model with a smaller vocabulary.

def gradient_check(self, x, y, h=0.001, error_threshold=0.01):
# Calculate the gradients using backpropagation. We want to checker if these are correct.
# List of all parameters we want to check.
model_parameters = ['U', 'V', 'W']
# Gradient check for each parameter
for pidx, pname in enumerate(model_parameters):
# Get the actual parameter value from the mode, e.g. model.W
parameter = operator.attrgetter(pname)(self)
print "Performing gradient check for parameter %s with size %d." % (pname, np.prod(parameter.shape))
# Iterate over each element of the parameter matrix, e.g. (0,0), (0,1), ...
while not it.finished:
ix = it.multi_index
# Save the original value so we can reset it later
original_value = parameter[ix]
# Estimate the gradient using (f(x+h) - f(x-h))/(2*h)
parameter[ix] = original_value + h
parameter[ix] = original_value - h
# Reset parameter to original value
parameter[ix] = original_value
# The gradient for this parameter calculated using backpropagation
# calculate The relative error: (|x - y|/(|x| + |y|))
# If the error is to large fail the gradient check
if relative_error &gt; error_threshold:
print "Gradient Check ERROR: parameter=%s ix=%s" % (pname, ix)
print "+h Loss: %f" % gradplus
print "-h Loss: %f" % gradminus
print "Relative Error: %f" % relative_error
return
it.iternext()
print "Gradient check for parameter %s passed." % (pname)

# To avoid performing millions of expensive calculations we use a smaller vocabulary size for checking.
np.random.seed(10)
model.gradient_check([0,1,2,3], [1,2,3,4])

0x6: SGD IMPLEMENTATION

Now that we are able to calculate the gradients for our parameters we can implement SGD. I like to do this in two steps: 1. A function sdg_step that calculates the gradients and performs the updates for one batch. 2. An outer loop that iterates through the training set and adjusts the learning rate.

# Performs one step of SGD.
def numpy_sdg_step(self, x, y, learning_rate):
dLdU, dLdV, dLdW = self.bptt(x, y)
# Change parameters according to gradients and learning rate
self.U -= learning_rate * dLdU
self.V -= learning_rate * dLdV
self.W -= learning_rate * dLdW

RNNNumpy.sgd_step = numpy_sdg_step
# Outer SGD Loop
# - model: The RNN model instance
# - X_train: The training data set
# - y_train: The training data labels
# - learning_rate: Initial learning rate for SGD
# - nepoch: Number of times to iterate through the complete dataset
# - evaluate_loss_after: Evaluate the loss after this many epochs
def train_with_sgd(model, X_train, y_train, learning_rate=0.005, nepoch=100, evaluate_loss_after=5):
# We keep track of the losses so we can plot them later
losses = []
num_examples_seen = 0
for epoch in range(nepoch):
# Optionally evaluate the loss
if (epoch % evaluate_loss_after == 0):
loss = model.calculate_loss(X_train, y_train)
losses.append((num_examples_seen, loss))
time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
print "%s: Loss after num_examples_seen=%d epoch=%d: %f" % (time, num_examples_seen, epoch, loss)
# Adjust the learning rate if loss increases
if (len(losses) &gt; 1 and losses[-1][1] &gt; losses[-2][1]):
learning_rate = learning_rate * 0.5
print "Setting learning rate to %f" % learning_rate
sys.stdout.flush()
# For each training example...
for i in range(len(y_train)):
# One SGD step
model.sgd_step(X_train[i], y_train[i], learning_rate)
num_examples_seen += 1

RNNs have difficulties learning long-range dependencies – interactions between words that are several steps apart. That’s problematic because the meaning of an English sentence is often determined by words that aren’t very close: “The man who wore a wig on his head went inside”. The sentence is really about a man going inside, not about the wig. But it’s unlikely that a plain RNN would be able capture such information. To understand why, let’s take a closer look at the gradient we calculated above:

Note that $\frac{\partial s_3}{\partial s_k}$ is a chain rule in itself! For example, $\frac{\partial s_3}{\partial s_1} =\frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial s_1}$. Also note that because we are taking the derivative of a vector function with respect to a vector, the result is a matrix (called the Jacobian matrix) whose elements are all the pointwise derivatives. We can rewrite the above gradient:

\begin{aligned} \frac{\partial E_3}{\partial W} &= \sum\limits_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial s_3} \left(\prod\limits_{j=k+1}^{3} \frac{\partial s_j}{\partial s_{j-1}}\right) \frac{\partial s_k}{\partial W}\\ \end{aligned}

It turns out that the 2-norm, which you can think of it as an absolute value, of the above Jacobian matrix has an upper bound of 1. This makes intuitive sense because our $tanh$ (or sigmoid) activation function maps all values into a range between -1 and 1, and the derivative is bounded by 1 (1/4 in the case of sigmoid. 1 in the case of tanh ) as well:

You can see that the $\tanh$ and sigmoid functions have derivatives of 0 at both ends. They approach a  flat line. When this happens we say the corresponding neurons are saturated. They have a zero gradient and drive other gradients in previous layers towards 0. Thus, with small values in the matrix and multiple matrix multiplications ($t-k$ in particular) the gradient values are shrinking exponentially fast, eventually vanishing completely after a few time steps. Gradient contributions from “far away” steps become zero, and the state at those steps doesn’t contribute to what you are learning: You end up not learning long-range dependencies. Vanishing gradients aren’t exclusive to RNNs. They also happen in deep Feedforward Neural Networks. It’s just that RNNs tend to be very deep (as deep as the sentence length in our case), which makes the problem a lot more common.

Fortunately, there are a few ways to combat the vanishing gradient problem.

1. Proper initialization of the W matrix can reduce the effect of vanishing gradients. So can regularization.
2. A more preferred solution is to use ReLU instead of tanh or sigmoid activation functions. The ReLU derivative is a constant of either 0 or 1, so it isn’t as likely to suffer from vanishing gradients.
3. An even more popular solution is to use Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) architectures.
1) LSTMs were first proposed in 1997 and are the perhaps most widely used models in NLP today.
2) GRUs, first proposed in 2014, are simplified versions of LSTMs. 

https://github.com/dennybritz/rnn-tutorial-rnnlm
https://github.com/dennybritz/rnn-tutorial-rnnlm/blob/master/train-theano.py
https://github.com/dennybritz/rnn-tutorial-rnnlm/blob/master/rnn_theano.py
http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/

3. TRAINING OUR NETWORK WITH THEANO AND GENERATING TEXT

0x1: code

#! /usr/bin/env python

import csv
import itertools
import operator
import numpy as np
import nltk
import sys
import os
import time
from datetime import datetime
from utils import *
from rnn_theano import RNNTheano

_VOCABULARY_SIZE = int(os.environ.get('VOCABULARY_SIZE', '8000'))
_HIDDEN_DIM = int(os.environ.get('HIDDEN_DIM', '80'))
_LEARNING_RATE = float(os.environ.get('LEARNING_RATE', '0.005'))
_NEPOCH = int(os.environ.get('NEPOCH', '100'))
_MODEL_FILE = os.environ.get('MODEL_FILE')

def train_with_sgd(model, X_train, y_train, learning_rate=0.005, nepoch=1, evaluate_loss_after=5):
# We keep track of the losses so we can plot them later
losses = []
num_examples_seen = 0
for epoch in range(nepoch):
# Optionally evaluate the loss
if (epoch % evaluate_loss_after == 0):
loss = model.calculate_loss(X_train, y_train)
losses.append((num_examples_seen, loss))
time = datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
print "%s: Loss after num_examples_seen=%d epoch=%d: %f" % (time, num_examples_seen, epoch, loss)
# Adjust the learning rate if loss increases
if (len(losses) > 1 and losses[-1][1] > losses[-2][1]):
learning_rate = learning_rate * 0.5
print "Setting learning rate to %f" % learning_rate
sys.stdout.flush()
save_model_parameters_theano("./data/rnn-theano-%d-%d-%s.npz" % (model.hidden_dim, model.word_dim, time), model)
# For each training example...
for i in range(len(y_train)):
# One SGD step
model.sgd_step(X_train[i], y_train[i], learning_rate)
num_examples_seen += 1

vocabulary_size = _VOCABULARY_SIZE
unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

# Read the data and append SENTENCE_START and SENTENCE_END tokens
# Split full comments into sentences
sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])
# Append SENTENCE_START and SENTENCE_END
sentences = ["%s %s %s" % (sentence_start_token, x, sentence_end_token) for x in sentences]
print "Parsed %d sentences." % (len(sentences))

# Tokenize the sentences into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print "Found %d unique words tokens." % len(word_freq.items())

# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])

print "Using vocabulary size %d." % vocabulary_size
print "The least frequent word in our vocabulary is '%s' and appeared %d times." % (vocab[-1][0], vocab[-1][1])

# Replace all words not in our vocabulary with the unknown token
for i, sent in enumerate(tokenized_sentences):
tokenized_sentences[i] = [w if w in word_to_index else unknown_token for w in sent]

# Create the training data
X_train = np.asarray([[word_to_index[w] for w in sent[:-1]] for sent in tokenized_sentences])
y_train = np.asarray([[word_to_index[w] for w in sent[1:]] for sent in tokenized_sentences])

model = RNNTheano(vocabulary_size, hidden_dim=_HIDDEN_DIM)
t1 = time.time()
model.sgd_step(X_train[10], y_train[10], _LEARNING_RATE)
t2 = time.time()
print "SGD Step time: %f milliseconds" % ((t2 - t1) * 1000.)

if _MODEL_FILE != None:

train_with_sgd(model, X_train, y_train, nepoch=_NEPOCH, learning_rate=_LEARNING_RATE)

RNN class

import numpy as np
import theano as theano
import theano.tensor as T
from utils import *
import operator

class RNNTheano:

def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
self.word_dim = word_dim
self.hidden_dim = hidden_dim
self.bptt_truncate = bptt_truncate
# Randomly initialize the network parameters
U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, word_dim))
V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))
# Theano: Created shared variables
self.U = theano.shared(name='U', value=U.astype(theano.config.floatX))
self.V = theano.shared(name='V', value=V.astype(theano.config.floatX))
self.W = theano.shared(name='W', value=W.astype(theano.config.floatX))
# We store the Theano graph here
self.theano = {}
self.__theano_build__()

def __theano_build__(self):
U, V, W = self.U, self.V, self.W
x = T.ivector('x')
y = T.ivector('y')
def forward_prop_step(x_t, s_t_prev, U, V, W):
s_t = T.tanh(U[:,x_t] + W.dot(s_t_prev))
o_t = T.nnet.softmax(V.dot(s_t))
return [o_t[0], s_t]
forward_prop_step,
sequences=x,
outputs_info=[None, dict(initial=T.zeros(self.hidden_dim))],
non_sequences=[U, V, W],
strict=True)

prediction = T.argmax(o, axis=1)
o_error = T.sum(T.nnet.categorical_crossentropy(o, y))

# Assign functions
self.forward_propagation = theano.function([x], o)
self.predict = theano.function([x], prediction)
self.ce_error = theano.function([x, y], o_error)
self.bptt = theano.function([x, y], [dU, dV, dW])

# SGD
learning_rate = T.scalar('learning_rate')
self.sgd_step = theano.function([x,y,learning_rate], [],
updates=[(self.U, self.U - learning_rate * dU),
(self.V, self.V - learning_rate * dV),
(self.W, self.W - learning_rate * dW)])

def calculate_total_loss(self, X, Y):
return np.sum([self.ce_error(x,y) for x,y in zip(X,Y)])

def calculate_loss(self, X, Y):
# Divide calculate_loss by the number of words
num_words = np.sum([len(y) for y in Y])
return self.calculate_total_loss(X,Y)/float(num_words)

def gradient_check_theano(model, x, y, h=0.001, error_threshold=0.01):
# Overwrite the bptt attribute. We need to backpropagate all the way to get the correct gradient
model.bptt_truncate = 1000
# Calculate the gradients using backprop
# List of all parameters we want to chec.
model_parameters = ['U', 'V', 'W']
# Gradient check for each parameter
for pidx, pname in enumerate(model_parameters):
# Get the actual parameter value from the mode, e.g. model.W
parameter_T = operator.attrgetter(pname)(model)
parameter = parameter_T.get_value()
print "Performing gradient check for parameter %s with size %d." % (pname, np.prod(parameter.shape))
# Iterate over each element of the parameter matrix, e.g. (0,0), (0,1), ...
while not it.finished:
ix = it.multi_index
# Save the original value so we can reset it later
original_value = parameter[ix]
# Estimate the gradient using (f(x+h) - f(x-h))/(2*h)
parameter[ix] = original_value + h
parameter_T.set_value(parameter)
parameter[ix] = original_value - h
parameter_T.set_value(parameter)
parameter[ix] = original_value
parameter_T.set_value(parameter)
# The gradient for this parameter calculated using backpropagation
# calculate The relative error: (|x - y|/(|x| + |y|))
# If the error is to large fail the gradient check
if relative_error > error_threshold:
print "Gradient Check ERROR: parameter=%s ix=%s" % (pname, ix)
print "+h Loss: %f" % gradplus
print "-h Loss: %f" % gradminus
print "Relative Error: %f" % relative_error
return
it.iternext()
print "Gradient check for parameter %s passed." % (pname)

utils.py

import numpy as np

def softmax(x):
xt = np.exp(x - np.max(x))
return xt / np.sum(xt)

def save_model_parameters_theano(outfile, model):
U, V, W = model.U.get_value(), model.V.get_value(), model.W.get_value()
np.savez(outfile, U=U, V=V, W=W)
print "Saved model parameters to %s." % outfile

U, V, W = npzfile["U"], npzfile["V"], npzfile["W"]
model.hidden_dim = U.shape[0]
model.word_dim = U.shape[1]
model.U.set_value(U)
model.V.set_value(V)
model.W.set_value(W)
print "Loaded model parameters from %s. hidden_dim=%d word_dim=%d" % (path, U.shape[0], U.shape[1])


0x2: GENERATING TEXT

Now that we have our model we can ask it to generate new text for us! Let’s implement a helper function to generate new sentences:

def generate_sentence(model):
# We start the sentence with the start token
new_sentence = [word_to_index[sentence_start_token]]
# Repeat until we get an end token
while not new_sentence[-1] == word_to_index[sentence_end_token]:
next_word_probs = model.forward_propagation(new_sentence)
sampled_word = word_to_index[unknown_token]
# We don't want to sample unknown words
while sampled_word == word_to_index[unknown_token]:
samples = np.random.multinomial(1, next_word_probs[-1])
sampled_word = np.argmax(samples)
new_sentence.append(sampled_word)
sentence_str = [index_to_word[x] for x in new_sentence[1:-1]]
return sentence_str

num_sentences = 10
senten_min_length = 7

for i in range(num_sentences):
sent = []
# We want long sentences, not sentences with one or two words
while len(sent) &lt; senten_min_length:
sent = generate_sentence(model)
print " ".join(sent)

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/

4. RNN Extension

0x1: Bidirectional RNNs

Bidirectional RNNs are based on the idea that the output at time t may not only depend on the previous elements in the sequence, but also future elements. For example, to predict a missing word in a sequence you want to look at both the left and the right context.
Bidirectional RNNs are quite simple. They are just two RNNs stacked on top of each other. The output is then computed based on the hidden state of both RNNs.

0x2: Deep (Bidirectional) RNNs

Deep (Bidirectional) RNNs are similar to Bidirectional RNNs, only that we now have multiple layers per time step. In practice this gives us a higher learning capacity (but we also need a lot of training data).

0x3: LSTM networks

LSTM networks are quite popular these days.
LSTMs don’t have a fundamentally different architecture from RNNs, but they use a different function to compute the hidden state(为了规避梯度消失/爆炸问题). The memory in LSTMs are called cells and you can think of them as black boxes that take as input the previous state h_{t-1} and current input x_t. Internally these cells  decide what to keep in (and what to erase from) memory.
They then combine the previous state, the current memory, and the input. It turns out that these types of units are very efficient at capturing long-term dependencies.

http://colah.github.io/posts/2015-08-Understanding-LSTMs/