NLP FROM SCRATCH: CLASSIFYING NAMES WITH A CHARACTER-LEVEL RNN

翻译总结于 pytorch官网      Author: Sean Robertson

构建一个基础的character-level RNN来分类单词。本文和接下来的两篇文章将会从头（数据处理）开始构建NLP模型。特别的没有利用各种已有的操作（torchtext），所以可以从底层来学习NLP模型的处理过程。

character-level RNN按照序列读取单词，每步输出一个预测和“隐藏态”（hidden state），将前一个隐藏态输入到下一步。利用最终的预测作为输出，即预测的类别。

特别的，将对来自18种语言的几千个姓氏进行训练，并根据拼写来预测该姓氏来自哪种语言。例如：

推荐阅读:

I assume you have at least installed PyTorch, know Python, and understand Tensors:

https://pytorch.org/ For installation instructions
Deep Learning with PyTorch: A 60 Minute Blitz to get started with PyTorch in general
Learning PyTorch with Examples for a wide and deep overview
PyTorch for Former Torch Users if you are former Lua Torch user

It would also be useful to know about RNNs and how they work:

The Unreasonable Effectiveness of Recurrent Neural Networks shows a bunch of real life examples
Understanding LSTM Networks is about LSTMs specifically but also informative about RNNs in general

数据准备

数据下载：Download the data from here and extract it to the current directory.

在data/names文件下有18个txt文件，以[language].txt命名。每个文件包含了一些名字，一行一个名字（需要将其从unicode转为ascii）。

最后生成字典列表，每个字典元素都是：{language：[names...]}。字典格式：变量“category”和“line”（表示语言和名字）在后面会用到。

 1 # -*- coding: utf-8 -*-
 2 """
 3 Created on Tue Apr 14 11:12:13 2020
 4 
 5 @author: lenovo
 6 """
 7 
 8 from __future__ import unicode_literals, print_function, division
 9 from io import open
10 import glob
11 import os
12 
13 def findFiles(path): return glob.glob(path)
14 
15 print(findFiles('data/names/*.txt'))
16 
17 import unicodedata
18 import string
19 
20 all_letters = string.ascii_letters + " .,;'"
21 n_letters = len(all_letters)         # 57
22 
23 # Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
24 def unicodeToAscii(s):
25     return ''.join(
26         c for c in unicodedata.normalize('NFD', s)
27         if unicodedata.category(c) != 'Mn'
28         and c in all_letters
29     )
30 
31 print(unicodeToAscii('Ślusàrski'))
32 
33 # Build the category_lines dictionary, a list of names per language
34 category_lines = {}
35 all_categories = []
36 
37 # Read a file and split into lines
38 def readLines(filename):
39     lines = open(filename, encoding='utf-8').read().strip().split('\n')
40     return [unicodeToAscii(line) for line in lines]
41 
42 for filename in findFiles('data/names/*.txt'):
43     category = os.path.splitext(os.path.basename(filename))[0]
44     all_categories.append(category)
45     lines = readLines(filename)
46     category_lines[category] = lines     # 字典 语言到名字
47 
48 n_categories = len(all_categories)   # 字典长度 18

category_lines变量里面就是一个字典，包括从category（不同语言）到名字names（列表）的映射。同时all_categories就是所有语言的list，而n_categories是list长度，即语言数。

将名字转为Tensor

为利用数据，需要将name转为tensor。利用独热编码实现编码字母，例如b就是<0 1 0 0 0 ...>。所以一个名字就是2维的矩阵（由字母组成）。在批量这个维度上有维度1（pytorch语法格式）。

 1 import torch
 2 
 3 # Find letter index from all_letters, e.g. "a" = 0
 4 def letterToIndex(letter):
 5     return all_letters.find(letter)
 6 
 7 # Just for demonstration, turn a letter into a <1 x n_letters> Tensor
 8 def letterToTensor(letter):
 9     tensor = torch.zeros(1, n_letters)
10     tensor[0][letterToIndex(letter)] = 1
11     return tensor
12 
13 # Turn a line into a <line_length x 1 x n_letters>,
14 # or an array of one-hot letter vectors
15 def lineToTensor(line):
16     tensor = torch.zeros(len(line), 1, n_letters)
17     for li, letter in enumerate(line):
18         tensor[li][0][letterToIndex(letter)] = 1
19     return tensor
20 
21 print(letterToTensor('J'))
22 
23 print(lineToTensor('Jones').size())

建立网络

在自动求导前，toch里构建循环神经网络包括在几个时间步骤中复制网络层参数。该网络层包括隐藏状态和梯度。因为是自动求导，所以可以构建一个非常纯的RNN实现。本文的RNN模型主要copy自the PyTorch for Torch users tutorial，仅仅有两层组成。包括输入、隐藏态、logsoftmax层。

 1 class RNN(nn.Module):
 2     def __init__(self, input_size, hidden_size, output_size):
 3         super(RNN, self).__init__()
 4 
 5         self.hidden_size = hidden_size
 6 
 7         self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
 8         self.i2o = nn.Linear(input_size + hidden_size, output_size)
 9         self.softmax = nn.LogSoftmax(dim=1)
10 
11     def forward(self, input, hidden):
12         combined = torch.cat((input, hidden), 1)
13         hidden = self.i2h(combined)
14         output = self.i2o(combined)
15         output = self.softmax(output)
16         return output, hidden
17 
18     def initHidden(self):
19         return torch.zeros(1, self.hidden_size)
20 
21 n_hidden = 128
22 rnn = RNN(n_letters, n_hidden, n_categories)

这步需要给一个输入（当前字母的tensor）和之前的隐藏态（最初初始化为全零）。得到的是输出（每个语言的概率）和下一个隐藏态（存起来为了下一步）。

1 input = letterToTensor('A')
2 hidden =torch.zeros(1, n_hidden)
3 
4 output, next_hidden = rnn(input, hidden)

为了效率，不想每一步都创一个新的tensor，利用lineToTensor而不是letterToTensor，并利用切片。预处理批量tensors时可以进一步优化。

1 input = lineToTensor('Albert')
2 hidden = torch.zeros(1, n_hidden)
3 
4 output, next_hidden = rnn(input[0], hidden)
5 print(output)

这个就是对所有类别的预测，给定一个名字，哪个概率最高，就属于哪类。

训练（准备工作）

正式训练之前需要一些帮助函数。首先是网络的输出，即类别概率，利用Tensor.topk来得到最大值的索引。

1 def categoryFromOutput(output):
2     top_n, top_i = output.topk(1)
3     category_i = top_i[0].item()
4     return all_categories[category_i], category_i
5 
6 print(categoryFromOutput(output))

以下是一个随即整的训练样本：

 1 import random
 2 
 3 def randomChoice(l):
 4     return l[random.randint(0, len(l) - 1)]
 5 
 6 def randomTrainingExample():
 7     category = randomChoice(all_categories)
 8     line = randomChoice(category_lines[category])
 9     category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
10     line_tensor = lineToTensor(line)
11     return category, line, category_tensor, line_tensor
12 
13 for i in range(10):
14     category, line, category_tensor, line_tensor = randomTrainingExample()
15     print('category =', category, '/ line =', line)

训练网络

对于损失函数，可以利用nn.NLLLoss，RNN的最后一层为nn.LogSoftmax

criterion = nn.NLLLoss()

每个循环过程分别执行：

创建输入和目标tensor
创建初始的隐藏态为全零
读取每个字母并为下个字母保持隐藏态
对比输出和目标
反向传播
输出loss

 1 learning_rate = 0.005 # If you set this too high, it might explode. If too low, it might not learn
 2 
 3 def train(category_tensor, line_tensor):
 4     hidden = rnn.initHidden()
 5 
 6     rnn.zero_grad()
 7 
 8     for i in range(line_tensor.size()[0]):
 9         output, hidden = rnn(line_tensor[i], hidden)
10 
11     loss = criterion(output, category_tensor)
12     loss.backward()
13 
14     # Add parameters' gradients to their values, multiplied by learning rate
15     for p in rnn.parameters():
16         p.data.add_(-learning_rate, p.grad.data)
17 
18     return output, loss.item()

现在可以跑了。train函数返回输出和损失，可以打印一波看看。因为有1000多个样本，仅仅打印一部分看看。

 1 import time
 2 import math
 3 
 4 n_iters = 100000
 5 print_every = 5000
 6 plot_every = 1000
 7 
 8 
 9 
10 # Keep track of losses for plotting
11 current_loss = 0
12 all_losses = []
13 
14 def timeSince(since):
15     now = time.time()
16     s = now - since
17     m = math.floor(s / 60)
18     s -= m * 60
19     return '%dm %ds' % (m, s)
20 
21 start = time.time()
22 
23 for iter in range(1, n_iters + 1):
24     category, line, category_tensor, line_tensor = randomTrainingExample()
25     output, loss = train(category_tensor, line_tensor)
26     current_loss += loss
27 
28     # Print iter number, loss, name and guess
29     if iter % print_every == 0:
30         guess, guess_i = categoryFromOutput(output)
31         correct = '✓' if guess == category else '✗ (%s)' % category
32         print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))
33 
34     # Add current loss avg to list of losses
35     if iter % plot_every == 0:
36         all_losses.append(current_loss / plot_every)
37         current_loss = 0

画出结果图

画出累计的损失值

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

plt.figure()
plt.plot(all_losses)

结果评估

为了对比在不同类别的表现，构建一个混淆矩阵，每行表示一个语言，列为各个语言的预测。为了计算混合矩阵，使用evaluate（）在网络中运行一组样本，与train（）中除去backprop相同。

 1 # Keep track of correct guesses in a confusion matrix
 2 confusion = torch.zeros(n_categories, n_categories)
 3 n_confusion = 10000
 4 
 5 # Just return an output given a line
 6 def evaluate(line_tensor):
 7     hidden = rnn.initHidden()
 8 
 9     for i in range(line_tensor.size()[0]):
10         output, hidden = rnn(line_tensor[i], hidden)
11 
12     return output
13 
14 # Go through a bunch of examples and record which are correctly guessed
15 for i in range(n_confusion):
16     category, line, category_tensor, line_tensor = randomTrainingExample()
17     output = evaluate(line_tensor)
18     guess, guess_i = categoryFromOutput(output)
19     category_i = all_categories.index(category)
20     confusion[category_i][guess_i] += 1
21 
22 # Normalize by dividing every row by its sum
23 for i in range(n_categories):
24     confusion[i] = confusion[i] / confusion[i].sum()
25 
26 # Set up plot
27 fig = plt.figure()
28 ax = fig.add_subplot(111)
29 cax = ax.matshow(confusion.numpy())
30 fig.colorbar(cax)
31 
32 # Set up axes
33 ax.set_xticklabels([''] + all_categories, rotation=90)
34 ax.set_yticklabels([''] + all_categories)
35 
36 # Force label at every tick
37 ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
38 ax.yaxis.set_major_locator(ticker.MultipleLocator(1))
39 
40 # sphinx_gallery_thumbnail_number = 2
41 plt.show()

用户预测

 1 def predict(input_line, n_predictions=3):
 2     print('\n> %s' % input_line)
 3     with torch.no_grad():
 4         output = evaluate(lineToTensor(input_line))
 5 
 6         # Get top N categories
 7         topv, topi = output.topk(n_predictions, 1, True)
 8         predictions = []
 9 
10         for i in range(n_predictions):
11             value = topv[0][i].item()
12             category_index = topi[0][i].item()
13             print('(%.2f) %s' % (value, all_categories[category_index]))
14             predictions.append([value, all_categories[category_index]])
15 
16 predict('Dovesky')
17 predict('Jackson')
18 predict('Satoshi')

The final versions of the scripts in the Practical PyTorch repo split the above code into a few files:

data.py (loads files)
model.py (defines the RNN)
train.py (runs training)
predict.py (runs predict() with command line arguments)
server.py (serve prediction as a JSON API with bottle.py)

Run train.py to train and save the network.

Run predict.py with a name to view predictions:

Run server.py and visit http://localhost:5533/Yourname to get JSON output of predictions.

Exercises

Try with a different dataset of line -> category, for example:
- Any word -> language
- First name -> gender
- Character name -> writer
- Page title -> blog or subreddit
Get better results with a bigger and/or better shaped network
- Add more linear layers
- Try the nn.LSTM and nn.GRU layers
- Combine multiple of these RNNs as a higher level network

Total running time of the script: ( 3 minutes 6.446 seconds)

整体代码：

  1 # -*- coding: utf-8 -*-
  2 """
  3 Created on Tue Apr 14 11:12:13 2020
  4 
  5 @author: lenovo
  6 """
  7 
  8 from __future__ import unicode_literals, print_function, division
  9 from io import open
 10 import glob
 11 import os
 12 
 13 def findFiles(path): return glob.glob(path)
 14 
 15 print(findFiles('data/names/*.txt'))
 16 
 17 import unicodedata
 18 import string
 19 
 20 all_letters = string.ascii_letters + " .,;'"
 21 n_letters = len(all_letters)
 22 
 23 # Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
 24 def unicodeToAscii(s):
 25     return ''.join(
 26         c for c in unicodedata.normalize('NFD', s)
 27         if unicodedata.category(c) != 'Mn'
 28         and c in all_letters
 29     )
 30 
 31 print(unicodeToAscii('Ślusàrski'))
 32 
 33 # Build the category_lines dictionary, a list of names per language
 34 category_lines = {}
 35 all_categories = []
 36 
 37 # Read a file and split into lines
 38 def readLines(filename):
 39     lines = open(filename, encoding='utf-8').read().strip().split('\n')
 40     return [unicodeToAscii(line) for line in lines]
 41 
 42 for filename in findFiles('data/names/*.txt'):
 43     category = os.path.splitext(os.path.basename(filename))[0]
 44     all_categories.append(category)
 45     lines = readLines(filename)
 46     category_lines[category] = lines
 47 
 48 n_categories = len(all_categories)
 49 
 50 
 51 
 52 import torch
 53 
 54 # Find letter index from all_letters, e.g. "a" = 0
 55 def letterToIndex(letter):
 56     return all_letters.find(letter)
 57 
 58 # Just for demonstration, turn a letter into a <1 x n_letters> Tensor
 59 def letterToTensor(letter):
 60     tensor = torch.zeros(1, n_letters)
 61     tensor[0][letterToIndex(letter)] = 1
 62     return tensor
 63 
 64 # Turn a line into a <line_length x 1 x n_letters>,
 65 # or an array of one-hot letter vectors
 66 def lineToTensor(line):
 67     tensor = torch.zeros(len(line), 1, n_letters)
 68     for li, letter in enumerate(line):
 69         tensor[li][0][letterToIndex(letter)] = 1
 70     return tensor
 71 
 72 print(letterToTensor('J'))
 73 
 74 print(lineToTensor('Jones').size())
 75 
 76 
 77 import torch.nn as nn
 78 
 79 class RNN(nn.Module):
 80     def __init__(self, input_size, hidden_size, output_size):
 81         super(RNN, self).__init__()
 82 
 83         self.hidden_size = hidden_size
 84 
 85         self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
 86         self.i2o = nn.Linear(input_size + hidden_size, output_size)
 87         self.softmax = nn.LogSoftmax(dim=1)
 88 
 89     def forward(self, input, hidden):
 90         combined = torch.cat((input, hidden), 1)
 91         hidden = self.i2h(combined)
 92         output = self.i2o(combined)
 93         output = self.softmax(output)
 94         return output, hidden
 95 
 96     def initHidden(self):
 97         return torch.zeros(1, self.hidden_size)
 98 
 99 n_hidden = 128
100 rnn = RNN(n_letters, n_hidden, n_categories)
101 
102 input = letterToTensor('A')
103 hidden =torch.zeros(1, n_hidden)
104 
105 output, next_hidden = rnn(input, hidden)
106 
107 def categoryFromOutput(output):
108     top_n, top_i = output.topk(1)
109     category_i = top_i[0].item()
110     return all_categories[category_i], category_i
111 
112 print(categoryFromOutput(output))
113 
114 import random
115 
116 def randomChoice(l):
117     return l[random.randint(0, len(l) - 1)]
118 
119 def randomTrainingExample():
120     category = randomChoice(all_categories)
121     line = randomChoice(category_lines[category])
122     category_tensor = torch.tensor([all_categories.index(category)], dtype=torch.long)
123     line_tensor = lineToTensor(line)
124     return category, line, category_tensor, line_tensor
125 
126 for i in range(10):
127     category, line, category_tensor, line_tensor = randomTrainingExample()
128     print('category =', category, '/ line =', line)
129     
130 criterion = nn.NLLLoss()
131 
132 learning_rate = 0.005 # If you set this too high, it might explode. If too low, it might not learn
133 
134 def train(category_tensor, line_tensor):
135     hidden = rnn.initHidden()
136 
137     rnn.zero_grad()
138 
139     for i in range(line_tensor.size()[0]):
140         output, hidden = rnn(line_tensor[i], hidden)
141 
142     loss = criterion(output, category_tensor)
143     loss.backward()
144 
145     # Add parameters' gradients to their values, multiplied by learning rate
146     for p in rnn.parameters():
147         p.data.add_(-learning_rate, p.grad.data)
148 
149     return output, loss.item()
150     
151 
152 import time
153 import math
154 
155 n_iters = 100000
156 print_every = 5000
157 plot_every = 1000
158 
159 
160 
161 # Keep track of losses for plotting
162 current_loss = 0
163 all_losses = []
164 
165 def timeSince(since):
166     now = time.time()
167     s = now - since
168     m = math.floor(s / 60)
169     s -= m * 60
170     return '%dm %ds' % (m, s)
171 
172 start = time.time()
173 
174 for iter in range(1, n_iters + 1):
175     category, line, category_tensor, line_tensor = randomTrainingExample()
176     output, loss = train(category_tensor, line_tensor)
177     current_loss += loss
178 
179     # Print iter number, loss, name and guess
180     if iter % print_every == 0:
181         guess, guess_i = categoryFromOutput(output)
182         correct = '✓' if guess == category else '✗ (%s)' % category
183         print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))
184 
185     # Add current loss avg to list of losses
186     if iter % plot_every == 0:
187         all_losses.append(current_loss / plot_every)
188         current_loss = 0
189         
190 import time
191 import math
192 
193 n_iters = 100000
194 print_every = 5000
195 plot_every = 1000
196 
197 
198 
199 # Keep track of losses for plotting
200 current_loss = 0
201 all_losses = []
202 
203 def timeSince(since):
204     now = time.time()
205     s = now - since
206     m = math.floor(s / 60)
207     s -= m * 60
208     return '%dm %ds' % (m, s)
209 
210 start = time.time()
211 
212 for iter in range(1, n_iters + 1):
213     category, line, category_tensor, line_tensor = randomTrainingExample()
214     output, loss = train(category_tensor, line_tensor)
215     current_loss += loss
216 
217     # Print iter number, loss, name and guess
218     if iter % print_every == 0:
219         guess, guess_i = categoryFromOutput(output)
220         correct = '✓' if guess == category else '✗ (%s)' % category
221         print('%d %d%% (%s) %.4f %s / %s %s' % (iter, iter / n_iters * 100, timeSince(start), loss, line, guess, correct))
222 
223     # Add current loss avg to list of losses
224     if iter % plot_every == 0:
225         all_losses.append(current_loss / plot_every)
226         current_loss = 0
227         
228 import matplotlib.pyplot as plt
229 import matplotlib.ticker as ticker
230 
231 plt.figure()
232 plt.plot(all_losses)

View Code

posted @ 2020-04-15 17:32 三年一梦阅读(746) 评论(0) 收藏举报

刷新页面返回顶部

三年一梦

NLP FROM SCRATCH: CLASSIFYING NAMES WITH A CHARACTER-LEVEL RNN

Exercises

公告