
Dive into Deep Learning by Mu Li, Part 14 — Recurrent Neural Network Basics

Recurrent Neural Networks

If convolutional neural networks can process spatial information effectively, recurrent neural networks are better suited to processing sequential information.

A recurrent neural network introduces a state variable that stores past information; together with the current input, this state determines the current output.

Autoregressive Models

1. To predict the value $x_t$, a model that uses a fixed-length window of past observations $x_{t-1}, \ldots, x_{t-\tau}$ is called an autoregressive model.

2. Building on autoregression, suppose the observations $x$ are driven by an unobservable latent variable $z$, and that $z$ itself evolves autoregressively; such a model is called a latent autoregressive model (see the formulas below).
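In symbols (a sketch in the notation of the D2L text, with $z_t$ denoting the latent state and $g$ an unspecified update function):

$$x_t \sim P(x_t \mid x_{t-1}, \ldots, x_{t-\tau}) \quad \text{(autoregressive)}$$

$$\hat{x}_t = P(x_t \mid z_t), \qquad z_t = g(z_{t-1}, x_{t-1}) \quad \text{(latent autoregressive)}$$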

The following code generates simulated data from a sine function plus random noise and trains a simple autoregressive model with τ = 4 using a two-layer multilayer perceptron:

Experimental findings:

1. On the training set (time < 600), the model fits the data reasonably well.

2. Beyond time step 600, the predicted curve flattens to a roughly constant line and no longer fits the data. The model performs poorly here because prediction errors accumulate.

3. To verify this error accumulation, set k to 1, 4, 16, and 64 and compute the model's k-step-ahead predictions; once k exceeds 4, prediction quality degrades noticeably (a sketch of these predictions follows the training code below).

import torch
from torch import nn
from d2l import torch as d2l
import matplotlib.pyplot as plt


def init_weight(m):
    if type(m) == nn.Linear:
        nn.init.xavier_uniform_(m.weight)


def get_net():
    net = nn.Sequential(nn.Linear(4, 10),
                        nn.ReLU(),
                        nn.Linear(10, 1))
    net.apply(init_weight)
    return net


def train(net, train_iter, loss, epochs, lr):
    trainer = torch.optim.Adam(net.parameters(), lr)
    for epoch in range(epochs):
        for X, y in train_iter:
            trainer.zero_grad()
            l = loss(net(X), y)
            l.sum().backward()
            trainer.step()
        print(f'epoch {epoch + 1}, 'f'loss: {d2l.evaluate_loss(net, train_iter, loss):f}')


if __name__ == "__main__":
    T = 1000
    time = torch.arange(1, T + 1, dtype=torch.float32)
    x = torch.sin(0.01 * time) + torch.normal(0, 0.2, (T,))
    d2l.plot(time, [x], 'time', 'x', xlim=[1, 1000], figsize=(6, 3))
    # plt.show()
    # Set tau = 4: the autoregressive model takes the previous 4 points as input
    tau = 4
    features = torch.zeros((T - tau, tau))
    for i in range(tau):
        features[:, i] = x[i: T - tau + i]
    labels = x[tau:].reshape((-1, 1))

    # Use the first 600 examples as the training set
    batch_size, n_train = 16, 600
    train_iter = d2l.load_array((features[:n_train], labels[:n_train]), batch_size, is_train=True)
    loss = nn.MSELoss(reduction='none')

    net = get_net()
    train(net, train_iter, loss, 5, 0.01)
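
The script above only trains the model; the behaviour described in the findings comes from feeding predictions back in as inputs. A minimal prediction sketch, reusing net, x, time, features, tau, T and n_train from the script above (the names onestep_preds and multistep_preds are mine):

    # One-step-ahead predictions over the whole series
    onestep_preds = net(features)
    d2l.plot([time, time[tau:]],
             [x.detach().numpy(), onestep_preds.detach().numpy()],
             'time', 'x', legend=['data', '1-step preds'],
             xlim=[1, 1000], figsize=(6, 3))

    # Multistep predictions: beyond t = n_train + tau the model only sees its
    # own previous outputs, so errors compound
    multistep_preds = torch.zeros(T)
    multistep_preds[: n_train + tau] = x[: n_train + tau]
    for i in range(n_train + tau, T):
        multistep_preds[i] = net(multistep_preds[i - tau:i].reshape((1, -1)))
    d2l.plot([time, time[tau:], time[n_train + tau:]],
             [x.detach().numpy(), onestep_preds.detach().numpy(),
              multistep_preds[n_train + tau:].detach().numpy()],
             'time', 'x', legend=['data', '1-step preds', 'multistep preds'],
             xlim=[1, 1000], figsize=(6, 3))
    plt.show()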

The following code performs text preprocessing:

import collections
import re
from d2l import torch as d2l


def read_time_machine():
    with open(d2l.download('time_machine'), 'r') as f:
        lines = f.readlines()
    return [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]


def tokenize(lines, token='word'):
    if token == 'word':
        return [line.split() for line in lines]
    elif token == 'char':
        return [list(line) for line in lines]
    else:
        print('error: unknown type of token')


class Vocab:
    def __init__(self, tokens=None, min_freq=0, reserved_tokens=None):
        if tokens is None:
            tokens = []
        if reserved_tokens is None:
            reserved_tokens = []

        counter = count_corpus(tokens)
        self._token_freqs = sorted(counter.items(), key=lambda x:x[1], reverse=True)
        self.idx_to_token = ['<unk>'] + reserved_tokens
        self.token_to_idx = {token: idx for idx, token in enumerate(self.idx_to_token)}
        for token, freq in self._token_freqs:
            if freq < min_freq:
                break
            if token not in self.token_to_idx:
                self.idx_to_token.append(token)
                self.token_to_idx[token] = len(self.idx_to_token) - 1

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_tokens(self, indices):
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[index] for index in indices]

    @property
    def unk(self):
        return 0

    @property
    def token_freq(self):
        return self._token_freqs


def count_corpus(tokens):
    if len(tokens) == 0 or isinstance(tokens[0], list):
        tokens = [token for line in tokens for token in line]
    return collections.Counter(tokens)


def load_corpus_time_machine(max_tokens=-1):
    lines = read_time_machine()
    tokens = tokenize(lines, 'char')
    vocab = Vocab(tokens)
    # Since each line of the time machine dataset is not necessarily
    # a sentence or a paragraph, flatten all lines into a single list
    corpus = [vocab[token] for line in tokens for token in line]
    if max_tokens > 0:
        corpus = corpus[:max_tokens]
    return corpus, vocab


if __name__ == "__main__":
    d2l.DATA_HUB['time_machine'] = (d2l.DATA_URL + 'timemachine.txt', '090b5e7e70c295757f55df93cb0a180b9691891a')
    lines = read_time_machine()
    # print(f'# total number of text lines: {len(lines)}')
    # print(lines[0])
    # print(lines[10])
    tokens = tokenize(lines)
    #for i in range(11):
    #    print(tokens[i])
    vocab = Vocab(tokens)
    print(list(vocab.token_to_idx.items())[:10])
    for i in [0, 10]:
        print('text: ', tokens[i])
        print('indices: ', vocab[tokens[i]])
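
The load_corpus_time_machine function above is not exercised in the __main__ block; a minimal usage sketch continuing that block (the name vocab_char is mine, to avoid overwriting the word-level vocab above):

    # Flatten the whole book into one list of character indices and build its vocabulary
    corpus, vocab_char = load_corpus_time_machine()
    print(len(corpus), len(vocab_char))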

 

Language Models:

Suppose the tokens of a text sequence of length $T$ are $x_1, x_2, \ldots, x_T$. Then $x_t$ ($1 \le t \le T$) can be regarded as the observation or label of the text sequence at time step $t$.

Given such a text sequence, the goal of a language model is to estimate the joint probability of the sequence, $P(x_1, x_2, \ldots, x_T)$.

Starting from the basic rule of probability (the chain rule),

$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1}).$$

For example:

$$P(\text{deep}, \text{learning}, \text{is}, \text{fun}) = P(\text{deep})\,P(\text{learning} \mid \text{deep})\,P(\text{is} \mid \text{deep}, \text{learning})\,P(\text{fun} \mid \text{deep}, \text{learning}, \text{is}).$$

Sometimes we need the conditional probability of a word given the few words preceding it; these probabilities are the parameters of the language model. A slightly inaccurate way to estimate such a probability is relative frequency, e.g.

$$\hat{P}(\text{learning} \mid \text{deep}) = \frac{n(\text{deep}, \text{learning})}{n(\text{deep})},$$

where the numerator is the number of occurrences of the word pair "deep learning" and the denominator is the number of occurrences of the word "deep". The trouble is that some uncommon word combinations may appear only two or three times in the dataset, or the dataset itself may be too small.

One solution is to apply some form of Laplace smoothing, adding a small constant to every count. Writing $n$ for the total number of words in the training set and $m$ for the number of unique words, the smoothed unigram, bigram and trigram estimates are

$$\hat{P}(x) = \frac{n(x) + \epsilon_1/m}{n + \epsilon_1}, \qquad \hat{P}(x' \mid x) = \frac{n(x, x') + \epsilon_2 \hat{P}(x')}{n(x) + \epsilon_2}, \qquad \hat{P}(x'' \mid x, x') = \frac{n(x, x', x'') + \epsilon_3 \hat{P}(x'')}{n(x, x') + \epsilon_3},$$

where $\epsilon_1$, $\epsilon_2$ and $\epsilon_3$ are hyperparameters.

But other problems remain: the approach above completely ignores the meaning of words, so it cannot handle synonyms, for example. A model that merely counts the frequencies of previously seen word sequences will therefore perform poorly in such cases.

Recall the earlier discussion of Markov models: if $P(x_{t+1} \mid x_t, \ldots, x_1) = P(x_{t+1} \mid x_t)$, the distribution over the sequence satisfies the first-order Markov property. Such properties lead to a family of approximations commonly used in sequence modeling, e.g. for a sequence of four tokens:

$$P(x_1, x_2, x_3, x_4) = P(x_1)\,P(x_2)\,P(x_3)\,P(x_4),$$
$$P(x_1, x_2, x_3, x_4) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_2)\,P(x_4 \mid x_3),$$
$$P(x_1, x_2, x_3, x_4) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1, x_2)\,P(x_4 \mid x_2, x_3).$$

Typically, the models involving one, two, and three variables are called unigram, bigram, and trigram models, respectively.

 

Natural Language Statistics:

Look at the ten most frequent words:

    d2l.DATA_HUB['time_machine'] = (d2l.DATA_URL + 'timemachine.txt', '090b5e7e70c295757f55df93cb0a180b9691891a')
    tokens = p2.tokenize(p2.read_time_machine())
    corpus = [token for line in tokens for token in line]
    vocab = d2l.Vocab(corpus)
    print(vocab.token_freqs[:10])

Plot the frequencies of all words:

    freqs = [freq for token, freq in vocab.token_freqs]
    d2l.plot(freqs, xlabel='token: x', ylabel='frequency: n(x)',
             xscale='log', yscale='log')
    plt.show()

 

Apart from the first few words, word frequency falls off rapidly; on the log-log plot it is roughly a straight line, i.e. the frequencies approximately follow a power law (Zipf's law).
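
In formula form, Zipf's law says the frequency $n_i$ of the $i$-th most frequent word satisfies

$$n_i \propto \frac{1}{i^\alpha}, \qquad \text{equivalently} \qquad \log n_i = -\alpha \log i + c,$$

where $\alpha$ characterizes the distribution and $c$ is a constant; this is why the curve looks linear on log-log axes.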

Now look at the frequency distributions of bigrams and trigrams:

bigram_tokens = [' '.join(pair) for pair in zip(corpus[:-1], corpus[1:])]
bigram_vocab = d2l.Vocab(bigram_tokens)

trigram_tokens = [' '.join(triple) for triple in zip(
    corpus[:-2], corpus[1:-1], corpus[2:])]
trigram_vocab = d2l.Vocab(trigram_tokens)

bigram_freqs = [freq for token, freq in bigram_vocab.token_freqs]
trigram_freqs = [freq for token, freq in trigram_vocab.token_freqs]
d2l.plot([freqs, bigram_freqs, trigram_freqs], xlabel='token: x',
         ylabel='frequency: n(x)', xscale='log', yscale='log',
         legend=['unigram', 'bigram', 'trigram'])
plt.show()

Findings:

Beyond unigrams, bigram and trigram frequencies also fall off rapidly, again roughly following a Zipf-style power law.

The number of distinct n-grams is not that large, which indicates that language is not a random combination of words but is highly structured.

Because of the problems mentioned earlier, Laplace smoothing is poorly suited to language modeling, so we use models based on deep learning instead.

 

Recurrent Neural Networks:
RNNs with Hidden States:

The hidden state at a given time step is computed from (1) the hidden state at the previous time step, (2) the input at the current time step, and (3) the weights and bias. The dimensionality of the hidden state is set by the num_hiddens parameter.
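
Concretely, and matching the scratch implementation later in this post, with minibatch input $\mathbf{X}_t$, hidden state $\mathbf{H}_t$, output $\mathbf{O}_t$, and activation $\phi$ (tanh in the code):

$$\mathbf{H}_t = \phi(\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh} + \mathbf{b}_h), \qquad \mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q.$$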

Perplexity:

Perplexity can be understood as the harmonic mean of the number of real choices we have when deciding which token comes next, i.e. on average the model faces that many equally likely choices. We use perplexity to measure the quality of a language model.
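
Formally, perplexity is the exponential of the average cross-entropy over a sequence of $n$ tokens, which is what math.exp(metric[0] / metric[1]) computes in the training loop below:

$$\exp\!\left(-\frac{1}{n} \sum_{t=1}^{n} \log P(x_t \mid x_{t-1}, \ldots, x_1)\right).$$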

 

Code for implementing a recurrent neural network from scratch:

import math
import torch
from torch import nn
from torch.nn import functional as F
import matplotlib.pyplot as plt
from d2l import torch as d2l
import p3_lnn_dataSet as p3


def get_params(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return torch.randn(size=shape, device=device) * 0.01

    # Hidden-layer parameters
    W_xh = normal((num_inputs, num_hiddens))
    W_hh = normal((num_hiddens, num_hiddens))
    b_h = torch.zeros(num_hiddens, device=device)
    # Output-layer parameters
    W_hq = normal((num_hiddens, num_outputs))
    b_q = torch.zeros(num_outputs, device=device)
    # Attach gradients to the parameters
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    for param in params:
        param.requires_grad_(True)
    return params


def init_rnn_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device), )


def rnn(inputs, state, params):
    # Shape of inputs: (num_steps, batch_size, vocab_size)
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    # Shape of X: (batch_size, vocab_size)
    for X in inputs:
        H = torch.tanh(torch.mm(X, W_xh) + torch.mm(H, W_hh) + b_h)
        Y = torch.mm(H, W_hq) + b_q
        outputs.append(Y)
    return torch.cat(outputs, dim=0), (H,)


class RNNModelScratch:
    """从零开始实现的循环神经网络模型"""
    def __init__(self, vocab_size, num_hiddens, device, get_params, init_state, forward_fn):
        self.vocab_size, self.num_hiddens = vocab_size, num_hiddens
        self.params = get_params(vocab_size, num_hiddens, device)
        self.init_state, self.forward_fn = init_state, forward_fn

    def __call__(self, X, state):
        X = F.one_hot(X.T, self.vocab_size).type(torch.float32)
        return self.forward_fn(X, state, self.params)

    def begin_state(self, batch_size, device):
        return self.init_state(batch_size, self.num_hiddens, device)


def predict_ch8(prefix, num_preds, net, vocab, device):
    """在prefix后面生成新字符"""
    state = net.begin_state(batch_size=1, device=device)
    outputs = [vocab[prefix[0]]]
    get_input = lambda: torch.tensor([outputs[-1]], device=device).reshape((1, 1))
    for y in prefix[1:]:  # warm-up period: feed the prefix to build up the state
        _, state = net(get_input(), state)
        outputs.append(vocab[y])
    for _ in range(num_preds):  # predict num_preds steps
        y, state = net(get_input(), state)
        outputs.append(int(y.argmax(dim=1).reshape(1)))
    return ''.join([vocab.idx_to_token[i] for i in outputs])


def grad_clipping(net, theta): #@save
    """裁剪梯度"""
    if isinstance(net, nn.Module):
        params = [p for p in net.parameters() if p.requires_grad]
    else:
        params = net.params
    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm


def train_epoch_ch8(net, train_iter, loss, updater, device, use_random_iter):
    """训练网络一个迭代周期(定义见第8章)"""
    state, timer = None, d2l.Timer()
    metric = d2l.Accumulator(2) # 训练损失之和,词元数量
    for X, Y in train_iter:
        if state is None or use_random_iter:
            # Initialize state on the first iteration or when using random sampling
            state = net.begin_state(batch_size=X.shape[0], device=device)
        else:
            if isinstance(net, nn.Module) and not isinstance(state, tuple):
                # For nn.GRU, state is a single tensor
                state.detach_()
            else:
                # For nn.LSTM, or for our scratch implementation, state is a tuple of tensors
                for s in state:
                    s.detach_()
        y = Y.T.reshape(-1)
        X, y = X.to(device), y.to(device)
        y_hat, state = net(X, state)
        l = loss(y_hat, y.long()).mean()
        if isinstance(updater, torch.optim.Optimizer):
            updater.zero_grad()
            l.backward()
            grad_clipping(net, 1)
            updater.step()
        else:
            l.backward()
            grad_clipping(net, 1)
            # batch_size=1 because mean() has already been applied to the loss
            updater(batch_size=1)
        metric.add(l * y.numel(), y.numel())
    return math.exp(metric[0] / metric[1]), metric[1] / timer.stop()


def train_ch8(net, train_iter, vocab, lr, num_epochs, device,
              use_random_iter=False):
    """Train the model (defined in Chapter 8 of D2L)."""
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', ylabel='perplexity',
                            legend=['train'], xlim=[10, num_epochs])
    # Initialization
    if isinstance(net, nn.Module):
        updater = torch.optim.SGD(net.parameters(), lr)
    else:
        updater = lambda batch_size: d2l.sgd(net.params, lr, batch_size)
    predict = lambda prefix: predict_ch8(prefix, 50, net, vocab, device)
    # Training and prediction
    for epoch in range(num_epochs):
        ppl, speed = train_epoch_ch8(
            net, train_iter, loss, updater, device, use_random_iter)
        if (epoch + 1) % 10 == 0:
            print(predict('time traveller'))
            animator.add(epoch + 1, [ppl])
    print(f'perplexity {ppl:.1f}, {speed:.1f} tokens/sec on {str(device)}')
    print(predict('time traveller'))
    print(predict('traveller'))


if __name__ == "__main__":
    d2l.DATA_HUB['time_machine'] = (d2l.DATA_URL + 'timemachine.txt', '090b5e7e70c295757f55df93cb0a180b9691891a')
    batch_size, num_steps = 32, 35
    train_iter, vocab = p3.load_data_time_machine(batch_size, num_steps)

    X = torch.arange(10).reshape((2, 5))

    num_hiddens = 512
    net = RNNModelScratch(len(vocab), num_hiddens, d2l.try_gpu(), get_params,
                          init_rnn_state, rnn)
    state = net.begin_state(X.shape[0], d2l.try_gpu())
    Y, new_state = net(X.to(d2l.try_gpu()), state)
    num_epochs, lr = 500, 1
    train_ch8(net, train_iter, vocab, lr, num_epochs, d2l.try_gpu())
    plt.show()

Training output:

 

posted on 2025-05-29 11:53 by 欢乐豆掠夺者