Reading Report

Study Report on Transformer Principles

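I. Background and Goals

The Transformer is a milestone model in deep learning. Since it was introduced in the 2017 paper "Attention Is All You Need", it has become a core architecture in natural language processing, computer vision, and other fields. For this report I worked through the Bilibili video 《Transformer原理最强动画讲解》 (BV1fGeAz6Eie) to study its core principles systematically, including the self-attention mechanism, the encoder-decoder structure, and multi-head attention. I then implemented the key modules in code to better understand the model's underlying logic. The goal is to understand how the Transformer works and how it can be implemented.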

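II. Key Concepts

1. Self-Attention
Self-attention is the core of the Transformer. For each position in the input sequence it computes a weight for every position in the sequence (including itself), which captures dependencies within the sequence. The computation has three steps: project the input into query (Query), key (Key), and value (Value) vectors; compute attention scores as scaled dot products of queries and keys and normalize them with Softmax; take the weighted sum of the value vectors to obtain the attention output.
2. Multi-Head Attention
The attention computation is split across several "heads". Each head works on its own lower-dimensional projection and can learn a different pattern of relationships. The head outputs are then concatenated and passed through a final linear layer, which improves the model's ability to capture complex features.
3. Encoder-Decoder Structure
The encoder is a stack of layers, each made of a self-attention sublayer and a feed-forward network, and it encodes the input sequence. Each decoder layer additionally uses masked multi-head self-attention (which prevents a position from seeing future tokens) and attends to the encoder output through cross-attention. The decoder handles sequence generation.
4. Positional Encoding
The Transformer has no recurrent or convolutional structure, so positional information must be injected into the input sequence through positional encoding, commonly implemented with sine and cosine functions or with learnable position embeddings.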
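
The sine/cosine variant mentioned in item 4 is not implemented later in this report, so here is a minimal sketch of it. It assumes the fixed formulation from "Attention Is All You Need" (even dimensions use sine, odd dimensions use cosine, with frequencies based on 10000) and an even `d_model`. The function name `sinusoidal_positional_encoding` is my own and is not taken from the video.

```python
import math

import torch


def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a [seq_len, d_model] table of fixed encodings (d_model assumed even)."""
    # Column vector of positions 0..seq_len-1: [seq_len, 1]
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    # One frequency per pair of dimensions: 1 / 10000^(2i / d_model)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe


# Usage: add the table to token embeddings; it broadcasts over the batch dimension.
x = torch.randn(2, 10, 512)  # [batch_size, seq_len, d_model]
x_with_pos = x + sinusoidal_positional_encoding(10, 512)
print(x_with_pos.shape)
```

A learnable alternative would replace the fixed table with an `nn.Embedding` indexed by position. Which one works better depends on the task.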

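III. Code Practice: Implementing Simple Self-Attention and Multi-Head Attention

(1) Environment Setup

The code below targets Python 3.10 and PyTorch 2.0. Install the dependencies first: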

bash

pip install torch numpy

(2) Self-Attention Implementation

python

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        # d_model is the feature dimension (single head here; the multi-head
        # version additionally requires d_model to be divisible by the head count)
        self.d_model = d_model
        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.scale = math.sqrt(d_model)

    def forward(self, x, mask=None):
        # x: [batch_size, seq_len, d_model]
        # Generate Q, K, V
        Q = self.q_linear(x)  # [batch_size, seq_len, d_model]
        K = self.k_linear(x)
        V = self.v_linear(x)

        # Attention scores: Q @ K^T / sqrt(d_model) -> [batch_size, seq_len, seq_len]
        attention_scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale

        # Masking (prevents attending to padding or future positions);
        # positions where mask == 0 are blocked
        if mask is not None:
            attention_scores = attention_scores.masked_fill(mask == 0, -1e9)

        # Softmax normalization gives the attention weights
        attention_weights = F.softmax(attention_scores, dim=-1)
        attention_weights = self.dropout(attention_weights)

        # Weighted sum gives the output
        output = torch.matmul(attention_weights, V)  # [batch_size, seq_len, d_model]
        return output, attention_weights


# Test single-head self-attention
if __name__ == "__main__":
    d_model = 512
    seq_len = 10
    batch_size = 2
    x = torch.randn(batch_size, seq_len, d_model)
    attention = SelfAttention(d_model)
    output, weights = attention(x)
    print(f"Self-attention output shape: {output.shape}")
    print(f"Attention weights shape: {weights.shape}")
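
The test above never passes a `mask`, so the masking branch goes unexercised. The sketch below shows one way a causal (look-ahead) mask could be built and what it does to the weights. It assumes the same convention as the class above, where entries equal to 0 are blocked. The helper names `causal_mask` and `masked_attention_weights` are hypothetical, and the learned projections and dropout are left out to keep it short.

```python
import math

import torch
import torch.nn.functional as F


def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular [seq_len, seq_len] mask: 1 = may attend, 0 = blocked."""
    return torch.tril(torch.ones(seq_len, seq_len))


def masked_attention_weights(q: torch.Tensor, k: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Softmax attention weights with the mask == 0 positions suppressed."""
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    scores = scores.masked_fill(mask == 0, -1e9)  # blocked positions get ~zero weight
    return F.softmax(scores, dim=-1)


torch.manual_seed(0)
q = torch.randn(2, 4, 8)  # [batch_size, seq_len, d_model]
k = torch.randn(2, 4, 8)
weights = masked_attention_weights(q, k, causal_mask(4))  # mask broadcasts over the batch
print(weights[0])  # everything above the diagonal is zero
```

Each row still sums to 1, but position i only distributes weight over positions 0..i. The same `[seq_len, seq_len]` mask can be passed as the `mask` argument of `SelfAttention.forward`.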

(3) Multi-Head Attention Implementation

python

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"

        self.d_model = d_model
        self.n_heads = n_heads
        self.d_head = d_model // n_heads  # feature dimension of each head

        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.out_linear = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(dropout)
        self.scale = math.sqrt(self.d_head)

    def split_heads(self, x):
        # Split heads: [batch_size, seq_len, d_model] -> [batch_size, n_heads, seq_len, d_head]
        batch_size, seq_len, _ = x.shape
        return x.view(batch_size, seq_len, self.n_heads, self.d_head).permute(0, 2, 1, 3)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape

        # Generate Q, K, V and split them into heads
        Q = self.split_heads(self.q_linear(x))
        K = self.split_heads(self.k_linear(x))
        V = self.split_heads(self.v_linear(x))

        # Attention scores: [batch_size, n_heads, seq_len, seq_len]
        attention_scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale

        # Masking: mask must broadcast to the score shape, e.g. [seq_len, seq_len]
        # or [batch_size, 1, seq_len, seq_len]; positions where mask == 0 are blocked
        if mask is not None:
            attention_scores = attention_scores.masked_fill(mask == 0, -1e9)

        # Normalization and dropout
        attention_weights = F.softmax(attention_scores, dim=-1)
        attention_weights = self.dropout(attention_weights)

        # Weighted sum, then concatenate the heads back to [batch_size, seq_len, d_model]
        output = torch.matmul(attention_weights, V)
        output = output.permute(0, 2, 1, 3).contiguous()
        output = output.view(batch_size, seq_len, self.d_model)

        # Final linear projection
        output = self.out_linear(output)
        return output, attention_weights


# Test multi-head attention
if __name__ == "__main__":
    d_model = 512
    n_heads = 8
    seq_len = 10
    batch_size = 2
    x = torch.randn(batch_size, seq_len, d_model)
    multi_attention = MultiHeadAttention(d_model, n_heads)
    output, weights = multi_attention(x)
    print(f"Multi-head attention output shape: {output.shape}")
    print(f"Multi-head attention weights shape: {weights.shape}")
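
One way to sanity-check the per-head computation (scores, Softmax, weighted sum) is to compare it against PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention`, which is available from PyTorch 2.0, the version assumed in the environment section. The sketch below checks only the core attention step on already-split heads, without the linear projections, mask, or dropout. `manual_attention` is a hypothetical helper that mirrors the steps in `forward` above.

```python
import math

import torch
import torch.nn.functional as F


def manual_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention on [batch_size, n_heads, seq_len, d_head] tensors."""
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)


torch.manual_seed(0)
q = torch.randn(2, 8, 10, 64)  # [batch_size, n_heads, seq_len, d_head]
k = torch.randn(2, 8, 10, 64)
v = torch.randn(2, 8, 10, 64)

reference = F.scaled_dot_product_attention(q, k, v)  # built-in, no mask, no dropout
max_diff = (manual_attention(q, k, v) - reference).abs().max().item()
print(f"max abs difference vs. built-in: {max_diff:.2e}")
```

A difference at the level of floating-point rounding suggests the scaling and the Softmax dimension are right. The built-in does not return attention weights, so the hand-written version remains useful for inspection.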

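IV. Takeaways and Summary

Through the video and the coding exercises, I worked through the core principles of the Transformer: self-attention models features by computing relationships within a sequence, multi-head attention further strengthens the model's representational capacity, and the encoder-decoder structure suits sequence-to-sequence tasks. While implementing the code I came to understand key details such as attention score computation, masking, and splitting and concatenating heads. I also recognized how important components such as positional encoding, residual connections, and layer normalization are to Transformer performance.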
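
Residual connections and layer normalization are not implemented in the code sections of this report. As a rough sketch of where they sit, the block below wraps attention and a feed-forward network in the "add, then normalize" pattern. It assumes the post-norm layout of the original paper and uses PyTorch's built-in `nn.MultiheadAttention` in place of the hand-written class. The class name `EncoderLayerSketch` and the default `d_ff=2048` are my own choices for illustration.

```python
import torch
import torch.nn as nn


class EncoderLayerSketch(nn.Module):
    """One encoder layer: self-attention and feed-forward, each with residual + LayerNorm."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch_size, seq_len, d_model]
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + self.dropout(attn_out))      # residual, then LayerNorm
        x = self.norm2(x + self.dropout(self.ffn(x)))   # residual, then LayerNorm
        return x


layer = EncoderLayerSketch()
print(layer(torch.randn(2, 10, 512)).shape)
```

The residual path lets each sublayer learn a correction to its input rather than a full replacement. Many later models move the normalization before the sublayer (pre-norm) instead.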

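The animated explanations in the video also made the abstract attention mechanism intuitive and cleared up my confusion about how attention weights reflect dependencies within a sequence. Next I plan to study Transformer variants (such as BERT and GPT) and try applying them to practical tasks such as text classification and machine translation, to deepen my understanding of the model.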

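V. Directions for Further Study

1. Implement a complete Transformer encoder-decoder with positional encoding and use it for a machine translation task;
2. Explore attention visualization methods to show which parts of a sequence the model focuses on;
3. Study Transformer applications in computer vision (such as the Vision Transformer) and compare them with CNNs.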

posted @ 2025-12-26 13:00 不要命的蛋
