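LLM from Scratch

This post records my study of the Transformer and implements as many of its building blocks as possible.

References: Datawhale, Happy LLM; Fudan University, "Large Language Models: From Theory to Practice"; Mu Li, "Dive into Deep Learning".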

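1. Overview of the LLM training pipeline

Data collection and cleaning: remove ads, duplicated text, and low-quality content; normalize the language
Tokenization and vocabulary construction: count subword frequencies over the corpus and build the vocabulary, after which text can be mapped to token IDs
Data preprocessing / splitting the training data
Pre-training: autoregressive (text generation, dialogue, creative writing: GPT), autoencoding (text understanding, classification, NER: BERT), encoder-decoder (translation, summarization: T5, BART)
Supervised fine-tuning (SFT)
Alignment optimization (RLHF/DPO)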
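
To make the vocabulary step concrete, here is a minimal sketch of building a vocabulary from frequency counts and mapping text to token IDs. It assumes whitespace-separated words stand in for subwords (real pipelines typically use a subword algorithm such as BPE), and the names `build_vocab`, `encode`, `<pad>`, and `<unk>` are illustrative, not taken from the references.

```python
from collections import Counter

def build_vocab(corpus, min_freq=1, specials=("<pad>", "<unk>")):
    # Count how often each whitespace-separated token occurs in the corpus
    counts = Counter(tok for line in corpus for tok in line.split())
    # Most frequent first; ties broken alphabetically so the result is deterministic
    tokens = sorted((t for t, c in counts.items() if c >= min_freq),
                    key=lambda t: (-counts[t], t))
    itos = list(specials) + tokens              # ID -> token
    stoi = {t: i for i, t in enumerate(itos)}   # token -> ID
    return stoi, itos

def encode(text, stoi):
    # Tokens missing from the vocabulary map to the <unk> ID
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in text.split()]

corpus = ["the cat sat", "the dog sat", "the cat ran"]
stoi, itos = build_vocab(corpus)
ids = encode("the cat flew", stoi)  # "flew" is out of vocabulary
```

Once text is a list of integer IDs like this, it can be fed to an embedding layer.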

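2. Self-attention and the other components

Before attention, text models were generally trained with RNNs (recurrent neural networks). RNN/LSTM models pass information along one step at a time, so information from distant positions is hard to carry forward; this is the long-range dependency problem.
Self-attention computes an attention score from every token in the input sequence to every other token to capture the relations between them. The score is computed directly regardless of distance, and the whole sequence can be processed in parallel.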

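2.1 Self-attention in detail

The computation proceeds as follows:

A text input is first tokenized, and each token is turned into a word vector by an embedding layer
The input sequence is then $X = [x_1, x_2, \dots, x_n]$, where each $x_i$ is a word vector of dimension $d$
$X$ is projected by the weight matrices $W_q, W_k, W_v$ to the matrices Q, K, V, for example $Q = X W_q$
Compute the attention scores: $Score = Q K^T$; the dot product measures how relevant each query is to each key
Scale the scores: $Score / \sqrt{d_k}$, which keeps large dot products from saturating the softmax and causing vanishing gradients
Normalize the weights: $attention\_weights = softmax(Score / \sqrt{d_k})$, so that each row sums to 1
Weighted aggregation: $Z_{out} = attention\_weights \cdot V$; each output row is a new representation of the current token that blends in information from the whole sequence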
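
The steps above can be traced by hand on a tiny input. The following is a minimal sketch in plain Python (no PyTorch) so every intermediate value is visible; the 3-token input `X` and the identity projection matrices are toy assumptions chosen only to keep the numbers readable.

```python
import math

def matmul(a, b):
    # Plain nested-list matrix product
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    # Subtract the max before exponentiating for numerical stability
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                                # Score = Q K^T
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]  # divide by sqrt(d_k)
    weights = [softmax(row) for row in scaled]                      # each row sums to 1
    return matmul(weights, V), weights                              # Z_out = weights V

# Toy input: 3 tokens with d = 2; identity projections (assumption for illustration)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_q = W_k = W_v = [[1.0, 0.0], [0.0, 1.0]]
Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
Z, weights = attention(Q, K, V)
```

Here token 0 has raw scores [1, 0, 1] against the three keys, so after the softmax it weights tokens 0 and 2 equally and token 1 less.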

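The Q, K, V matrices are obtained by multiplying the input sequence by three different weight matrices, that is, by one linear transformation each.

Why use the dot product to measure similarity: it preserves directional similarity and also takes vector length into account, whereas cosine similarity keeps only the direction and discards the length.

Scaling the dot product by $\sqrt{d_k}$ partly compensates for this length effect, although it is a fixed factor rather than a per-vector normalization.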
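
The effect of the scaling can be seen numerically. If the components of q and k are independent with zero mean and unit variance, their dot product has variance $d_k$, so raw scores grow with the dimension. The sketch below uses hypothetical raw scores of that magnitude for $d_k = 64$ and compares the softmax with and without scaling.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw = [16.0, 8.0, 0.0]  # hypothetical raw dot products, on the order of sqrt(d_k) apart

unscaled = softmax(raw)                                # nearly one-hot: saturated
scaled = softmax([s / math.sqrt(d_k) for s in raw])    # scores become [2, 1, 0]
```

Without scaling, almost all of the weight lands on the first key and the other entries receive close to zero gradient; with scaling, the distribution stays spread out enough to train.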

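2.2 Understanding Q, K, V

Query: what the position currently being processed is looking for
Key: the features of every position in the sequence, used to compute how well each one matches Q
Value: the carrier of the actual semantic content, used in the final weighted aggregation
Q, K, and V are all linear transformations of the input sequence with trainable parameter matrices. The three transformations map the input into three different subspaces: one for the need (Q), one for the identifier (K), and one for the content (V).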

2.3 Implementing the attention mechanism

# Assume Q, K, V have already been computed (via linear projections)
# The mask is explained in a later section
# With num_heads = 1 this is single-head attention
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledDotProductAttention(nn.Module):
  def __init__(self, dropout_rate=0.1):
    super().__init__()
    self.dropout = nn.Dropout(dropout_rate)

  def forward(self, Q, K, V, mask=None):
    # Q: [batch_size, num_heads, seq_len_q, d_k]
    # K: [batch_size, num_heads, seq_len_k, d_k]
    # V: [batch_size, num_heads, seq_len_v, d_v] (seq_len_k must equal seq_len_v)

    # 1. Dot-product similarity between Q and K
    # K.transpose(-2, -1) swaps the last two dims -> [batch_size, num_heads, d_k, seq_len_k]
    matmul_qk = torch.matmul(Q, K.transpose(-2, -1))
    # Result shape: [batch_size, num_heads, seq_len_q, seq_len_k]

    # 2. Scale the scores (keeps their variance under control)
    d_k = K.size(-1)  # dimension of the key vectors
    # Scaling by 1/sqrt(d_k) keeps the softmax out of its saturated, small-gradient regime
    scaled_scores = matmul_qk / math.sqrt(d_k)

    # 3. Apply the mask (if given)
    if mask is not None:
      # mask must broadcast to [batch_size, num_heads, seq_len_q, seq_len_k]
      # Positions where mask is 0 are set to -1e9, so their softmax weight is ~0
      scaled_scores = scaled_scores.masked_fill(mask == 0, -1e9)

    # 4. Attention weights
    # Softmax over the last dim -> distribution of shape [batch_size, num_heads, seq_len_q, seq_len_k]
    attn_weights = F.softmax(scaled_scores, dim=-1)

    # 5. Regularize the attention weights with dropout
    attn_weights = self.dropout(attn_weights)

    # 6. Weighted sum of the value vectors
    output = torch.matmul(attn_weights, V)

    return output, attn_weights

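2.4 Masking

There are several kinds of mask; here we use the causal mask (also called future masking) used by GPT-style models as the example.
Causal mask: a position cannot see information from later positions. The t-th token can attend only to itself and the t-1 tokens before it, which prevents information from the future leaking in.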

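2.4.1 Implementing the mask

Approach: build an upper-triangular matrix with -inf strictly above the main diagonal and 0 everywhere else

For example, with sequence length 3:

  Additive mask: [[0, -inf, -inf],
                  [0,    0, -inf],
                  [0,    0,    0]]

  Boolean mask:  [[ True, False, False],
                  [ True,  True, False],
                  [ True,  True,  True]]

Code:

seq_len: the sequence length
There are two ways to implement it: (a) an additive mask, (b) a boolean mask
- Additive mask: positions above the main diagonal hold -inf, so it can be added directly to the attention scores: scaled_scores + causal_mask
- Boolean mask: the lower triangle including the diagonal is True (may attend) and the rest is False; apply it with masked_fill: scaled_scores = scaled_scores.masked_fill(mask==0, -1e9)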

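  # Additive mask: 0 on and below the diagonal, -inf above it
  # (named differently from the boolean version so the two definitions do not overwrite each other)
  def create_additive_causal_mask(seq_len):
      return torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

  # Boolean mask: True where attention is allowed
  # If the mask is inverted (True above the diagonal), use scaled_scores = scaled_scores.masked_fill(mask, -1e9) instead
  def create_causal_mask(seq_len):
      return torch.tril(torch.ones(seq_len, seq_len), diagonal=0).bool()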
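
To see what the mask does to the attention weights, here is a minimal plain-Python sketch. It assumes uniform (all-zero) scores so that the only thing shaping the weights is the mask; `causal_mask` and `masked_softmax` are illustrative helper names.

```python
import math

def causal_mask(seq_len):
    # True where query position i may attend to key position j (j <= i)
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask):
    out = []
    for row, allowed in zip(scores, mask):
        # Disallowed positions get -inf, so exp() turns them into exactly 0
        filled = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        m = max(filled)  # finite, because the diagonal is always allowed
        exps = [math.exp(v - m) for v in filled]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.0] * 3 for _ in range(3)]  # uniform scores (assumption for illustration)
weights = masked_softmax(scores, causal_mask(3))
```

Row t of `weights` spreads its mass evenly over positions 0..t and assigns zero to every later position, while each row still sums to 1.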

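2.5 Multi-head attention

The implementation so far has num_heads = 1, i.e. single-head attention.
Multi-head attention splits the embedding dimension d_model evenly across h heads, so each head has dimension d_model / num_heads.
d_model must be divisible by num_heads.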
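
The even split can be illustrated without tensors. This is a minimal sketch for a single sequence (no batch dimension), using nested lists; `split_heads` and `merge_heads` are hypothetical helper names that mirror what a reshape followed by a transpose does in a tensor library.

```python
def split_heads(x, num_heads):
    # x: [seq_len][d_model] -> [num_heads][seq_len][head_dim]
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    head_dim = d_model // num_heads
    return [[row[h * head_dim:(h + 1) * head_dim] for row in x]
            for h in range(num_heads)]

def merge_heads(heads):
    # [num_heads][seq_len][head_dim] -> [seq_len][d_model] (concatenate per position)
    seq_len = len(heads[0])
    return [[v for head in heads for v in head[t]] for t in range(seq_len)]

# Two tokens with d_model = 8, split across 4 heads of dimension 2
x = [[0, 1, 2, 3, 4, 5, 6, 7],
     [10, 11, 12, 13, 14, 15, 16, 17]]
heads = split_heads(x, 4)
restored = merge_heads(heads)
```

Head h sees features h*head_dim through (h+1)*head_dim - 1 of every token, and merging the heads recovers the original layout.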

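2.5.1 Why use multi-head attention

A single head tends to over-attend to its own position: the diagonal weights become too large and other important positions are neglected
Multiple heads push the model to analyze the sequence from different perspectives, reducing this local bias
Each head can capture one kind of feature pattern, and concatenating them gives a richer combined representation
Combining information from several heads improves generalization
Attention weights become sparser, with each token attending to only a few relevant tokens (each head only has to handle a smaller subspace)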

2.5.2 Code implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads, is_causal=False):
        # d_model: embedding dimension
        # num_heads: number of attention heads
        # is_causal: whether to apply the causal mask (True for decoder self-attention)
        super().__init__()
        assert d_model % num_heads == 0  # d_model must be divisible by num_heads

        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads  # per-head dimension (even split)
        self.is_causal = is_causal

        # Linear projections that turn the inputs into Q, K, V
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)

        # Output projection applied after the heads are concatenated
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        batch_size, seq_len, _ = q.shape

        # 1. Linear projection + split into heads
        # [batch_size, seq_len, d_model] --> [batch_size, seq_len, num_heads, head_dim] --> [batch_size, num_heads, seq_len, head_dim]
        Q = self.W_q(q).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        K = self.W_k(k).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        V = self.W_v(v).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)

        # 2. Scaled dot-product scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.head_dim)

        # 3. Apply the causal mask (self-attention only, so q and k have the same length)
        if self.is_causal:
            mask = create_causal_mask(seq_len).to(scores.device)
            scores = scores.masked_fill(mask == 0, -1e9)

        # Softmax, then the weighted sum of the values
        attn_weights = F.softmax(scores, dim=-1)
        context = torch.matmul(attn_weights, V)

        # 4. Merge the heads
        # [batch_size, num_heads, seq_len, head_dim] --> [batch_size, seq_len, d_model]
        context = context.transpose(1, 2).contiguous()  # make memory contiguous for view
        context = context.view(batch_size, -1, self.d_model)

        # Output projection; the standard formulation applies no activation here
        return self.W_o(context)

2.6 Layer Norm

import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-5, elementwise_affine=True):
        """
        Args:
            normalized_shape: int or tuple; the trailing dimensions to normalize over
            eps: float; added to the variance for numerical stability (avoids division by zero)
            elementwise_affine: bool; whether to use a learnable affine transform
        """
        super().__init__()

        # Accept a bare int such as LayerNorm(d_model) by wrapping it in a tuple
        if isinstance(normalized_shape, int):
            normalized_shape = (normalized_shape,)
        self.normalized_shape = tuple(normalized_shape)
        self.eps = eps
        self.elementwise_affine = elementwise_affine

        if self.elementwise_affine:
            self.gamma = nn.Parameter(torch.ones(*self.normalized_shape))
            self.beta = nn.Parameter(torch.zeros(*self.normalized_shape))
        else:
            self.register_parameter('gamma', None)
            self.register_parameter('beta', None)

    def forward(self, x):
        """
        Args:
            x: input; for an LLM the shape is [batch_size, seq_len, *normalized_shape]
        """
        # Normalize over the last len(self.normalized_shape) dimensions,
        # e.g. dims=(-3, -2, -1) or dims=(-1,)
        dims = tuple(range(-len(self.normalized_shape), 0))

        # Mean and variance with the biased estimator (denominator n);
        # keepdim=True lets the results broadcast against x
        mean = x.mean(dim=dims, keepdim=True)
        var = x.var(dim=dims, unbiased=False, keepdim=True)

        # Standardize: variance -> standard deviation
        std = torch.sqrt(var + self.eps)
        normalized = (x - mean) / std

        # Affine transform
        if self.elementwise_affine:
            result = self.gamma * normalized + self.beta
        else:
            result = normalized

        return result

    def extra_repr(self):
        # Shown when the module is printed
        return f'normalized_shape: {self.normalized_shape}, eps: {self.eps}, elementwise_affine: {self.elementwise_affine}'
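
As a sanity check on the arithmetic, here is a minimal plain-Python sketch of layer normalization over a single feature vector. It follows the same recipe as the class above (biased variance, `eps` inside the square root); the input values and the function name `layer_norm` are illustrative.

```python
import math

def layer_norm(x, eps=1e-5, gamma=None, beta=None):
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n   # biased estimator: divide by n
    std = math.sqrt(var + eps)
    normed = [(v - mean) / std for v in x]
    # Optional element-wise affine transform
    if gamma is not None and beta is not None:
        normed = [g * v + b for g, v, b in zip(gamma, normed, beta)]
    return normed

y = layer_norm([1.0, 2.0, 3.0, 4.0])
y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
```

For this input the mean is 2.5 and the biased variance is 1.25, so the output has mean 0 and variance very close to 1 (slightly below, because of `eps`).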

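2.7 FFN (feed-forward network)

The Transformer's FFN consists of two fully connected layers, with dropout to reduce overfitting and ReLU as the activation function (GELU is a common alternative)
$FFN(x) = Linear_2(Dropout(ReLU(Linear_1(x))))$
It typically expands the dimension and then projects back down, e.g. 512 --> 4*512 --> 512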
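
The expand-then-project shape determines how many parameters the FFN carries. A short worked calculation, assuming both linear layers use a bias term; `ffn_param_count` is an illustrative helper name.

```python
def ffn_param_count(d_model, hidden_dim, bias=True):
    # Linear_1: d_model -> hidden_dim, Linear_2: hidden_dim -> d_model
    layer_1 = d_model * hidden_dim + (hidden_dim if bias else 0)
    layer_2 = hidden_dim * d_model + (d_model if bias else 0)
    return layer_1 + layer_2

params = ffn_param_count(512, 4 * 512)
params_no_bias = ffn_param_count(512, 4 * 512, bias=False)
```

For the 512 --> 2048 --> 512 example this comes to 2,099,712 parameters per FFN, of which 2,097,152 are weights and 2,560 are biases.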

Code:

# Requires torch.nn as nn and torch.nn.functional as F (imported above)
class FFN(nn.Module):
    def __init__(self, d_model, hidden_dim, dropout_rate=0.1):
        super().__init__()
        self.d_model = d_model
        self.hidden_dim = hidden_dim
        self.dropout_rate = dropout_rate

        self.layer_1 = nn.Linear(self.d_model, self.hidden_dim, bias=True)  # expand
        self.layer_2 = nn.Linear(self.hidden_dim, self.d_model, bias=True)  # project back
        self.dropout_layer = nn.Dropout(dropout_rate)

    def forward(self, x):
        # x: [batch_size, seq_len, d_model]
        result = self.layer_1(x)
        result = F.relu(result)
        result = self.dropout_layer(result)
        result = self.layer_2(result)

        return result

2.8 Encoder

The Encoder is a stack of Encoder Layers

2.8.1 Encoder Layer implementation

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, hidden_dim, dropout_rate=0.1):
        super().__init__()

        # The encoder does not need a causal mask
        self.attention = MultiHeadAttention(d_model, num_heads, is_causal=False)
        self.FFN = FFN(d_model, hidden_dim, dropout_rate=dropout_rate)

        # Each sublayer has its own LayerNorm and dropout
        self.norm_1 = LayerNorm(d_model)
        self.norm_2 = LayerNorm(d_model)
        self.dropout_1 = nn.Dropout(dropout_rate)
        self.dropout_2 = nn.Dropout(dropout_rate)

    def forward(self, x):
        # Pre-LN (the common choice in modern models): normalize first, then add the residual
        norm_x = self.norm_1(x)                               # layer norm
        attn_output = self.attention(norm_x, norm_x, norm_x)  # [batch, seq_len, d_model]
        x = x + self.dropout_1(attn_output)                   # residual connection + dropout

        norm_x = self.norm_2(x)
        ffn_output = self.FFN(norm_x)                         # [batch, seq_len, d_model]
        x = x + self.dropout_2(ffn_output)                    # residual connection + dropout

        return x
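
The difference between Pre-LN and the Post-LN arrangement of the original Transformer is only where the normalization sits relative to the residual addition. A minimal sketch with toy stand-ins: `center` (subtract the mean) plays the role of LayerNorm and `double` plays the role of the attention or FFN sublayer; both are assumptions chosen so the results can be checked by hand.

```python
def pre_ln_block(x, sublayer, norm):
    # Pre-LN: x + Sublayer(Norm(x)); the residual path itself is never normalized
    return [xi + si for xi, si in zip(x, sublayer(norm(x)))]

def post_ln_block(x, sublayer, norm):
    # Post-LN: Norm(x + Sublayer(x)); normalization is applied after the residual sum
    return norm([xi + si for xi, si in zip(x, sublayer(x))])

def center(x):
    # Toy stand-in for LayerNorm: subtract the mean only
    mean = sum(x) / len(x)
    return [v - mean for v in x]

def double(x):
    # Toy stand-in for an attention/FFN sublayer
    return [2 * v for v in x]

x = [1.0, 2.0, 3.0]
pre = pre_ln_block(x, double, center)    # [1, 2, 3] + 2 * [-1, 0, 1]
post = post_ln_block(x, double, center)  # center([3, 6, 9])
```

Because Pre-LN leaves the residual stream unnormalized, a stack of Pre-LN layers is usually followed by one final normalization.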

2.8.2 Encoder implementation

class Encoder(nn.Module):
    def __init__(self, d_model, num_heads, hidden_dim, n_layer=6, dropout_rate=0.1):
        super().__init__()

        self.layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, hidden_dim, dropout_rate) for _ in range(n_layer)
        ])
        # Final normalization, since Pre-LN layers leave the residual stream unnormalized
        self.norm = LayerNorm(d_model)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return self.norm(x)

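2.9 Decoder

Each Decoder layer takes two inputs: the embedding plus positional encoding of the output sequence, and the Encoder output
Each Decoder layer has two attention sublayers: one is masked (self-attention), and the other is unmasked and attends to the Encoder output (cross-attention)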
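
In the unmasked sublayer, the queries come from the decoder while the keys and values come from the encoder, so the two sequences may have different lengths. The following plain-Python sketch shows the resulting shapes; it omits the learned projections (Q = decoder states, K = V = encoder states), which is a simplification for illustration only.

```python
import math

def cross_attention(dec, enc):
    # dec: [tgt_len][d], enc: [src_len][d]; projections omitted for brevity
    d_k = len(enc[0])
    out, weights = [], []
    for q in dec:
        # One score per encoder position, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in enc]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        weights.append(w)
        # Weighted sum of the encoder vectors
        out.append([sum(wj * v[c] for wj, v in zip(w, enc)) for c in range(d_k)])
    return out, weights

dec = [[1.0, 0.0], [0.0, 1.0]]              # 2 target positions
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions
out, weights = cross_attention(dec, enc)
```

The weight matrix has shape [tgt_len, src_len] (here 2 x 3), and the output keeps the decoder's length, one vector per target position.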

2.9.1 Decoder Layer

class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, hidden_dim, dropout_rate=0.1):
        super().__init__()

        self.attention_masked = MultiHeadAttention(d_model, num_heads, is_causal=True)
        self.attention_nomask = MultiHeadAttention(d_model, num_heads, is_causal=False)

        self.norm_1 = LayerNorm(d_model)
        self.norm_2 = LayerNorm(d_model)
        self.norm_3 = LayerNorm(d_model)

        self.dropout_1 = nn.Dropout(dropout_rate)
        self.dropout_2 = nn.Dropout(dropout_rate)
        self.dropout_3 = nn.Dropout(dropout_rate)

        self.ffn = FFN(d_model, hidden_dim, dropout_rate)

    def forward(self, output, enc_out):
        # --- 1. Masked self-attention ---
        norm_output = self.norm_1(output)
        attn_output = self.attention_masked(norm_output, norm_output, norm_output)
        output = output + self.dropout_1(attn_output)  # residual connection

        # --- 2. Cross-attention ---
        norm_output = self.norm_2(output)
        cross_attn = self.attention_nomask(norm_output, enc_out, enc_out)  # K/V = enc_out
        output = output + self.dropout_2(cross_attn)  # residual connection

        # --- 3. Feed-forward network ---
        norm_output = self.norm_3(output)
        ffn_out = self.ffn(norm_output)
        output = output + self.dropout_3(ffn_out)  # residual connection

        return output

2.9.2 Decoder implementation

class Decoder(nn.Module):
    def __init__(self, d_model, num_heads, hidden_dim, n_layers=6, dropout_rate=0.1):
        super().__init__()
        self.layers = nn.ModuleList([
            DecoderLayer(d_model, num_heads, hidden_dim, dropout_rate) for _ in range(n_layers)
        ])
        self.norm = LayerNorm(d_model)

    def forward(self, output, enc_out):
        for layer in self.layers:
            # Reassign output so each layer receives the previous layer's result, not the original input
            output = layer(output, enc_out)
        return self.norm(output)

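2.10 Positional encoding

Attention by itself does not take the order of positions in the sequence into account, so positional information has to be added separately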
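
One common way to supply this information, and the one used in this section, is the sinusoidal encoding: $PE(pos, 2i) = \sin(pos / 10000^{2i/d_{model}})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d_{model}})$. The following plain-Python sketch evaluates it for a single position; `positional_encoding` is an illustrative name and `d_model = 4` is chosen only to keep the output short.

```python
import math

def positional_encoding(pos, d_model):
    # Even index 2i gets sin, the following odd index 2i+1 gets cos, same frequency
    pe = []
    for two_i in range(0, d_model, 2):
        angle = pos / (10000 ** (two_i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

pe0 = positional_encoding(0, 4)  # position 0: sin(0) = 0 and cos(0) = 1 in every pair
pe1 = positional_encoding(1, 4)  # position 1: angles 1 and 1/100
```

Low dimensions oscillate quickly with position and high dimensions slowly, so each position gets a distinct pattern that is simply added to the token embedding.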

2.10.1 Code implementation

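import math

class PositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model, dropout_rate=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout_rate)
        self.max_len = max_len
        self.d_model = d_model

        # Sinusoidal table of shape [max_len, d_model]; d_model is assumed to be even
        pe = torch.zeros(self.max_len, self.d_model)
        position = torch.arange(0, self.max_len).unsqueeze(1)  # [max_len, 1]

        # 1 / 10000^(2i / d_model) for each even index 2i, computed in log space
        div_term = torch.exp(
            torch.arange(0, self.d_model, 2) * (-math.log(10000.0) / self.d_model)
        )

        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions

        pe = pe.unsqueeze(0)  # [1, max_len, d_model], broadcasts over the batch
        # Buffer: saved and moved with the module, but not trained
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: token embeddings of shape [batch_size, seq_len, d_model]
        seq_len = x.size(1)

        if seq_len > self.max_len:
            raise ValueError(f"Sequence length {seq_len} exceeds the configured maximum {self.max_len}")

        x = x + self.pe[:, :seq_len]
        result = self.dropout(x)

        return result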

3. Assembling a Transformer from the components

With the components implemented above, we can build a complete Transformer model

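import torch
import torch.nn as nn
import torch.nn.functional as F

class Transformer(nn.Module):
  def __init__(self, args):
    super().__init__()
    self.args = args

    assert args.vocab_size is not None
    assert args.block_size is not None

    # Assumption: args also carries n_head and n_layer (these fields are not defined in the text above);
    # hidden_dim follows the 4 * d_model convention from section 2.7
    hidden_dim = 4 * args.n_embd

    self.transformer = nn.ModuleDict(dict(
      wte = nn.Embedding(args.vocab_size, args.n_embd),
      wpe = PositionalEncoding(args.block_size, args.n_embd, args.dropout),
      drop = nn.Dropout(args.dropout),
      encoder = Encoder(d_model=args.n_embd, num_heads=args.n_head, hidden_dim=hidden_dim,
                        n_layer=args.n_layer, dropout_rate=args.dropout),
      decoder = Decoder(d_model=args.n_embd, num_heads=args.n_head, hidden_dim=hidden_dim,
                        n_layers=args.n_layer, dropout_rate=args.dropout)
    ))

    # Linear head that projects to the vocabulary size
    self.lm_head = nn.Linear(args.n_embd, args.vocab_size, bias=True)

    self.apply(self._init_weights)

  def _init_weights(self, module):
    # Normal(0, 0.02) initialization for linear and embedding weights; zero biases
    if isinstance(module, nn.Linear):
      torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
      if module.bias is not None:
        torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
      torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

  def forward(self, idx, targets=None):
    # idx: token IDs of shape [batch_size, seq_len]
    b, t = idx.size()

    assert t <= self.args.block_size, f"Cannot forward sequence of length {t}, block size is only {self.args.block_size}"

    # Token embeddings: [b, t] -> [b, t, n_embd]
    tok_emb = self.transformer.wte(idx)

    # PositionalEncoding adds the sinusoidal table to the embeddings (it is not an index lookup)
    x = self.transformer.wpe(tok_emb)

    x = self.transformer.drop(x)

    enc_out = self.transformer.encoder(x)

    # The decoder needs both its own input and the encoder output;
    # as a simplification, the same embedded sequence is used as the decoder input here
    x = self.transformer.decoder(x, enc_out)

    if targets is not None:
      # Training: logits for every position, loss against the targets (label -1 is ignored)
      logits = self.lm_head(x)
      loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
    else:
      # Inference: only the last position is needed to predict the next token
      logits = self.lm_head(x[:, [-1], :])
      loss = None

    return logits, loss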

Posted 2025-06-16 20:56 by AiHorizon
