【自然语言处理】BERT模型 - 实践

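The history of natural language processing (NLP) has revolved around getting machines to understand the contextual meaning of human language more precisely. Before deep learning became widespread, rule-based and shallow statistical models struggled with complex context dependencies; as neural networks matured, pre-trained language models became the key to breaking that bottleneck. Before BERT, ELMo (early 2018) extracted contextual features with a bidirectional LSTM, but it was a feature-based approach whose two directions were trained separately and only concatenated afterwards. GPT adopted the Transformer architecture and end-to-end fine-tuning, yet it was limited to unidirectional language modeling (capturing context only from left to right), which made it hard to use the whole sentence when representing a token. With Google's release of BERT (Bidirectional Encoder Representations from Transformers) in late 2018, NLP entered the era of deep bidirectional understanding.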

BERT 的核心创新在于双向上下文建模：通过掩码语言模型（MLM）在预训练阶段随机遮盖部分 Token 并预测其原始值，使模型能同时关注左侧和右侧上下文，从而学习到深度双向语义表示。这一设计突破了传统模型的单向限制（如 GPT）或浅层拼接双向的局限（如 ELMo），配合“下一句预测”（NSP）任务，使模型在句子级和 Token 级任务上均实现性能跃升。作为基于 Transformer 编码器的预训练模型，BERT 通过“预训练+微调”范式，在问答、文本分类、情感分析、命名实体识别等十余项 NLP 任务中刷新基准成绩，成为现代 NLP 领域的里程碑。

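Since its release in 2018, BERT's influence has kept growing: on the Hugging Face platform it is downloaded more than 68 million times per month, which placed it second among all models on the Hub. In 2024, Hugging Face, NVIDIA, and other organizations jointly released an upgraded successor, ModernBERT. Trained on 2 trillion tokens with a modernized architecture, it extends the context length to 8192 tokens and comes in two sizes, 149 million parameters (base) and 395 million parameters (large). It further improves accuracy on tasks such as text classification and vector retrieval, which shows that the BERT architecture is still very much alive.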

二、BERT模型核心算法原理

BERT模型的核心算法原理建立在Transformer架构与预训练-微调范式的基础上，通过双向上下文捕捉、并行化序列处理和创新的预训练任务设计，实现了对自然语言的深度理解。其核心技术模块包括Transformer编码器结构、自注意力机制、输入表示方法及预训练任务设计，各模块协同作用形成了BERT的独特优势。

（一）Transformer编码器结构

BERT的核心架构基于Transformer的编码器部分，通过堆叠多层Transformer模块实现对双向上下文信息的捕捉。每一层编码器由多头自注意力机制和前馈神经网络（FFN） 两大核心组件构成，并通过残差连接和层归一化保障深层网络的训练稳定性。

多头自注意力机制通过并行计算多个自注意力头，将输入序列映射到不同的表示子空间，从而捕捉多维度的上下文关联。前馈神经网络则对每个位置的注意力输出进行非线性变换，进一步提炼特征。残差连接通过将输入直接添加到子层输出中，缓解了深层网络的梯度消失问题；层归一化则通过标准化各层输入分布，加速模型收敛。这种结构设计使得BERT能够替代传统循环神经网络（RNN）或长短期记忆网络（LSTM）的顺序处理模式，实现输入序列的并行化处理，显著提升了长距离依赖关系的捕捉效率。

（二）自注意力机制

自注意力机制是BERT实现上下文信息融合的核心技术，其通过计算序列中每个词与其他所有词的关联权重，实现全局上下文的并行建模。与RNN/LSTM的顺序处理不同，自注意力机制无需按时间步迭代，可直接通过矩阵运算一次性计算所有位置的依赖关系，大幅提升了计算效率。

自注意力机制的核心在于查询（Q）、键（K）、值（V） 的交互计算：通过将输入向量线性变换为Q、K、V矩阵，利用缩放点积注意力公式计算每个位置对其他位置的注意力权重，再通过softmax归一化得到权重分布，最终加权求和V矩阵得到注意力输出。多头注意力通过将Q、K、V分割为多个子空间并行计算，再拼接各头输出，增强了模型对不同上下文模式的捕捉能力。这种设计使得BERT能够同时从左到右和右到左方向理解文本，突破了传统单向语言模型的局限。
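The computation described above can be written compactly. The notation below ($X$ for the layer input, $W_i^{Q}, W_i^{K}, W_i^{V}, W^{O}$ for the projection matrices, $d_k$ for the per-head dimension, $h$ for the number of heads) follows the usual Transformer convention and is an assumption of this sketch, not notation defined elsewhere in this article:

```latex
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \\
\mathrm{head}_i &= \mathrm{Attention}\!\left(XW_i^{Q},\; XW_i^{K},\; XW_i^{V}\right), \qquad i = 1, \dots, h \\
\mathrm{MultiHead}(X) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^{O}
\end{aligned}
```

The softmax is applied row by row, so each row of the weight matrix is a probability distribution over all positions of the sequence, to the left and to the right of the current token alike.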

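Advantages of self-attention: whereas a CNN captures local dependencies through the limited receptive field of its kernels, self-attention directly models the relationship between any two positions in the sequence. Its computational cost grows quadratically with sequence length, which is higher than the linear cost of an RNN/LSTM, but all positions are computed in parallel and any two tokens are connected in a single step instead of through a chain of recurrent steps. In practice this makes training much faster on parallel hardware and long-range dependencies easier to learn, at the price of high memory and compute cost for very long texts.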

（三）输入表示方法

BERT的输入层通过融合三种嵌入表示（Token Embedding、Position Embedding、Segment Embedding）将原始文本转换为模型可处理的向量形式，同时引入特殊标记实现对任务需求的适配。

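• Token Embedding: maps each word or subword of the input text to a high-dimensional vector that carries lexical semantics. It is a lookup table indexed by the WordPiece vocabulary; the vectors are learned during pre-training and are identical for a given token in every sentence, while the context-dependent representation is produced by the encoder layers on top.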

• Position Embedding：由于Transformer架构本身不具备位置感知能力，BERT通过添加位置嵌入向量编码词语在序列中的位置信息，使模型能够区分不同位置的相同词汇。

• Segment Embedding：用于区分输入序列中的句子对（如问答任务中的问题与答案），通过为不同句子分配不同的嵌入向量，帮助模型理解句子边界和逻辑关系。

三种嵌入向量通过逐元素相加的方式融合，形成最终的输入表示。此外，BERT定义了三类特殊标记：[CLS] 位于序列起始位置，其对应的编码器输出用于下游分类任务；[SEP] 用于分隔句子对或标记单句结束；[MASK] 用于掩码语言模型（MLM）任务，标记被掩盖的词汇以实现双向语境学习。

（四）预训练任务设计

BERT的预训练阶段通过两个核心任务实现对双向语言表示的学习：掩码语言模型（MLM）和下一句预测（NSP），二者共同优化模型参数以捕捉词汇语义和句子关系。

1.掩码语言模型（MLM）

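The MLM task randomly selects 15% of the tokens in the input sequence and asks the model to predict their original values, which forces it to infer the hidden tokens from context on both sides and so to learn deep semantic relationships. The masking strategy is: 80% of the selected positions are replaced with the [MASK] token, 10% are replaced with a random token, and 10% keep the original token. This design mitigates the mismatch between pre-training and fine-tuning (fine-tuning inputs contain no [MASK] token) and, through random replacement, injects noise that makes the model more robust.

The MLM objective is to minimize the prediction loss: the encoder output at each selected position (whether it shows [MASK], a random token, or the original token) is projected onto the vocabulary and scored with cross-entropy, so that the model learns to recover the original token from its context.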

2.下一句预测（NSP）

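The NSP task strengthens the model's understanding of the logical relationship between sentences: given a pair of sentences (A and B), the model decides whether B is the sentence that follows A. In the training data, 50% of the pairs are consecutive sentences (positive examples) and 50% are randomly combined, non-consecutive sentences (negative examples). The model performs binary classification (yes/no) on the encoder output of the [CLS] token with a cross-entropy loss, which trains it to capture coherence and semantic relatedness between sentences.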

MLM与NSP的协同作用：MLM专注于词汇级语义建模，NSP则聚焦句子级关系学习，二者通过多任务学习框架联合优化，使BERT的预训练表示同时具备微观词汇理解和宏观篇章结构感知能力。
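The joint optimization described above can be written as a sum of two cross-entropy terms. The symbols are assumptions introduced for this sketch: $\tilde{x}$ is the corrupted input sequence, $M$ the set of selected positions, $x_i$ the original token at position $i$, and $y$ the NSP label of the sentence pair:

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{MLM}} &= -\frac{1}{|M|} \sum_{i \in M} \log p_{\theta}\!\left(x_i \mid \tilde{x}\right) \\
\mathcal{L}_{\mathrm{NSP}} &= -\log p_{\theta}\!\left(y \mid \tilde{x}\right) \\
\mathcal{L} &= \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{NSP}}
\end{aligned}
```

Only the positions in $M$ contribute to the MLM term, which is why the implementation later sets every other label to -100.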

通过上述四大模块的协同设计，BERT实现了对双向上下文信息的深度挖掘，其预训练的通用语言表示可通过微调快速适配各类下游任务，成为自然语言处理领域的基础性模型架构。

三、BERT模型的完整实现

（一）导入必要模块

import torch
import torch.nn as nn
import torch.utils.data as data
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR
import random
import os
import re
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import matplotlib.font_manager as fm  # 字体管理工具

（二）文本数据预处理

Step 1: Implementing the `create_mlm_mask` function

create_mlm_mask 是 BERT 预训练中掩码语言模型（MLM）任务的核心数据处理函数，用于对输入的 token 序列进行随机掩码处理，生成符合 MLM 任务要求的输入序列（input_ids）和对应的标签（labels）。其设计严格遵循 BERT 的 MLM 策略，通过随机替换部分 token，迫使模型学习上下文语义以预测被掩码的原始 token，从而提升模型的语义理解能力。

在 BERT 预训练的 MLM 任务中，模型需要根据上下文预测被 “掩码” 的 token。该函数的核心作用是：

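Select tokens in the sequence at random with a fixed probability (15%);
Apply one of three operations to each selected token (80% replaced with [MASK], 10% replaced with a random token, 10% left unchanged);
Build the matching labels (the original id at every selected position, -100 everywhere else so that those positions are ignored by the loss).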

这一过程是 MLM 任务的前置数据处理，直接影响模型能否有效学习上下文依赖关系。

def create_mlm_mask(input_ids, vocab_size, mask_token_id=103, pad_token_id=0):
    labels = input_ids.clone()
    probability_matrix = torch.full(labels.shape, 0.15)
    special_tokens_mask = (input_ids == pad_token_id)
    probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    input_ids[indices_replaced] = mask_token_id
    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    random_words = torch.randint(vocab_size, labels.shape, dtype=torch.long)
    input_ids[indices_random] = random_words[indices_random]
    return input_ids, labels

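| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `input_ids` | `torch.Tensor` | - | Input token-id sequence of shape `(batch_size, seq_len)` (e.g. `[101, 2182, 102, ...]`, including `[CLS]`, `[SEP]`, etc.). The tensor is modified in place, so pass a clone if the original ids are still needed. |
| `vocab_size` | `int` | - | Vocabulary size, used to draw random token ids. |
| `mask_token_id` | `int` | 103 | Id of the `[MASK]` special token (103 in the standard BERT vocabulary), written over the selected tokens. |
| `pad_token_id` | `int` | 0 | Id of the `[PAD]` padding token (0 in the standard BERT vocabulary); these positions are never masked, which avoids learning from meaningless padding. |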

步骤 1：初始化标签（保留原始 token ID）

labels = input_ids.clone()  # 复制原始input_ids作为标签基础

目的：MLM 的目标是预测被掩码 token 的原始值，因此标签初始化为原始input_ids的副本，后续仅修改非掩码位置的标签。

步骤 2：确定 token 被选中的概率（15% 基础概率）

probability_matrix = torch.full(labels.shape, 0.15)  # 所有位置初始概率为15%

目的：BERT 的 MLM 策略中，每个 token 有 15% 的概率被选中进行掩码处理（非绝对概率，通过后续随机采样实现）。

步骤 3：排除特殊 token（如 [PAD]）

special_tokens_mask = (input_ids == pad_token_id)  # 标记[PAD]位置（True表示是PAD）
probability_matrix.masked_fill_(special_tokens_mask, value=0.0)  # PAD位置的选中概率设为0

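Purpose: [PAD] is a filler token with no content, so its selection probability is forced to 0 and the model never has to predict padding. Note that this function excludes only [PAD]: [CLS] and [SEP] can still be selected, whereas the original BERT recipe also leaves those special tokens unmasked. Extending special_tokens_mask with the [CLS] and [SEP] ids would match that behavior.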

步骤 4：随机选择被掩码的 token（15% 概率采样）

masked_indices = torch.bernoulli(probability_matrix).bool()  # 伯努利采样，True表示被选中

原理：torch.bernoulli根据probability_matrix中的概率（15%）随机生成布尔矩阵，True表示该 token 被选中进行掩码处理（约 15% 的 token 会被选中）。

步骤 5：处理标签（仅保留被掩码位置的原始值）

labels[~masked_indices] = -100  # 未被掩码的位置标签设为-100（计算损失时会被忽略）

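Purpose: the MLM loss only covers the selected tokens (the positions the model must predict). Labels at all other positions are set to -100, which is the default ignore_index of PyTorch's CrossEntropyLoss, so those positions contribute nothing to the loss.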

步骤 6：80% 被选中 token 替换为 [MASK]

# 生成80%概率的掩码，与被选中位置取交集（得到需替换为[MASK]的位置）
indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
input_ids[indices_replaced] = mask_token_id  # 替换为[MASK]的ID

逻辑：在被选中的masked_indices中，80% 的位置会被明确替换为[MASK] token，这是 MLM 的主要掩码方式，迫使模型根据上下文预测原始 token。

步骤 7：10% 被选中 token 替换为随机 token

# 剩余20%被选中位置中，50%（即总选中数的10%）替换为随机token
indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
random_words = torch.randint(vocab_size, labels.shape, dtype=torch.long)  # 生成随机token ID
input_ids[indices_random] = random_words[indices_random]  # 替换为随机token

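Logic: among the selected positions that were not replaced with [MASK] (20% of the selected tokens), a further 50% are chosen at random and replaced with a random token from the vocabulary. That is 10% of the selected tokens, or roughly 1.5% of all tokens. This keeps the model from "being lazy" (only paying attention to positions that show [MASK]) and forces it to build a useful representation for every token from its context.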

步骤 8：剩余 10% 被选中 token 保持不变

逻辑：经过步骤 6 和 7 后，剩余的被选中 token（占总选中数的 10%）保持原始值不变。这一设计确保模型不能仅通过 “位置是否被修改” 来猜测答案，必须依赖上下文判断每个 token 是否为原始值。

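| Data | Shape | Description |
| --- | --- | --- |
| Input `input_ids` | `(batch_size, seq_len)` | Original token-id sequence (including `[CLS]`, `[SEP]`, `[PAD]`, etc.). |
| Output `input_ids` | `(batch_size, seq_len)` | Token-id sequence after masking (contains `[MASK]`, random tokens, and unchanged original tokens). |
| Output `labels` | `(batch_size, seq_len)` | MLM labels: the original token id at selected positions, -100 everywhere else (ignored by the loss). |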

create_mlm_mask 函数实现了 BERT 预训练中 MLM 任务的核心掩码逻辑，通过 15% 的选中概率、80%→[MASK]、10%→随机 token、10%→保持不变的分层处理策略，生成符合任务要求的输入和标签。这一过程是 MLM 任务的基础，直接影响模型能否学习到有效的上下文语义表示，是 BERT 预训练流程中不可或缺的数据预处理环节。
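The layered sampling is easier to trust after checking the resulting proportions empirically. The following is a minimal pure-Python sketch that replays the same decision sequence (a 15% selection, then an 80% draw, then a 50% draw on the remainder) without any tensors. The function name `mlm_decisions` and its parameters are hypothetical and exist only for this illustration:

```python
import random


def mlm_decisions(num_tokens, select_prob=0.15, seed=0):
    """Count how many tokens end up in each MLM branch (illustrative sketch)."""
    rng = random.Random(seed)
    counts = {"mask": 0, "random": 0, "keep": 0, "untouched": 0}
    for _ in range(num_tokens):
        if rng.random() >= select_prob:
            counts["untouched"] += 1   # not selected: label would be -100
        elif rng.random() < 0.8:
            counts["mask"] += 1        # selected and replaced by [MASK]
        elif rng.random() < 0.5:
            counts["random"] += 1      # selected and replaced by a random token
        else:
            counts["keep"] += 1        # selected but left unchanged
    return counts


num_tokens = 200_000
counts = mlm_decisions(num_tokens)
selected = counts["mask"] + counts["random"] + counts["keep"]
print(f"selected: {selected / num_tokens:.3f} of all tokens")
for branch in ("mask", "random", "keep"):
    print(f"{branch}: {counts[branch] / selected:.3f} of selected tokens")
```

The printed fractions should land close to 0.15 for the selection and close to 0.8 / 0.1 / 0.1 within the selected tokens, which confirms that the 50% draw in step 7 yields 10% of the selected tokens rather than 10% of all tokens.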

第2步：`TextProcessor` 类实现

TextProcessor 类是 BERT 预训练数据处理的核心组件，专门用于将原始文本转换为符合 BERT 模型输入格式的结构化数据，重点服务于 BERT 的下一句预测（NSP）任务和序列长度适配。其核心目标是：解决长文本截断、生成 NSP 所需的句子对、输出模型可直接使用的input_ids、attention_mask、token_type_ids三大核心输入。

BERT 模型的输入有严格格式要求（需包含特殊符号、固定长度、区分句子边界等），而原始文本通常是无结构的长文本，TextProcessor 承担了 “原始文本 → 模型输入” 的中间转换角色，是连接数据与模型的关键桥梁，直接影响 BERT 预训练的效果（尤其是 NSP 任务的样本质量）。

class TextProcessor:
    def __init__(self, tokenizer, max_seq_len=512):
        self.tokenizer = tokenizer
        self.max_seq_len = max_seq_len
        self.cls_token = tokenizer.cls_token
        self.sep_token = tokenizer.sep_token
        self.pad_token = tokenizer.pad_token
    def split_long_text(self, text, chunk_size=256, overlap=50):
        tokens = self.tokenizer.tokenize(text)
        chunks = []
        start = 0
        while start < len(tokens):
            end = start + chunk_size
            chunk_tokens = tokens[start:end]
            chunks.append(chunk_tokens)
            start = end - overlap
        return chunks
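    def create_sentence_pairs(self, chunks, prob_next=0.5):
        pairs = []
        for i in range(len(chunks) - 1):
            # Negative candidates: every chunk except chunk i and its true successor
            candidates = [j for j in range(len(chunks)) if j not in (i, i + 1)]
            if random.random() < prob_next or not candidates:
                # Positive example (label 0): chunk i followed by chunk i + 1.
                # Also taken when no negative candidate exists (only 2 chunks),
                # where rejection sampling would never terminate.
                sentence1 = chunks[i]
                sentence2 = chunks[i + 1]
                label = 0
            else:
                # Negative example (label 1): chunk i paired with a random other chunk
                sentence1 = chunks[i]
                sentence2 = chunks[random.choice(candidates)]
                label = 1
            pairs.append((sentence1, sentence2, label))
        return pairs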
    def tokenize_pair(self, sentence1, sentence2):
        # Reserve room for [CLS] and two [SEP], then trim the longer side token by
        # token so that both sentences and the closing [SEP] survive truncation
        sentence1, sentence2 = list(sentence1), list(sentence2)
        max_tokens = self.max_seq_len - 3
        while len(sentence1) + len(sentence2) > max_tokens:
            if len(sentence1) > len(sentence2):
                sentence1.pop()
            else:
                sentence2.pop()
        tokens = [self.cls_token] + sentence1 + [self.sep_token] + sentence2 + [self.sep_token]
        # Segment 0: [CLS] + sentence1 + first [SEP]; segment 1: sentence2 + last [SEP]
        token_type_ids = [0] * (len(sentence1) + 2) + [1] * (len(sentence2) + 1)
        input_ids = self.tokenizer.convert_tokens_to_ids(tokens)
        attention_mask = [1] * len(input_ids)
        # Pad all three fields to exactly max_seq_len
        padding_length = self.max_seq_len - len(input_ids)
        input_ids += [self.tokenizer.pad_token_id] * padding_length
        attention_mask += [0] * padding_length
        token_type_ids += [0] * padding_length
        return {
            "input_ids": torch.tensor(input_ids, dtype=torch.long),
            "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
            "token_type_ids": torch.tensor(token_type_ids, dtype=torch.long)
        }

1. Initializer `__init__`: base configuration and special tokens

功能详解：

接收外部传入的tokenizer（负责文本→token 的转换），确保分词逻辑与模型 vocab 一致；
max_seq_len 限制输入序列的最大长度（如代码中后续配置为 64/128），避免超出 BERT 预定义的max_position_embeddings（否则会触发 position embedding 索引错误）；
提前获取 BERT 必需的 3 个特殊符号，避免后续重复调用tokenizer，提升效率。

2. 长文本拆分 `split_long_text`：解决 BERT 序列长度限制

功能详解：

核心问题解决：BERT 对输入序列长度有严格限制（如默认 512），若原始文本过长（如段落、文档），直接分词会超出限制，因此需要拆分为多个短 token 块；
关键参数作用：
- chunk_size：每个 token 块的最大长度（通常小于max_seq_len，预留特殊符号位置）；
- overlap：相邻 token 块的重叠 token 数量（如 50），避免拆分时切断完整语义（例如一句话被拆成两段，重叠部分可保留上下文关联性）；
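Example: if the text has 1000 tokens after tokenization, then with chunk_size=256 and overlap=50 the chunks cover [0-255, 206-461, 412-667, 618-873, 824-999]. Each chunk repeats the last 50 tokens of the previous one, which reduces (but does not eliminate) the chance that a sentence is cut off from its context.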
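The chunk boundaries in this example can be reproduced with a few lines of arithmetic. The sketch below mirrors the sliding-window logic of split_long_text but works on token counts only; the helper name `chunk_spans` is hypothetical, and the guard against `overlap >= chunk_size` is an addition of this sketch (without it the window would never advance):

```python
def chunk_spans(num_tokens, chunk_size=256, overlap=50):
    """Return the inclusive (start, end) token indices of each chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    spans = []
    start = 0
    while start < num_tokens:
        end = min(start + chunk_size, num_tokens)  # exclusive end, clipped to the text
        spans.append((start, end - 1))
        start += chunk_size - overlap              # slide the window forward
    return spans


spans = chunk_spans(1000)
print(spans)  # [(0, 255), (206, 461), (412, 667), (618, 873), (824, 999)]
```

The step between two chunk starts is `chunk_size - overlap` (206 tokens here), so a smaller overlap yields fewer, less redundant chunks.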

3. 句子对生成 `create_sentence_pairs`：为 NSP 任务构建样本

功能详解：

服务于 BERT 的 NSP 任务：NSP（Next Sentence Prediction）是 BERT 预训练的两大任务之一，目标是让模型判断 “句子 2 是否是句子 1 的下一句”，该方法直接生成 NSP 所需的正负样本；
样本平衡设计：
- prob_next=0.5：控制正负例比例为 1:1，符合 BERT 预训练的样本分布要求，避免模型偏向某一类；
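- Negative examples never use chunk i itself or chunk i+1 as the second sentence, so a pair labeled "not next" is never actually the true continuation. Two caveats follow from the overlapping chunks: chunk i-1 can still be drawn as a negative although it shares 50 tokens with chunk i, and every positive pair starts its second sentence with the last 50 tokens of the first, which makes positives easier to recognize than in real sentence-level NSP data;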
输出格式：每个元素是(sentence1_tokens, sentence2_tokens, nsp_label)，为后续转换为模型输入奠定基础。

4. 句子对编码 `tokenize_pair`：生成 BERT 标准输入格式

功能详解：这是类的核心方法，最终输出 BERT 模型必需的三大输入，每个输入的作用与格式严格遵循 BERT 规范：

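| Output field | Purpose | Format |
| --- | --- | --- |
| `input_ids` | Maps each token to its index in the vocabulary; the main model input | Length = max_seq_len, integer vocabulary indices, padded with `pad_token_id` (e.g. 0) |
| `attention_mask` | Tells the model which tokens are real input (1) and which are PAD (0), so that PAD does not take part in attention | Length = max_seq_len, values 0/1, aligned one-to-one with input_ids |
| `token_type_ids` | Distinguishes the two sentences of a pair (0 = sentence 1, 1 = sentence 2); used by the NSP task | Length = max_seq_len, values 0/1, PAD positions filled with 0 (no effect on the model's decision) |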

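Key details:
- The token sequence is strictly [CLS] sentence 1 [SEP] sentence 2 [SEP]: the output at [CLS] is the classification vector (used for the NSP prediction), and [SEP] marks the sentence boundaries;
- Truncation and padding stay in sync: truncation has to keep tokens and token_type_ids aligned and should leave the closing [SEP] and at least part of sentence 2 in place (two 256-token chunks plus three special tokens already exceed a max_seq_len of 512), and padding brings all three fields to exactly max_seq_len so that batches stack without shape mismatches;
- The tensors are created as torch.long: token ids are used as indices into nn.Embedding, which expects integer index tensors, and the label tensors later passed to CrossEntropyLoss must be long. Stating the dtype explicitly keeps it from depending on how the Python lists were built.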
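To make the three output fields concrete, here is a small pure-Python sketch that encodes a toy sentence pair into plain lists. It assumes a longest-first truncation strategy that always keeps [CLS] and both [SEP] tokens; the function `encode_pair` and the vocabulary `toy_vocab` are hypothetical and only serve this illustration:

```python
def encode_pair(sentence1, sentence2, vocab, max_seq_len):
    """Build input_ids, attention_mask and token_type_ids as plain lists."""
    s1, s2 = list(sentence1), list(sentence2)
    # Keep 3 slots for [CLS] and the two [SEP]; trim the longer sentence first
    while len(s1) + len(s2) > max_seq_len - 3:
        (s1 if len(s1) > len(s2) else s2).pop()
    tokens = ["[CLS]"] + s1 + ["[SEP]"] + s2 + ["[SEP]"]
    token_type_ids = [0] * (len(s1) + 2) + [1] * (len(s2) + 1)
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    pad = max_seq_len - len(input_ids)
    return {
        "tokens": tokens + ["[PAD]"] * pad,
        "input_ids": input_ids + [vocab["[PAD]"]] * pad,
        "attention_mask": [1] * len(input_ids) + [0] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
    }


toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
             "trade": 5, "grows": 6, "markets": 7, "expand": 8, "fast": 9}

features = encode_pair(["trade", "grows"], ["markets", "expand", "fast"],
                       toy_vocab, max_seq_len=10)
for name, values in features.items():
    print(f"{name:15s} {values}")
# input_ids       [2, 5, 6, 3, 7, 8, 9, 3, 0, 0]
# attention_mask  [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# token_type_ids  [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
```

With a shorter limit such as `max_seq_len=7`, the sketch drops the last token of the longer sentence and still ends the sequence with [SEP].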

假设输入原始文本为一段关于 “经济全球化” 的长文，TextProcessor 的处理流程如下：

长文本拆分：split_long_text 将长文分词后拆分为多个 token 块（如chunk1, chunk2, chunk3, chunk4）；
生成句子对：create_sentence_pairs 生成 NSP 样本（如(chunk1, chunk2, 0)、(chunk1, chunk3, 1)等）；
编码为模型输入：tokenize_pair 将每个句子对转换为input_ids、attention_mask、token_type_ids的张量格式；
输出给数据集：最终结果被TextDataset类收集，组成训练数据集，供 BERT 模型训练使用。

第3步：TextDataset类实现

TextDataset 是基于 PyTorch data.Dataset 抽象类的自定义数据集类，专门为 BERT 预训练任务（同时支持掩码语言模型 MLM和下一句预测 NSP）提供结构化样本。其核心功能是：将原始文本文件（或兜底的示例文本）通过加载、预处理、任务标签生成，最终输出 BERT 模型可直接训练的样本格式，是连接原始数据与训练流程的关键组件。

BERT 预训练需要两类核心监督信号（MLM 的掩码恢复标签、NSP 的句子关系标签），且输入格式需严格匹配模型要求（input_ids、attention_mask等）。TextDataset 封装了从 “文本文件” 到 “带双任务标签的模型输入” 的全流程处理，解决了数据加载容错、样本质量控制、任务标签生成三大核心问题，确保训练数据的可用性和一致性。

class TextDataset(data.Dataset):
    def __init__(self, txt_file_path, tokenizer, max_seq_len=512, min_chunk_len=50):
        self.txt_file_path = txt_file_path
        self.tokenizer = tokenizer
        self.processor = TextProcessor(tokenizer, max_seq_len)
        self.min_chunk_len = min_chunk_len
        self.data = self._load_and_preprocess()
    def _load_and_preprocess(self):
        try:
            with open(self.txt_file_path, 'r', encoding='utf-8') as f:
                text = f.read().replace('\n', ' ').strip()
        except FileNotFoundError:
            print(f"未找到 {self.txt_file_path}，使用内置示例文本进行训练")
            text = """
            Economic globalization refers to the increasing interdependence of world economies
            as a result of the growing scale of cross-border trade of commodities and services,
            flow of international capital and wide and rapid spread of technologies.
            It reflects the continuing expansion and mutual integration of market frontiers,
            and is an irreversible trend for the economic development in the whole world at the turn of the millennium.
            The rapid growing significance of information in all types of productive activities
            and marketization are the two major driving forces for economic globalization.
            In other words, the fast globalization of the world’s economies in recent years
            is largely based on the rapid development of science and technologies,
            has resulted from the environment in which market economic system has been fast spreading throughout the world.
            """
        chunks = self.processor.split_long_text(text)
        chunks = [c for c in chunks if len(c) >= self.min_chunk_len]
        if len(chunks) < 2:
            raise ValueError("文本过短，无法生成足够的句子对")
        pairs = self.processor.create_sentence_pairs(chunks)
        samples = []
        for s1, s2, nsp_label in pairs:
            features = self.processor.tokenize_pair(s1, s2)
            features["nsp_label"] = torch.tensor(nsp_label, dtype=torch.long)
            samples.append(features)
        return samples
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        sample = self.data[idx]
        input_ids = sample["input_ids"].clone()
        input_ids, mlm_labels = create_mlm_mask(
            input_ids,
            vocab_size=self.tokenizer.vocab_size,
            mask_token_id=self.tokenizer.mask_token_id,
            pad_token_id=self.tokenizer.pad_token_id
        )
        return {
            "input_ids": input_ids,
            "attention_mask": sample["attention_mask"],
            "token_type_ids": sample["token_type_ids"],
            "mlm_labels": mlm_labels,
            "nsp_labels": sample["nsp_label"]
        }

1. Initializer `__init__`: base configuration of the dataset

关键作用：

接收外部参数，明确数据来源（txt_file_path）、分词逻辑（tokenizer）、序列长度限制（max_seq_len）和样本质量阈值（min_chunk_len）；
直接调用 _load_and_preprocess 完成数据预处理，将结果存入 self.data，后续通过 __getitem__ 索引访问，避免重复处理。

2. 核心预处理方法 `_load_and_preprocess`：文本→初始样本（含 NSP 标签）

该方法是数据集的 “数据生产线”，分 6 步将原始文本转换为带 NSP 标签的初始样本，具体流程如下：

步骤 1：加载文本（含容错处理）

细节说明：

replace('\n', ' ')：将文本中的换行符替换为空格，避免换行符被分词器误判为特殊符号，保证文本语义连贯；
容错机制：即使用户未提供文本文件，也能通过内置示例文本继续训练，降低使用门槛。

步骤 2：拆分长文本为 token 块（复用 TextProcessor）

作用：

直接复用 TextProcessor 的 split_long_text 逻辑，将原始文本分词后拆分为多个重叠的 token 块（如chunk_size=256、overlap=50），解决 BERT 序列长度限制问题。

步骤 3：过滤短 token 块（保证样本质量）

核心目的：

过滤过短的 token 块（如长度 <50）：这类块语义不完整，生成的句子对（如 “仅 3 个 token 的句 1+5 个 token 的句 2”）会导致 NSP 任务样本质量低，影响模型对 “句子关系” 的学习。

步骤 4：检查块数量（确保 NSP 任务可生成）

必要性：

NSP 任务需要至少 2 个 token 块才能生成句子对（如chunk1&chunk2），若块数量 < 2，无法构建 NSP 样本，直接抛出错误提示用户补充文本。

步骤 5：生成 NSP 句子对（复用 TextProcessor）

作用：

复用 TextProcessor 的 create_sentence_pairs 逻辑，生成 NSP 任务所需的正负样本（正例：连续块，标签 0；负例：随机块，标签 1），保证正负例比例约 1:1。

步骤 6：编码句子对 + 添加 NSP 标签（生成初始样本）

输出格式说明：

此时的 samples 已具备 NSP 任务所需的全部输入，但缺少 MLM 任务的掩码和标签，后续在 __getitem__ 中补充。

3. Length method `__len__`: required by the Dataset abstract class

作用：

告知 PyTorch 数据集的总样本数，是 DataLoader 批量加载数据（如计算 batch 数量、是否打乱）的基础。

4. Index method `__getitem__`: builds the final training sample (with MLM labels)

核心逻辑与细节：这是为 MLM 任务补充标签的关键步骤，需重点理解 create_mlm_mask 的作用（已在之前代码中定义，此处复用）：

MLM 掩码规则（遵循 BERT 原始论文）：
1. 随机选择 15% 的 token 进行处理；
2. 其中 80% 的 token 替换为 [MASK]（mask_token_id）；
3. 其中 10% 的 token 替换为随机 token（从 vocab 中随机选）；
4. 其中 10% 的 token 保持不变（用于稳定训练）；
5. PAD 位置不参与掩码（pad_token_id 对应的位置跳过）。
mlm_labels 的特殊性：
- 仅被掩码 / 替换的位置保留原始 token 的 id（作为监督信号）；
- 未被选中的位置设为 -100（PyTorch 的CrossEntropyLoss会自动忽略 -100 标签，避免无意义的监督）。
克隆 input_ids 的原因：
- 若直接修改 sample["input_ids"]，会导致原始样本被污染（后续其他索引访问时会拿到掩码后的结果），clone() 确保每个样本的原始数据独立。

__getitem__ 返回的每个样本是一个字典，包含 5 个关键字段，完全对应 BertForPretraining 模型的输入参数，无需额外转换即可直接传入模型训练：

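| Field | Type | Description |
| --- | --- | --- |
| `input_ids` | torch.LongTensor | Token-index sequence after masking; the main model input (length = max_seq_len) |
| `attention_mask` | torch.LongTensor | Attention mask: 1 marks a real token (not PAD), 0 marks PAD (keeps the model from attending to padding) |
| `token_type_ids` | torch.LongTensor | Sentence marker: 0 for sentence 1 and its special tokens, 1 for sentence 2 and its special tokens (used by NSP) |
| `mlm_labels` | torch.LongTensor | MLM labels: a valid id only at the roughly 15% of positions that were selected, -100 elsewhere (ignored by the loss) |
| `nsp_labels` | torch.LongTensor | NSP label: 0 means "sentence 2 follows sentence 1", 1 means "it does not" |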

以 “Economic Globalization.txt文本文件” 为例，TextDataset 的完整处理流程如下：

加载文本：读取文件→消除换行符→得到连续长文本；
拆分块：长文本分词→拆分为重叠 token 块→过滤短块（如保留长度≥50 的块）；
生成 NSP 样本：token 块→生成（句 1, 句 2, 标签）对→编码为初始样本（含 NSP 标签）；
生成 MLM 样本：索引访问时克隆 input_ids→掩码处理→补充 mlm_labels；
输出模型输入：返回含双任务标签的完整样本，供 DataLoader 批量加载。

（三）BERT模型完整实现

第1步：BertManualTokenizer类实现

BertManualTokenizer 是一个自定义的 BERT 风格分词器，实现了与原生 BERT 分词器相似的核心功能：将原始文本转换为模型可识别的token序列，并支持token与ID的双向映射。它专为适配代码中的 BERT 预训练任务设计，通过WordPiece 分词算法处理未登录词（OOV），并包含 BERT 必需的特殊符号，确保文本输入符合模型的格式要求。

在 BERT 模型中，分词器是连接原始文本与模型输入的 “翻译官”：它将人类可理解的文本转换为模型可处理的token序列，再映射为词汇表中的整数 ID。BertManualTokenizer 手动实现了这一过程，避免依赖第三方库（如 Hugging Face 的tokenizers），同时针对示例文本（经济全球化相关内容）优化了词汇表，确保分词效果适配训练数据。

class BertManualTokenizer:
    def __init__(self):
        self.vocab = self.create_basic_vocab()
        self.ids_to_tokens = {v: k for k, v in self.vocab.items()}
        # 特殊符号定义
        self.unk_token = "[UNK]"
        self.cls_token = "[CLS]"
        self.sep_token = "[SEP]"
        self.mask_token = "[MASK]"
        self.pad_token = "[PAD]"
        # 特殊符号ID
        self.unk_token_id = self.vocab[self.unk_token]
        self.cls_token_id = self.vocab[self.cls_token]
        self.sep_token_id = self.vocab[self.sep_token]
        self.mask_token_id = self.vocab[self.mask_token]
        self.pad_token_id = self.vocab[self.pad_token]
        self.basic_tokenizer = self._create_basic_tokenizer()
    def create_basic_vocab(self):
        """扩展词汇表，确保测试单词和子词覆盖"""
        vocab = {}
        # 1. 特殊符号
        special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
        for token in special_tokens:
            vocab[token] = len(vocab)
        # 2. 单个字母（小写，确保子词拆分兜底）
        for c in "abcdefghijklmnopqrstuvwxyz":
            vocab[c] = len(vocab)
        # 3. 常用单词（覆盖测试文本和示例文本）
        common_words = [
            "hello", "world", "this", "is", "a", "test", "of", "the", "bert", "tokenizer",
            "economic", "globalization", "refers", "to", "increasing", "interdependence",
            "world", "economies", "result", "growing", "scale", "cross-border", "trade",
            "commodities", "services", "flow", "international", "capital", "wide", "rapid",
            "spread", "technologies", "reflects", "continuing", "expansion", "mutual",
            "integration", "market", "frontiers", "irreversible", "trend", "development",
            "millennium", "rapid", "growing", "significance", "information", "productive",
            "activities", "marketization", "major", "driving", "forces", "other", "words",
            "fast", "largely", "based", "science", "technologies", "resulted", "environment",
            "system", "spreading", "throughout"
        ]
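        # dict.fromkeys removes duplicates while keeping list order, so token ids are
        # the same in every run (iterating a set of strings can change order per process)
        for word in dict.fromkeys(common_words):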
            if word not in vocab:
                vocab[word] = len(vocab)
        # 4. 常用子词（带##前缀，覆盖单词后缀）
        subwords = ["##s", "##ing", "##ed", "##ly", "##er", "##est", "##o", "##r", "##ld",
                    "##tion", "##ment", "##ity", "##ive", "##ize", "##al", "##able", "##ible"]
        for subword in subwords:
            if subword not in vocab:
                vocab[subword] = len(vocab)
        return vocab
    def _create_basic_tokenizer(self):
        """简化正则，拆分单词、标点"""
        pattern = r"(\w+|[^\w\s])"
        return re.compile(pattern)
    def basic_tokenize(self, text):
        """基础分词：小写+拆分"""
        text = text.lower().strip()
        return self.basic_tokenizer.findall(text)
    def wordpiece_tokenize(self, token):
        """WordPiece分词+兜底（拆分为单个字母）"""
        sub_tokens = []
        start = 0
        n = len(token)
        while start < n:
            end = n
            found = False
            while end > start:
                substr = token[start:end]
                if start > 0:
                    substr = f"##{substr}"
                if substr in self.vocab:
                    sub_tokens.append(substr)
                    start = end
                    found = True
                    break
                end -= 1
            if not found:
                sub_tokens.append(token[start])  # Fallback: emit one character; letters are in the vocab, digits and punctuation later map to [UNK]
                start += 1
        return sub_tokens
    def tokenize(self, text):
        """完整分词流程"""
        tokens = []
        for basic_token in self.basic_tokenize(text):
            tokens.extend(self.wordpiece_tokenize(basic_token))
        return tokens
    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(token, self.unk_token_id) for token in tokens]
    def convert_ids_to_tokens(self, ids):
        return [self.ids_to_tokens.get(id, self.unk_token) for id in ids]
    @property
    def vocab_size(self):
        return len(self.vocab)

1. Initializer `__init__`: base configuration of the tokenizer

核心作用：

构建词汇表及其反向映射，为token与ID的转换提供基础；
定义 BERT 预训练必需的 5 个特殊符号（功能见下表），并绑定其 ID；
初始化基础分词器，用于文本的初步拆分（单词、标点分离）。

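| Special token | Purpose |
| --- | --- |
| `[PAD]` | Pads short sequences to a fixed length so that batched inputs have the same shape |
| `[UNK]` | Stands for a token that is not in the vocabulary |
| `[CLS]` | Start-of-sequence marker; its output vector is used for the NSP prediction |
| `[SEP]` | Separates the sentences of a pair (e.g. "sentence 1 [SEP] sentence 2 [SEP]") and marks sentence boundaries |
| `[MASK]` | Replaces selected tokens in the MLM task so that the model predicts the original token |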

2. 词汇表构建 `create_basic_vocab`：覆盖核心 token

该方法手动构建词汇表，确保对训练文本（经济全球化相关内容）的高覆盖率，减少[UNK]的出现。词汇表包含 4 类核心 token：

设计逻辑：

特殊符号优先：确保[PAD]等核心符号的 ID 固定，避免模型输入混乱；
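Single-letter fallback: when no subword matches, the word is split into single lowercase letters (all of which are in the vocabulary), so purely alphabetic English words never become [UNK]. Digits, punctuation, and non-ASCII characters are not in the vocabulary and still map to [UNK];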
针对性覆盖：常用单词和子词均来自示例文本（经济全球化相关），最大化分词准确率。

3. 基础分词器 `_create_basic_tokenizer` 与 `basic_tokenize`：文本初步拆分

功能详解：

基础分词流程：先将文本标准化（小写、去空格），再通过正则拆分出 “单词” 和 “标点符号”；
示例：输入 "Hello, world!" → 处理后为 ["hello", ",", "world", "!"]，为后续子词拆分提供基础单元。
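The behavior of the regular expression is easy to check in isolation. This short sketch reuses the same pattern as _create_basic_tokenizer outside the class; the standalone function below is a simplified stand-in for the method of the same name:

```python
import re

# Same pattern as in the tokenizer: a run of word characters, or one non-space symbol
pattern = re.compile(r"(\w+|[^\w\s])")


def basic_tokenize(text):
    """Lowercase the text, then split it into words and punctuation marks."""
    return pattern.findall(text.lower().strip())


print(basic_tokenize("Hello, world!"))       # ['hello', ',', 'world', '!']
print(basic_tokenize("cross-border trade"))  # ['cross', '-', 'border', 'trade']
```

The second call shows a side effect of the pattern: a hyphen is not a word character, so a hyphenated vocabulary entry such as "cross-border" can never be produced by this tokenizer and its parts are looked up separately.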

4. WordPiece 分词 `wordpiece_tokenize`：处理未登录词的核心算法

BERT 的核心分词逻辑，将基础 token 拆分为词汇表中存在的子词（带##前缀表示非起始子词），解决未登录词问题：

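Algorithm (using "trending" as the example, which is not in the vocabulary as a whole word):

Try the longest substring "trending" → not in the vocabulary, so shorten it to "trendin" → "trendi" → "trend" (in the vocabulary);
The remainder "ing" starts at start=5 and gets the ## prefix → "##ing" is in the vocabulary and matches immediately;
Final result: ["trend", "##ing"] (both in the vocabulary). A word that is itself in the vocabulary, such as "globalization", matches on the first attempt and is returned unsplit.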

优势：通过子词组合，用有限的词汇表覆盖大量未登录词，平衡词汇表大小和分词准确率。
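The greedy longest-match-first procedure can be tried on a toy vocabulary. The sketch below is a standalone variant written for illustration: unlike the class above, it follows the common WordPiece convention of returning [UNK] for the whole word when some remainder cannot be matched, instead of falling back to single letters. The names `wordpiece` and `toy_vocab` are hypothetical:

```python
def wordpiece(token, vocab):
    """Greedy longest-match-first split; continuation pieces carry a ## prefix."""
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        while end > start:
            piece = token[start:end]
            if start > 0:
                piece = "##" + piece     # mark pieces that continue a word
            if piece in vocab:
                match = piece
                break
            end -= 1                     # shorten the candidate by one character
        if match is None:
            return ["[UNK]"]             # no piece fits: give up on the whole word
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"trend", "spread", "globalization", "##ing", "##s"}
print(wordpiece("trending", toy_vocab))       # ['trend', '##ing']
print(wordpiece("spreads", toy_vocab))        # ['spread', '##s']
print(wordpiece("globalization", toy_vocab))  # ['globalization']
print(wordpiece("trendy", toy_vocab))         # ['[UNK]']
```

Whether to fall back to single characters or to [UNK] is a design choice: the character fallback keeps more surface information, while the [UNK] convention keeps the output vocabulary free of stray single letters.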

5. 完整分词流程 `tokenize`：串联基础分词与 WordPiece 拆分

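Example: input text "Economic globalization!" → full tokenization pipeline:

basic_tokenize → ["economic", "globalization", "!"];
wordpiece_tokenize("economic") → ["economic"] (in the vocabulary);
wordpiece_tokenize("globalization") → ["globalization"] (also in the vocabulary as a whole word);
wordpiece_tokenize("!") → ["!"] (punctuation is passed through; since it is not in the vocabulary, convert_tokens_to_ids later maps it to [UNK]);
Final result: ["economic", "globalization", "!"].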

6. token 与 ID 的双向转换：`convert_tokens_to_ids` 与 `convert_ids_to_tokens`

作用：

是连接 “人类可读 token” 与 “模型可读整数 ID” 的桥梁；
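Example: ["[CLS]", "[MASK]", "!"] → [2, 4, 1]. The special-token ids are fixed by insertion order ([PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4), and "!" is not in the vocabulary, so it maps to the [UNK] id.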

7. 词汇表大小 `vocab_size`：模型配置的必要参数

作用：BERT 的词嵌入层（word_embeddings）需要根据词汇表大小初始化（nn.Embedding(vocab_size, hidden_size)），该属性为模型提供必要的配置信息。

以示例文本 "Hello, world! This is a test." 为例，完整流程如下：

文本标准化：basic_tokenize → 转为小写并拆分 → ["hello", ",", "world", "!", "this", "is", "a", "test", "."]；
WordPiece 拆分：wordpiece_tokenize 逐个处理 → 假设所有基础 token 均在词汇表中，结果不变；
添加特殊符号：后续在TextProcessor中补充[CLS]和[SEP] → ["[CLS]", "hello", ",", "world", "!", "this", "is", "a", "test", ".", "[SEP]"]；
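Convert to ids: convert_tokens_to_ids → an integer id sequence that serves as the model input. [CLS] becomes 2 and [SEP] becomes 3, the three punctuation marks become 1 ([UNK]) because the vocabulary contains no punctuation, and the ids of the words depend on the order in which the vocabulary was built.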

第2步：BertConfig类实现

BertConfig 是 BERT 模型的配置类，用于集中管理 BERT 模型的核心超参数和结构参数。它定义了模型各组件（如嵌入层、注意力层、前馈网络等）的关键属性，是构建 BERT 模型的 “蓝图”，确保模型各部分在初始化时能基于统一的参数协同工作，同时支持灵活调整模型规模（如基础版、大型版）。

BERT 模型的结构复杂（包含嵌入层、多层 Transformer、注意力机制等），其性能和规模由多个关键参数共同决定。BertConfig 的核心作用是：将这些参数集中存储和管理，避免参数分散在模型代码的各个部分导致混乱；同时提供统一的参数访问接口，方便在构建模型时快速获取所需配置（如嵌入层维度、注意力头数等）。

无论是构建基础版 BERT（base）还是大型版 BERT（large），只需修改BertConfig的参数即可，无需改动模型结构代码，极大提升了模型的可扩展性。

class BertConfig:
    def __init__(self, vocab_size=30522, hidden_size=768, max_position_embeddings=512,
                 type_vocab_size=2, num_heads=12, intermediate_size=3072,
                 num_hidden_layers=12, dropout=0.1):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.num_heads = num_heads
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.dropout = dropout

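The __init__ method of BertConfig defines 8 core parameters. Each one corresponds to a key property of one component of the BERT model; they are described below, grouped by model structure: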

1. 嵌入层相关参数（文本→向量的基础转换）

vocab_size：词汇表大小（默认 30522）对应 BERT 的word_embeddings层（词嵌入），决定了词嵌入矩阵的行数（vocab_size × hidden_size）。需与分词器的词汇表大小严格一致（如BertManualTokenizer的vocab_size），否则会出现 “词索引超出嵌入层范围” 的错误。
max_position_embeddings：最大位置嵌入长度（默认 512）对应position_embeddings层（位置嵌入），表示模型支持的最长输入序列长度。超过该长度的文本会被截断（如TextProcessor中的max_seq_len通常设为该值），确保位置索引不超出嵌入层范围。
type_vocab_size：句子类型词汇表大小（默认 2）对应token_type_embeddings层（句子类型嵌入），用于区分句子对中的两个句子（如token_type_ids的 0 和 1）。BERT 中仅需区分 “句 1” 和 “句 2”，因此默认值为 2。

2. Transformer 层通用参数（模型核心表示能力）

hidden_size：隐藏层维度（默认 768）这是 BERT 最核心的维度参数，决定了模型的基础表示能力。所有 Transformer 层的输入 / 输出维度、注意力机制的query/key/value维度均基于此值（如query维度 = hidden_size / num_heads）。基础版 BERT 用 768，大型版用 1024。
dropout：dropout 概率（默认 0.1）用于模型各层的正则化（如嵌入层后、注意力输出后、前馈网络输出后），防止过拟合。值越大，正则化强度越高（但可能导致欠拟合）。

3. 多头注意力机制参数（特征交互能力）

num_heads：注意力头数（默认 12）多头注意力将hidden_size维度的特征拆分为num_heads个并行的子空间（每个子空间维度 = hidden_size / num_heads），增强模型对不同特征的捕捉能力。基础版 BERT 用 12 头，大型版用 16 头，需满足hidden_size % num_heads == 0（否则子空间维度非整数）。

4. 前馈网络参数（特征转换能力）

intermediate_size：前馈网络中间层维度（默认 3072）BERT 的 Transformer 层中，前馈网络采用 “hidden_size → intermediate_size → hidden_size” 的结构，中间层维度通常是hidden_size的 4 倍（768×4=3072），为特征提供更丰富的非线性转换。

5. 模型深度参数（特征抽象能力）

num_hidden_layers：Transformer 隐藏层数（默认 12）决定模型的深度，层数越多，模型对特征的抽象能力越强（但训练难度和计算成本越高）。基础版 BERT 用 12 层，大型版用 24 层。

BertConfig是 BERT 模型的 “配置中枢”，通过定义词汇表大小、隐藏层维度、注意力头数等核心参数，为模型的构建提供了明确的 “施工图纸”。它不仅确保了模型各组件的兼容性，还支持灵活调整模型规模，是 BERT 实现模块化、可扩展设计的关键组件。
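One way to see what these hyperparameters mean in practice is to count the weights they imply. The sketch below is back-of-the-envelope arithmetic, not part of the model code. It assumes the standard BERT encoder layout (four hidden-size linear maps plus LayerNorm in the attention block, a two-layer feed-forward block plus LayerNorm, and a pooler on top), of which only the embedding and self-attention parts appear in this article, and it leaves out the MLM and NSP heads. The function name `bert_parameter_count` is hypothetical:

```python
def bert_parameter_count(vocab_size=30522, hidden_size=768, max_position_embeddings=512,
                         type_vocab_size=2, intermediate_size=3072, num_hidden_layers=12):
    """Estimate the number of encoder parameters implied by a BertConfig-like setup."""

    def linear(n_in, n_out):
        return n_in * n_out + n_out          # weight matrix plus bias

    layer_norm = 2 * hidden_size             # scale and shift vectors

    # Word, position and token-type tables share the hidden size
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size
    embeddings += layer_norm

    # Query, key, value and output projections, followed by LayerNorm
    attention = 4 * linear(hidden_size, hidden_size) + layer_norm

    # hidden -> intermediate -> hidden, followed by LayerNorm
    feed_forward = (linear(hidden_size, intermediate_size)
                    + linear(intermediate_size, hidden_size) + layer_norm)

    pooler = linear(hidden_size, hidden_size)
    return embeddings + num_hidden_layers * (attention + feed_forward) + pooler


base = bert_parameter_count()
large = bert_parameter_count(hidden_size=1024, intermediate_size=4096, num_hidden_layers=24)
print(f"base:  {base:,}")
print(f"large: {large:,}")
```

Note that num_heads does not appear in the calculation: splitting hidden_size into more or fewer heads changes how the projections are partitioned, not how many weights they contain.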

Step 3: Implementing the `BertEmbeddings` class

BertEmbeddings 是 BERT 模型的嵌入层模块，继承自 PyTorch 的nn.Module。它的核心功能是将离散的文本输入（如input_ids、token_type_ids）转换为连续的向量表示，同时编码文本的词汇信息、位置信息和句子边界信息，为后续的 Transformer 层提供初始的语义向量输入。

在 BERT 模型中，嵌入层是连接 “离散符号” 与 “连续向量” 的桥梁。原始文本经过分词后得到的是整数 ID（如input_ids），这些 ID 本身不包含语义信息，BertEmbeddings 通过三个子嵌入层的组合，将这些 ID 转换为富含语义的向量，并编码语言的序列特性（词序）和句子结构（句子边界），为模型后续的特征提取（Transformer 层）奠定基础。

class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.dropout)
        self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
    def forward(self, input_ids, token_type_ids=None):
        position_ids = self.position_ids[:, :input_ids.size(1)]
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        embeddings = self.word_embeddings(input_ids) + self.position_embeddings(
            position_ids) + self.token_type_embeddings(token_type_ids)
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

1. Initializer `__init__`: core components of the embedding layer

核心组件解析：

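| Component | Purpose |
| --- | --- |
| `word_embeddings` | Converts `input_ids` (integer token ids) to word vectors, the basic carrier of meaning. `padding_idx=0` keeps the `[PAD]` vector at zero and excludes it from gradient updates, so padding positions do not disturb the semantics. |
| `position_embeddings` | Encodes the position of each token in the sequence (1st, 2nd, ...), which compensates for the fact that the Transformer itself is insensitive to word order (it processes all positions in parallel and cannot see order directly). |
| `token_type_embeddings` | Distinguishes the two sentences of a pair (`token_type_ids=0` for sentence 1, `=1` for sentence 2) and gives the NSP task its sentence-boundary information. |
| `LayerNorm` | Applies layer normalization to the embedding vectors (zero mean and unit variance, followed by a learned scale and shift), which stabilizes the input distribution and speeds up training. |
| `dropout` | Randomly zeroes elements of the embedding vectors (with probability `config.dropout`), which keeps the model from relying too heavily on particular features and improves generalization. |
| `position_ids` | Predefined position indices (`[0,1,2,...,max_position_embeddings-1]`) used as input to `position_embeddings`; they are not trained (registered as a non-parameter tensor with `register_buffer`). |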

2.前向传播方法 `forward`：生成最终嵌入向量

前向传播流程详解：

位置索引适配：position_ids 原本是长度为max_position_embeddings的全局索引，通过[:, :input_ids.size(1)]截取与输入序列长度（seq_len）一致的部分（如输入序列长度为 128，则取[0,1,...,127]），确保位置嵌入与输入序列长度匹配。
句子类型 ID 兜底：若未提供token_type_ids（如单句输入场景），默认生成全 0 张量，所有 token 均被标记为 “句 1”，避免输入缺失导致错误。
嵌入向量融合：词嵌入、位置嵌入、句子类型嵌入的形状均为(batch_size, seq_len, hidden_size)，三者直接相加（element-wise addition），实现三种信息的初步融合。这是 BERT 的设计特点：通过简单加法而非复杂拼接来融合多维度信息，减少参数和计算量。
归一化与正则化：
- LayerNorm 对融合后的嵌入向量进行归一化，避免因不同嵌入的数值范围差异导致训练不稳定；
- dropout 随机失活部分神经元，防止模型过度拟合训练数据中的噪声。

BertEmbeddings 是 BERT 模型的 “输入转换器”，通过词嵌入、位置嵌入和句子类型嵌入的融合，将离散的文本 ID 转换为富含多维度信息的连续向量，并通过归一化和正则化确保训练稳定性。它为整个 BERT 模型提供了高质量的初始特征表示，是模型理解文本语义的基础。

Step 4: Implementing the `BertSelfAttention` class

BertSelfAttention 是 BERT 模型 Transformer 层的核心组件，继承自 PyTorch 的nn.Module，专门实现多头自注意力机制（Multi-Head Self-Attention）。其核心功能是让序列中的每个 token “关注” 序列内所有其他 token，捕捉 token 间的依赖关系（如语法关联、语义关联），并通过 “多头” 设计同时捕捉不同维度的注意力信息（如有的头关注局部语法，有的头关注全局语义），最终生成融合注意力信息的特征向量。

Transformer 模型的核心是自注意力机制，而 BERT 的每个 Transformer 层都以 “多头自注意力” 为核心。BertSelfAttention 的作用是：接收来自嵌入层（或上一层 Transformer）的隐藏态（hidden_states），通过计算 token 间的注意力权重，动态整合每个 token 的全局上下文信息，解决传统 RNN 模型 “无法并行计算” 和 “长距离依赖捕捉弱” 的问题，为 BERT 的语义理解能力提供基础。

1. Initializer `__init__`: basic components of multi-head self-attention

class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_heads = config.num_heads
        self.hidden_size = config.hidden_size
        self.head_size = config.hidden_size // config.num_heads
        assert self.head_size * self.num_heads == self.hidden_size, "hidden_size must be divisible by num_heads"
        self.query = nn.Linear(config.hidden_size, config.hidden_size)
        self.key = nn.Linear(config.hidden_size, config.hidden_size)
        self.value = nn.Linear(config.hidden_size, config.hidden_size)
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.dropout)
    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_heads, self.head_size)
        x = x.view(new_x_shape)
        return x.permute(0, 2, 1, 3)
    def forward(self, hidden_states, attention_mask=None):
        mixed_query_layer = self.query(hidden_states)
        mixed_key_layer = self.key(hidden_states)
        mixed_value_layer = self.value(hidden_states)
        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)
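        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        # Scale by sqrt(d_k) to keep the softmax out of its saturated region
        attention_scores = attention_scores / (self.head_size ** 0.5)
        if attention_mask is not None:
            # attention_mask is 1 for real tokens and 0 for [PAD]; broadcast it to
            # (batch, 1, 1, seq_len) and add a large negative bias at PAD positions only
            attention_mask = attention_mask.unsqueeze(1).unsqueeze(2).to(attention_scores.dtype)
            attention_scores = attention_scores + (1.0 - attention_mask) * -10000.0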
        attention_probs = nn.Softmax(dim=-1)(attention_scores)
        attention_probs = self.dropout(attention_probs)
        context_layer = torch.matmul(attention_probs, value_layer)
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size,)
        context_layer = context_layer.view(new_context_layer_shape)
        output = self.dense(context_layer)
        return output

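Core components:

Parameter relationship: hidden_size = num_heads × head_size. This is the basis of multi-head attention: the hidden_size-dimensional hidden states are split into num_heads parallel subspaces of dimension head_size each, so that several heads can attend independently.
Q/K/V linear layers: self-attention computes attention weights from the similarity between Q and K, then uses those weights to take a weighted sum of V. The three linear layers map the incoming hidden states (which already combine token, position and segment information) to the Q, K and V vectors.
dense linear layer: after attention, the num_heads results of dimension head_size are concatenated back to hidden_size, and this layer mixes the concatenated features.

2. Helper function `transpose_for_scores`: splitting out the head dimension

What it does:

Why is the reshape needed? The Q/K/V tensors have shape (batch_size, seq_len, hidden_size) and must be rearranged into a per-head layout before attention can be computed for all heads in parallel:
1. view: split hidden_size into (num_heads, head_size), giving (batch, seq_len, num_heads, head_size);
2. permute: move the num_heads dimension in front of seq_len, giving (batch, num_heads, seq_len, head_size). The score computation can then run per head, each head producing its own seq_len×seq_len attention matrix.
Example shapes (batch=32, seq_len=128, num_heads=12, head_size=64): (32, 128, 768) → (32, 128, 12, 64) → (32, 12, 128, 64).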
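The shape change can be checked on a small tensor. The sketch below re-implements the split as a free function; `split_heads` and the toy sizes are illustrative names and values, not part of the class above.

```python
import torch

def split_heads(x: torch.Tensor, num_heads: int) -> torch.Tensor:
    """(batch, seq_len, hidden) -> (batch, num_heads, seq_len, head_size)."""
    batch, seq_len, hidden = x.shape
    head_size = hidden // num_heads
    x = x.view(batch, seq_len, num_heads, head_size)
    return x.permute(0, 2, 1, 3)

# Toy sizes: batch=2, seq_len=4, hidden=6, split into 3 heads of size 2
x = torch.arange(2 * 4 * 6, dtype=torch.float32).view(2, 4, 6)
heads = split_heads(x, num_heads=3)
print(heads.shape)
```

Head j of a token is simply the slice `[j * head_size : (j + 1) * head_size]` of that token's hidden vector; no values are changed, only the layout.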

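3. Forward method `forward`: the full multi-head self-attention computation

Key steps and the reasoning behind them:

1. Computing attention scores

Core formula: Attention Scores = Q × K^T / √d_k (d_k = head_size)
- Q × K^T: the dot product of each token's Q with every token's K. The result has shape (batch, num_heads, seq_len, seq_len), where scores[i][j][m][n] is how strongly token m attends to token n in head j of sample i.
- Dividing by √d_k: for large d_k the dot products grow large in magnitude, which pushes the softmax into a saturated region where its gradients are close to 0. Scaling keeps the score distribution flatter and avoids this.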
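A small numeric sketch of the scaling effect, using only the standard library. The raw scores are made-up values chosen to be of the size one might see for d_k = 64; they are not taken from a trained model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_scores = [16.0, 8.0, 0.0]                            # illustrative unscaled dot products
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]  # [2.0, 1.0, 0.0]

raw_probs = softmax(raw_scores)
scaled_probs = softmax(scaled_scores)
print([round(p, 4) for p in raw_probs])
print([round(p, 4) for p in scaled_probs])
```

Without scaling, almost all of the probability mass lands on one token and the softmax is nearly one-hot; after scaling, the other tokens keep a meaningful share.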

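2. Applying the attention mask

Purpose: [PAD] tokens in the input are padding and carry no information, so the model must ignore them.
Implementation:
- The original attention_mask has shape (batch, seq_len), with 0 for PAD and 1 for real tokens;
- unsqueeze(1).unsqueeze(2) expands it to (batch, 1, 1, seq_len), which broadcasts against the attention scores;
- -10000.0 is added at the invalid positions only, i.e. the additive term is (1 - attention_mask) × -10000.0, so real tokens are left unchanged. The softmax maps such large negative scores to probabilities close to 0, which masks out the invalid tokens.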
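The masking rule can be illustrated on a single row of scores in plain Python. `masked_softmax` is a hypothetical helper written for this illustration; the mask convention (1 = real token, 0 = PAD) follows the description above.

```python
import math

def masked_softmax(scores, mask, penalty=-10000.0):
    """Softmax over one row of scores; positions with mask == 0 are suppressed."""
    masked = [s + (1.0 - m) * penalty for s, m in zip(scores, mask)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 0.5, 3.0]
mask = [1, 1, 1, 0]          # the last position is [PAD]
probs = masked_softmax(scores, mask)
print([round(p, 4) for p in probs])
```

Although the PAD position has the highest raw score, it receives essentially zero probability, and the remaining probabilities still sum to 1.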

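3. Computing the context vectors

Weighted sum: context_layer = Attention Probs × V. The attention weights are applied to the V vectors, so tokens with higher attention contribute more.
Concatenating heads: each head produces a (batch, seq_len, head_size) result; the num_heads results are concatenated to (batch, seq_len, hidden_size), i.e. num_heads×head_size, which matches the input hidden states so that later layers can consume it.
Role of contiguous(): permute returns a view whose memory layout is no longer contiguous, and view requires a compatible layout, so contiguous() is called first; otherwise view raises an error.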
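The need for `contiguous()` can be reproduced directly. This is a minimal sketch with toy sizes; the variable names are illustrative.

```python
import torch

batch, num_heads, seq_len, head_size = 2, 3, 4, 5
context = torch.randn(batch, num_heads, seq_len, head_size)

# Move heads back next to head_size: (batch, seq_len, num_heads, head_size)
permuted = context.permute(0, 2, 1, 3)
print(permuted.is_contiguous())

# view() cannot merge the last two dimensions of the permuted tensor
try:
    permuted.view(batch, seq_len, num_heads * head_size)
    view_failed = False
except RuntimeError:
    view_failed = True

# Copy into contiguous memory first, then merge the heads
merged = permuted.contiguous().view(batch, seq_len, num_heads * head_size)
print(view_failed, merged.shape)
```

`Tensor.reshape` would also work here, since it copies when a view is not possible; the explicit `contiguous().view(...)` form just makes the copy visible.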

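By implementing multi-head self-attention, BertSelfAttention turns BERT's hidden states into feature vectors that carry global context, and it is the main source of the model's semantic understanding. The implementation follows the mathematics of self-attention closely, while head splitting, masking and dropout keep training stable and help generalization.

Step 5: Implementing the `BertLayer` class

BertLayer is the basic unit of the Transformer encoder in BERT and subclasses PyTorch's nn.Module. It implements one complete encoder layer with two sub-layers, a multi-head self-attention sub-layer and a feed-forward sub-layer, each wrapped in a residual connection and layer normalization for training stability, and it outputs a more deeply processed sequence representation.

The body of BERT is a stack of BertLayer modules (12 layers in BERT-base, 24 in BERT-large). Each BertLayer receives the feature vectors from the previous layer (or the embedding layer), captures global dependencies within the sequence through self-attention, applies a position-wise non-linear transformation through the feed-forward network, and outputs a more abstract representation. Stacked together, the layers extract progressively more complex features, from word-level features up to syntactic and semantic ones.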

class BertLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attention = BertSelfAttention(config)
        self.attention_output_layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.attention_dropout = nn.Dropout(config.dropout)
        self.intermediate = nn.Linear(config.hidden_size, config.intermediate_size)
        self.intermediate_act_fn = nn.GELU()
        self.output = nn.Linear(config.intermediate_size, config.hidden_size)
        self.output_layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.output_dropout = nn.Dropout(config.dropout)
    def forward(self, hidden_states, attention_mask=None):
        attention_output = self.attention(hidden_states, attention_mask)
        attention_output = self.attention_dropout(attention_output)
        attention_output = self.attention_output_layer_norm(hidden_states + attention_output)
        intermediate_output = self.intermediate(attention_output)
        intermediate_output = self.intermediate_act_fn(intermediate_output)
        layer_output = self.output(intermediate_output)
        layer_output = self.output_dropout(layer_output)
        layer_output = self.output_layer_norm(attention_output + layer_output)
        return layer_output

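1. Initialization method `__init__`: building the components of a Transformer layer

Core components: BertLayer follows the Transformer encoder design. It has two sub-layers in series, each consisting of a core module plus dropout, a residual connection and layer normalization:

Component	Sub-layer	Description
`attention`	Self-attention	The `BertSelfAttention` module; computes attention between tokens in the sequence and captures global dependencies.
`attention_dropout`	Self-attention	Applies dropout to the attention output to reduce overfitting.
`attention_output_layer_norm`	Self-attention	Layer-normalizes the residual sum "input + attention output" to stabilize the feature distribution.
`intermediate`	Feed-forward	Linear layer that maps the attention sub-layer output (`hidden_size`) to a higher dimension (`intermediate_size`, usually 4× `hidden_size`).
`intermediate_act_fn`	Feed-forward	GELU activation; introduces the non-linearity.
`output`	Feed-forward	Linear layer that maps the features from `intermediate_size` back to `hidden_size`, matching the input dimension.
`output_dropout`	Feed-forward	Applies dropout to the feed-forward output.
`output_layer_norm`	Feed-forward	Layer-normalizes the residual sum "attention sub-layer output + feed-forward output" to stabilize the final output.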

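2. Forward method `forward`: the full computation of one Transformer layer

Walkthrough: the forward pass has two stages, and each follows the same pattern of core computation → dropout → residual connection → layer normalization. This pattern is what makes deep Transformer stacks trainable:

1. Self-attention sub-layer: capturing global dependencies

Core computation: BertSelfAttention computes each token's attention over all tokens in the sequence and produces features that carry global context (attention_output).
dropout: randomly zeroes part of the features so the model does not rely too heavily on particular attention patterns.
Residual connection: hidden_states + attention_output. Adding the input to the attention output lets the layer learn a correction to its input rather than a full new representation, and gives gradients a direct path through deep stacks.
Layer normalization: normalizes the residual sum over the hidden dimension to zero mean and unit variance (followed by a learnable scale and shift), which keeps feature magnitudes from drifting.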
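The "residual, then layer norm" pattern can be sketched for a single token vector in plain Python. This is a simplified illustration: dropout and the learnable scale and shift of `nn.LayerNorm` are omitted, and `residual_sublayer` is a hypothetical helper name.

```python
import math

def layer_norm(x, eps=1e-12):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_sublayer(x, sublayer):
    """Post-LN pattern used in BertLayer: LayerNorm(x + sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [1.0, 2.0, 3.0, 4.0]
# Stand-in for attention or the feed-forward network: any vector-to-vector function
out = residual_sublayer(x, lambda v: [0.5 * a for a in v])
print([round(v, 4) for v in out])
```

Whatever the sub-layer does to the scale of its input, the output of the block has mean 0 and variance 1 along the hidden dimension.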

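2. Feed-forward sub-layer: position-wise feature transformation

Expansion and activation: the intermediate layer expands the features from hidden_size to intermediate_size (e.g. 768→3072), giving the non-linearity a wider space to work in; the GELU activation supplies the non-linearity and is smoother than ReLU around zero.
Projection: the output layer maps the features back to hidden_size, so the output dimension matches the input and layers can be stacked.
Dropout, residual and normalization: same logic as in the attention sub-layer.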
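To make "smoother than ReLU" concrete, the exact GELU, x·Φ(x) with Φ the standard normal CDF, can be written with `math.erf` and compared with ReLU at a few points. This is the erf form that `nn.GELU()` uses by default.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={v:+.1f}  relu={relu(v):.4f}  gelu={gelu(v):.4f}")
```

Unlike ReLU, GELU lets small negative inputs through with a small negative output and has no kink at zero; for large positive inputs the two nearly coincide.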

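BertLayer is the basic building block of BERT. Combining a self-attention sub-layer with a feed-forward sub-layer, it both captures global dependencies and transforms each position's features, while residual connections and layer normalization keep the deep stack trainable. Stacking many BertLayer modules is what gives BERT its capacity.

Step 6: Implementing the `BertEncoder` class

BertEncoder is the encoder of the BERT model and subclasses PyTorch's nn.Module. It stacks several BertLayer (Transformer layer) modules and passes the embedding output through them in sequence, producing a high-level representation that encodes the text's global semantics and dependencies.

In the architecture, BertEncoder sits between the embedding layer (BertEmbeddings) and the prediction heads (the MLM and NSP heads). It receives the basic features from the embedding layer (token, position and segment information combined) and, layer by layer, turns shallow features such as surface word information into deeper ones such as syntactic structure, semantic relations and contextual associations, which the pre-training tasks (MLM and NSP) then consume.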

class BertEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])
    def forward(self, hidden_states, attention_mask=None):
        for layer_module in self.layer:
            hidden_states = layer_module(hidden_states, attention_mask)
        return hidden_states

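1. Initialization method `__init__`: building the stack of Transformer layers

Core logic:

nn.ModuleList is PyTorch's container for sub-modules. It registers the modules it holds, so their parameters are returned by model.parameters() and are updated by the optimizer.
The number of layers is set by config.num_hidden_layers, a key size parameter of BERT (12 for BERT-base, 24 for BERT-large). More layers give more capacity for abstraction, at a higher compute cost and with harder training.
All BertLayer modules share one configuration (config), so the hidden size (hidden_size), number of heads (num_heads) and so on are identical in every layer and the feature dimensions stay compatible between layers.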
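2. Forward method `forward`: layer-by-layer feature extraction

Walkthrough:

Sequential processing: hidden_states enters the first BertLayer, its output becomes the input of the second, and so on to the last layer. Each layer refines the features produced by the one before it.
Mask propagation: attention_mask is passed to every BertLayer, so [PAD] and other invalid tokens are masked in the attention computation of every layer and do not contaminate the features.
Shape invariance: every BertLayer takes and returns a tensor of shape (batch_size, seq_len, hidden_size), so the input and output of BertEncoder have the same shape; only the content of the features changes.

BertEncoder is the feature extractor of BERT. By stacking BertLayer modules it turns the basic features from the embedding layer into high-level features that encode global semantics and complex dependencies, which is why BERT can handle complex language structure and perform well across NLP tasks.

Step 7: Implementing the `BertModel` class

BertModel is the main body of the BERT model and subclasses PyTorch's nn.Module. It combines the core components (embedding layer, encoder, pooler) into the full path from token IDs to high-level features, and it is the base model for both pre-training (MLM and NSP) and downstream fine-tuning. Its job is to turn discrete token IDs into feature vectors that carry global semantics and context.

BertModel connects the raw text to the task predictions. It takes the tokenized IDs (input_ids and related tensors), converts them to initial vectors in the embedding layer, extracts deep features with the multi-layer Transformer encoder, and outputs two kinds of features: sequence-level features (one vector per token) and a sentence-level feature (one vector for the whole sequence). Different tasks use different outputs: MLM relies on the per-token features, NSP and classification on the sentence-level feature.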

class BertModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = nn.Linear(config.hidden_size, config.hidden_size)
        self.pooler_activation = nn.Tanh()
    def forward(self, input_ids, token_type_ids=None, attention_mask=None):
        embedding_output = self.embeddings(input_ids, token_type_ids)
        sequence_output = self.encoder(embedding_output, attention_mask)
        pooled_output = self.pooler(sequence_output[:, 0, :])
        pooled_output = self.pooler_activation(pooled_output)
        return sequence_output, pooled_output

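1. Initialization method `__init__`: building BERT's component chain

Core components: BertModel follows the usual input layer → feature extraction → output layer pattern:

Component	Description
`embeddings`	The `BertEmbeddings` module; converts `input_ids` (token IDs) and `token_type_ids` (segment IDs) into initial vectors (`embedding_output`) that combine token, position and segment embeddings.
`encoder`	The `BertEncoder` module, a stack of `num_hidden_layers` `BertLayer` (Transformer layer) modules; extracts deep features and outputs the contextual sequence features (`sequence_output`).
`pooler`	Linear layer that maps the feature of the `[CLS]` token (`sequence_output[:, 0, :]`) from `hidden_size` to the same dimension (a learned transformation).
`pooler_activation`	Tanh activation; squashes the pooled feature into [-1, 1]. The result is the sentence-level input to the NSP task or to downstream classifiers.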

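2. Forward method `forward`: the full path from text to features

Walkthrough: the forward pass of BertModel has three stages:

1. Embedding layer: text → initial vectors

self.embeddings(input_ids, token_type_ids) converts the discrete input IDs into continuous vectors:

input_ids (e.g. [101, 2182, 2310, ..., 102]) are mapped through the token, position and segment embeddings, which are combined into embedding_output of shape (batch, seq_len, hidden_size), carrying the basic lexical and structural information.

2. Encoder: initial vectors → deep sequence features

self.encoder(embedding_output, attention_mask) processes the initial vectors:

After num_hidden_layers BertLayer (Transformer layer) modules, embedding_output has become sequence_output: each token's vector now contains not only its own information but also context from every other token in the sequence, gathered through self-attention.
attention_mask is used throughout the encoder, so every layer masks [PAD] and other invalid tokens.

3. Pooler: sequence features → sentence-level feature

sequence_output[:, 0, :] takes the feature of the first token, [CLS], and passes it through the pooler linear layer and Tanh to give pooled_output:

[CLS] is a special token that BERT prepends to the input. Its initial embedding has no lexical meaning, but over the encoder layers it gradually aggregates information from the whole sequence and can be read as a summary of the input.
pooled_output is the sentence-level feature, suited to tasks that need the overall meaning, such as NSP (deciding whether two sentences are consecutive) or downstream classification such as sentiment analysis.

BertModel is the engine of BERT. With the embedding layer, encoder and pooler working together, it converts token IDs into rich feature vectors, outputting both per-token contextual features (for token-level tasks) and a whole-sequence feature (for sentence-level tasks). It is the basis on which the pre-train then fine-tune approach works across NLP tasks.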

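Step 8: Implementing the `BertPretrainingHeads` class

BertPretrainingHeads is the prediction-head module for BERT pre-training and subclasses PyTorch's nn.Module. It produces the outputs for BERT's two pre-training tasks, masked language modeling (MLM) and next sentence prediction (NSP), by converting the features from BertModel into task predictions. The pre-training loss is computed from these predictions.

In the pre-training setup, BertModel turns text into feature vectors (sequence_output and pooled_output), and BertPretrainingHeads uses them for two predictions:

MLM: predict the original value of randomly masked tokens (e.g. predict "吃" for the [MASK] in "我 [MASK] 苹果"), which forces the model to learn contextual semantics;
NSP: decide whether two input sentences are consecutive (e.g. "我喜欢跑步" followed by "它能锻炼身体" is consecutive, followed by "天空是蓝色的" it is not), which forces the model to learn relations between sentences.

Together the two tasks let BERT capture both token-level and sentence-level information, which later fine-tuning builds on.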

class BertPretrainingHeads(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.GELU()
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.mlm_head = nn.Linear(config.hidden_size, config.vocab_size)
        self.nsp_head = nn.Linear(config.hidden_size, 2)
    def forward(self, sequence_output, pooled_output):
        mlm_hidden = self.dense(sequence_output)
        mlm_hidden = self.activation(mlm_hidden)
        mlm_hidden = self.LayerNorm(mlm_hidden)
        mlm_logits = self.mlm_head(mlm_hidden)
        nsp_logits = self.nsp_head(pooled_output)
        return mlm_logits, nsp_logits

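1. Initialization method `__init__`: building the two prediction heads

Core components: the two tasks get different prediction paths. MLM needs a richer transformation because it predicts a specific token; NSP is a simple binary classification:

Component	Task	Description
`dense`	MLM	Linear layer applied to `sequence_output` (token-level features); a learned transformation from `hidden_size` to `hidden_size`.
`activation`	MLM	GELU activation; adds a non-linearity to the MLM transformation.
`LayerNorm`	MLM	Layer normalization of the transformed MLM features, which keeps the inputs to the output projection on a stable scale.
`mlm_head`	MLM	Linear layer that maps the MLM features (`hidden_size`) to the vocabulary size (`vocab_size`), producing for each position one unnormalized score (logit) per vocabulary entry.
`nsp_head`	NSP	Linear layer that maps `pooled_output` (sentence-level feature, `hidden_size`) directly to 2 dimensions, the logits for "consecutive" and "not consecutive".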

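2. Forward method `forward`: predictions for both tasks

Walkthrough: the forward pass handles MLM and NSP separately, each using one of the two BertModel outputs:

1. MLM: predicting masked tokens

Feature transformation: sequence_output (the deep feature of each token) goes through the dense layer, then GELU, then LayerNorm. This adapts the general-purpose encoder output to the specific job of predicting a token.
Prediction: mlm_head maps the transformed features to vocab_size dimensions. mlm_logits[i][j][k] is the unnormalized score (logit) for token j of sample i being vocabulary entry k. A softmax turns the logits into a probability distribution, and the MLM loss (cross-entropy) is computed from them.

2. NSP: deciding whether the sentences are consecutive

Direct prediction: pooled_output is the [CLS] feature, which aggregates the whole sequence, so it is mapped to 2 dimensions by nsp_head without further transformation. nsp_logits[i][0] is the logit for "not consecutive" and nsp_logits[i][1] the logit for "consecutive". A softmax turns them into probabilities, and the NSP loss (two-class cross-entropy) is computed from them.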
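The difference between logits, probabilities and the cross-entropy loss can be shown on one pair of NSP logits in plain Python. The logit values are made up for the example; the functions mirror what a softmax followed by cross-entropy computes for a single sample.

```python
import math

def log_softmax(logits):
    """Log-probabilities from raw logits, computed stably."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
    return [v - log_sum for v in logits]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -log_softmax(logits)[label]

nsp_logits = [0.3, 1.7]                      # illustrative scores for classes 0 and 1
probs = [math.exp(v) for v in log_softmax(nsp_logits)]
loss_if_label_1 = cross_entropy(nsp_logits, 1)
loss_if_label_0 = cross_entropy(nsp_logits, 0)
print([round(p, 4) for p in probs], round(loss_if_label_1, 4), round(loss_if_label_0, 4))
```

Logits can be any real numbers and only become probabilities after the softmax; the loss is small when the true class has the larger logit and grows when it does not.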

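BertPretrainingHeads is the final stage of the pre-training forward pass. With separate heads for MLM and NSP, it turns the features from BertModel into task predictions and so serves the goal of pre-training: learning contextual semantics and sentence relations at the same time. The design follows the pattern of a shared feature extractor with task-specific heads.

Step 9: Implementing the `BertForPretraining` class

BertForPretraining is the complete end-to-end pre-training model and subclasses PyTorch's nn.Module. It combines the feature extractor (BertModel) with the prediction heads (BertPretrainingHeads) and implements the loss computation, and it is the object the training code calls directly. It takes the input tensors and task labels and returns the total pre-training loss (MLM loss + NSP loss) together with the predictions.

In the pre-training setup, BertForPretraining ties the pieces together:

Upstream: it combines BertModel (feature extraction) and BertPretrainingHeads (prediction), chaining all earlier components (embeddings, encoder, heads) into one pipeline;
Downstream: it plugs into the training loop (e.g. train_loop), receiving the preprocessed data (input_ids and related tensors) and labels (mlm_labels, nsp_labels), and returning the loss for backpropagation plus the predictions for monitoring.

In short, it is the ready-to-use model for pre-training, wrapping everything from input to loss.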

class BertForPretraining(nn.Module):
    def __init__(self, bert_model):
        super().__init__()
        self.bert = bert_model
        self.cls = BertPretrainingHeads(bert_model.config)
    def forward(self, input_ids, token_type_ids=None, attention_mask=None, mlm_labels=None, nsp_labels=None):
        sequence_output, pooled_output = self.bert(input_ids, token_type_ids, attention_mask)
        mlm_logits, nsp_logits = self.cls(sequence_output, pooled_output)
        total_loss = None
        if mlm_labels is not None and nsp_labels is not None:
            mlm_loss_fct = nn.CrossEntropyLoss()
            nsp_loss_fct = nn.CrossEntropyLoss()
            mlm_loss = mlm_loss_fct(mlm_logits.view(-1, self.bert.config.vocab_size), mlm_labels.view(-1))
            nsp_loss = nsp_loss_fct(nsp_logits.view(-1, 2), nsp_labels.view(-1))
            total_loss = mlm_loss + nsp_loss
        return total_loss, mlm_logits, nsp_logits

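1. Initialization method `__init__`: combining the pre-training components

Core logic:

self.bert: receives an already constructed BertModel instance, which converts the input into high-level features (sequence_output and pooled_output); it is the feature extractor;
self.cls: constructs BertPretrainingHeads from the BertModel configuration (bert_model.config); it converts the features into MLM and NSP predictions (logits).

The design is modular: BertModel can be used on its own for feature extraction (e.g. in downstream fine-tuning), while BertForPretraining combines it with the heads specifically for pre-training.

2. Forward method `forward`: the full pre-training path from input to loss

Walkthrough: the forward pass has three stages:

1. Feature extraction (via `BertModel`)

self.bert(...) converts the input IDs (input_ids and related tensors) into two features:

sequence_output: (batch_size, seq_len, hidden_size), the contextual feature of each token, used for MLM;
pooled_output: (batch_size, hidden_size), the sentence-level feature, used for NSP.

2. Prediction (via `BertPretrainingHeads`)

self.cls(...) produces the predictions for both tasks:

mlm_logits: (batch_size, seq_len, vocab_size), one logit per vocabulary entry for each token;
nsp_logits: (batch_size, 2), the logits for "not consecutive" and "consecutive".

3. Loss computation

The loss is computed only when both mlm_labels and nsp_labels are provided (i.e. during training):

MLM loss: cross-entropy between mlm_logits and mlm_labels (the true IDs of the masked tokens), measuring how well the model predicts tokens from context;
NSP loss: cross-entropy between nsp_logits and nsp_labels (the true consecutiveness label), measuring how well the model judges sentence relations;
Total loss: the sum of the two, which is the pre-training objective; all parameters are updated from its gradient.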
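One detail worth checking when building mlm_labels: `nn.CrossEntropyLoss` skips targets equal to its `ignore_index`, which defaults to -100. The sketch below assumes, as a labeled assumption since the dataset code is not shown in this section, that unmasked positions are marked with -100 so that only masked tokens contribute to the MLM loss.

```python
import torch
import torch.nn as nn

vocab_size = 5
torch.manual_seed(0)
mlm_logits = torch.randn(1, 4, vocab_size)        # (batch, seq_len, vocab_size)
# Assumed labeling: positions 1 and 3 were masked; the others carry -100
mlm_labels = torch.tensor([[-100, 3, -100, 1]])

loss_fct = nn.CrossEntropyLoss()                   # ignore_index defaults to -100
loss = loss_fct(mlm_logits.view(-1, vocab_size), mlm_labels.view(-1))

# Same loss computed from the two masked positions alone
masked_only = loss_fct(mlm_logits[0, [1, 3]], torch.tensor([3, 1]))
print(loss.item(), masked_only.item())
```

If the labels instead held the original token ID at every position, the loss would also reward copying unmasked tokens, which is a different objective from MLM as described above.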

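BertForPretraining is the model that pre-training actually runs. By combining the feature extractor with the prediction heads it provides an end-to-end path from input to loss, and the joint two-task loss makes the model learn token-level and sentence-level information together.

(4) Training the model and visualizing the results

Step 1: Implementing the `train_bert` function

train_bert is the complete pre-training routine. It covers device selection, data loading, optimizer setup, the training loop, loss monitoring and model saving. It is written for the MLM+NSP pre-training task, uses mixed-precision training, gradient accumulation and learning-rate scheduling, and plots the training curves.

In the pre-training pipeline this function brings together the model (BertForPretraining), the preprocessed training data (train_dataset) and the training settings (learning rate, batch size and so on), and optimizes the model over several epochs. Its goal is to minimize the joint MLM+NSP loss on the given dataset so that the model learns general-purpose representations.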

def train_bert(model, train_dataset, batch_size=16, num_train_epochs=4, learning_rate=2e-5,
               output_dir="./bert_pretrained", vis_output_dir="./bert_visualizations"):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"使用设备：{device}")
    model.to(device)
    # 初始化可视化工具（自动处理中文字体）
    visualizer = BertTrainVisualizer(vis_output_dir)
    # 数据加载器
    train_dataloader = data.DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=0
    )
    print(f"数据加载器：共{len(train_dataloader)}个batch")
    # 优化器与调度器
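    optimizer = AdamW(model.parameters(), lr=learning_rate, eps=1e-8)
    # The scheduler is stepped once per optimizer update (every 2 batches, see the
    # gradient accumulation below), so total_iters counts updates, not batches
    updates_per_epoch = (len(train_dataloader) + 1) // 2
    total_steps = updates_per_epoch * num_train_epochs
    scheduler = LinearLR(optimizer, start_factor=1.0, end_factor=0.0, total_iters=total_steps)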
    # Mixed precision: pick the AMP API by feature detection, since
    # torch.amp.GradScaler only exists in newer PyTorch releases
    use_amp = device.type == "cuda"
    scaler, autocast_context = None, None
    if use_amp:
        if hasattr(torch, "amp") and hasattr(torch.amp, "GradScaler"):
            # Newer API: the device is the first argument of GradScaler
            scaler = torch.amp.GradScaler(device.type)
            autocast_context = torch.amp.autocast(device_type=device.type)
            print("启用混合精度训练（torch.amp）")
        else:
            # Older API: CUDA-specific AMP classes
            scaler = torch.cuda.amp.GradScaler()
            autocast_context = torch.cuda.amp.autocast()
            print("启用混合精度训练（torch.cuda.amp）")
    else:
        print("CPU环境不支持混合精度，禁用")
    # 创建模型输出目录
    os.makedirs(output_dir, exist_ok=True)
    model.train()
    for epoch in range(num_train_epochs):
        total_loss = 0.0
        progress_bar = tqdm(train_dataloader, desc=f"Epoch {epoch + 1}/{num_train_epochs}")
        optimizer.zero_grad()  # start each epoch with clean gradients
        for step, batch in enumerate(progress_bar):
            batch = {k: v.to(device) for k, v in batch.items()}
            # Forward pass, under autocast when mixed precision is enabled
            if use_amp:
                with autocast_context:
                    loss, _, _ = model(
                        input_ids=batch["input_ids"],
                        token_type_ids=batch["token_type_ids"],
                        attention_mask=batch["attention_mask"],
                        mlm_labels=batch["mlm_labels"],
                        nsp_labels=batch["nsp_labels"]
                    )
            else:
                # CPU: plain FP32
                loss, _, _ = model(
                    input_ids=batch["input_ids"],
                    token_type_ids=batch["token_type_ids"],
                    attention_mask=batch["attention_mask"],
                    mlm_labels=batch["mlm_labels"],
                    nsp_labels=batch["nsp_labels"]
                )
            # Gradient accumulation over 2 micro-batches: scale the loss so the
            # accumulated gradient is an average, but log the unscaled value
            loss_accum = loss / 2
            loss_actual = loss.item()
            total_loss += loss_accum.item()
            # Record metrics
            current_lr = optimizer.param_groups[0]["lr"]
            visualizer.record_step(loss_actual, current_lr)
            # Backward pass; update only on every 2nd step or on the last batch
            is_update_step = (step + 1) % 2 == 0 or step == len(train_dataloader) - 1
            if use_amp:
                scaler.scale(loss_accum).backward()
                if is_update_step:
                    scaler.step(optimizer)
                    scaler.update()
                    optimizer.zero_grad()  # clear only after the accumulated update
                    scheduler.step()
            else:
                loss_accum.backward()
                if is_update_step:
                    optimizer.step()
                    optimizer.zero_grad()  # clear only after the accumulated update
                    scheduler.step()
            # Update the progress bar
            progress_bar.set_postfix({"step_loss": f"{loss_actual:.4f}", "lr": f"{current_lr:.6f}"})
        # 计算平均损失
        avg_loss = (total_loss / len(train_dataloader)) * 2
        print(f"Epoch {epoch + 1} 完成 | 平均损失：{avg_loss:.4f} | 当前学习率：{current_lr:.6f}")
        # 生成可视化图表
        visualizer.plot_train_loss_curve(epoch)
        visualizer.plot_learning_rate_curve(epoch)
        # 保存checkpoint
        torch.save({
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "loss": avg_loss
        }, os.path.join(output_dir, f"checkpoint_epoch_{epoch + 1}.pt"))
    # 生成最终总结图
    visualizer.plot_final_summary(num_train_epochs)
    # 保存最终模型
    torch.save(model.state_dict(), os.path.join(output_dir, "final_model.pt"))
    print(f"训练完成！")
    print(f"模型保存路径：{output_dir}")
    print(f"可视化图表保存路径：{vis_output_dir}")

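Parameter	Type	Default	Description
`model`	`nn.Module`	-	The `BertForPretraining` instance to train, including feature extractor and prediction heads.
`train_dataset`	dataset object	-	The preprocessed training set; each item must provide `input_ids`, `token_type_ids`, `attention_mask`, `mlm_labels` and `nsp_labels`.
`batch_size`	`int`	16	Samples per batch. Larger batches parallelize better but need more memory.
`num_train_epochs`	`int`	4	Number of full passes over the dataset. More epochs may fit the data better, at the risk of overfitting.
`learning_rate`	`float`	2e-5	Initial learning rate, decayed linearly to 0 by the scheduler. 2e-5 is a value typical of BERT fine-tuning; the original BERT pre-training used 1e-4.
`output_dir`	`str`	"./bert_pretrained"	Where checkpoints and the final model are saved.
`vis_output_dir`	`str`	"./bert_visualizations"	Where the training plots (loss curve, learning-rate curve and so on) are saved.

The function is organized as preparation → training loop → wrap-up, with the main training techniques applied in each stage:

1. Preparation: environment and components

(1) Device selection (CPU/GPU chosen automatically)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # move the model to the target device

Checks whether a GPU is available and prefers CUDA. Full-size BERT models are large enough that a GPU is strongly recommended; a scaled-down configuration can also run on a CPU;
Moves the model parameters to the target device so that model and data are on the same device, which avoids device-mismatch errors.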

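(2) Initializing the visualizer

visualizer = BertTrainVisualizer(vis_output_dir)  # custom visualization helper

Records the loss and learning rate during training and draws the curves, so trends such as convergence or instability are visible at a glance. Detecting overfitting would additionally require a validation loss, which this function does not compute.

(3) Data loader (batching)

train_dataloader = data.DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,  # reshuffle the samples every epoch
    num_workers=0  # load in the main process; raise this if the environment supports worker processes
)

Splits the dataset into batches of batch_size and makes them iterable;
shuffle=True gives each epoch a different sample order, so the model cannot pick up on the ordering of the data.

(4) Optimizer and learning-rate scheduler

# Optimizer: AdamW (Adam with decoupled weight decay)
optimizer = AdamW(model.parameters(), lr=learning_rate, eps=1e-8)
# Scheduler: linear decay from the initial value to 0; total_iters must equal the number
# of scheduler.step() calls, i.e. optimizer updates (one per 2 batches), not batches
total_steps = ((len(train_dataloader) + 1) // 2) * num_train_epochs
scheduler = LinearLR(optimizer, start_factor=1.0, end_factor=0.0, total_iters=total_steps)

AdamW: Adam with decoupled weight decay. The decay is applied directly to the weights instead of being added to the gradient as an L2 penalty, which regularizes more predictably with adaptive optimizers. The call above leaves the weight-decay coefficient at the optimizer's default;
LinearLR: the learning rate falls linearly with each scheduler step, so early updates are large and later ones fine-grained. Because the scheduler is stepped once per optimizer update, total_iters has to count updates; if it counted batches while the scheduler steps every second batch, the rate would only fall to half of its initial value by the end of training.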
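The interaction between gradient accumulation and the scheduler length can be checked with the closed form of a linear schedule. `linear_lr` below is a stand-alone re-implementation of that formula for illustration, not PyTorch's class, and the batch counts are made-up numbers.

```python
def linear_lr(base_lr, step, total_iters, start_factor=1.0, end_factor=0.0):
    """Learning rate after `step` scheduler steps of a linear schedule."""
    progress = min(step, total_iters) / total_iters
    return base_lr * (start_factor + (end_factor - start_factor) * progress)

num_batches, num_epochs, accum_steps = 100, 4, 2
updates_per_epoch = (num_batches + accum_steps - 1) // accum_steps   # ceil division
total_updates = updates_per_epoch * num_epochs                        # scheduler steps actually taken

# total_iters matches the number of scheduler steps: the rate reaches 0 at the end
lr_end_matched = linear_lr(2e-5, total_updates, total_iters=total_updates)
# total_iters counts batches instead: only half of the decay has happened
lr_end_mismatched = linear_lr(2e-5, total_updates, total_iters=num_batches * num_epochs)
print(lr_end_matched, lr_end_mismatched)
```

With two-step accumulation, a schedule sized in batches ends training at half of the initial rate instead of at 0.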

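(5) Mixed-precision configuration (faster training)

use_amp = device.type == "cuda"  # mixed precision is only enabled on GPU here
if use_amp:
    # Pick the AMP API by feature detection rather than by version number
    if hasattr(torch, "amp") and hasattr(torch.amp, "GradScaler"):
        scaler = torch.amp.GradScaler(device.type)  # gradient scaler
        autocast_context = torch.amp.autocast(device_type=device.type)  # autocast context
    else:
        scaler = torch.cuda.amp.GradScaler()
        autocast_context = torch.cuda.amp.autocast()

Mixed precision: part of the computation runs in FP16 (half precision), which lowers memory use and compute time. The source quotes a 2-3× speed-up over FP32; the actual gain depends on the GPU and the model. GradScaler guards against gradient underflow;
Compatibility: torch.amp.autocast appeared earlier than torch.amp.GradScaler, which only exists in newer releases and takes the device as its first argument, so checking for the attribute is more robust than comparing version numbers. Older installations fall back to torch.cuda.amp.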

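2. Training loop: optimizing the model over several epochs

(1) Switching the model to training mode

model.train()  # training-mode behaviour: dropout is active (this model has no BatchNorm)

(2) Iterating over epochs

for epoch in range(num_train_epochs):
    total_loss = 0.0
    progress_bar = tqdm(train_dataloader, desc=f"Epoch {epoch + 1}/{num_train_epochs}")  # progress bar
    optimizer.zero_grad()  # start each epoch with clean gradients
    for step, batch in enumerate(progress_bar):
        # Move the batch to the device
        batch = {k: v.to(device) for k, v in batch.items()}
        inputs = dict(
            input_ids=batch["input_ids"],
            token_type_ids=batch["token_type_ids"],
            attention_mask=batch["attention_mask"],
            mlm_labels=batch["mlm_labels"],
            nsp_labels=batch["nsp_labels"]
        )
        # Forward pass (with mixed precision when enabled)
        if use_amp:
            with autocast_context:  # automatic precision selection
                loss, _, _ = model(**inputs)
        else:
            loss, _, _ = model(**inputs)  # CPU: plain FP32
        # Gradient accumulation (simulates a larger batch)
        loss_accum = loss / 2  # average over the 2 accumulated micro-batches
        loss_actual = loss.item()  # unscaled loss, for logging
        total_loss += loss_accum.item()
        # Record metrics (loss, learning rate)
        current_lr = optimizer.param_groups[0]["lr"]
        visualizer.record_step(loss_actual, current_lr)
        # Backward pass; update on every 2nd step or on the last batch
        is_update_step = (step + 1) % 2 == 0 or step == len(train_dataloader) - 1
        if use_amp:
            scaler.scale(loss_accum).backward()  # scale the loss to avoid gradient underflow
            if is_update_step:
                scaler.step(optimizer)  # unscale the gradients and update
                scaler.update()  # adjust the scale factor
                optimizer.zero_grad()  # clear only after the accumulated update
                scheduler.step()  # advance the learning rate
        else:
            loss_accum.backward()
            if is_update_step:
                optimizer.step()
                optimizer.zero_grad()
                scheduler.step()
        # Show the current loss and learning rate on the progress bar
        progress_bar.set_postfix({"step_loss": f"{loss_actual:.4f}", "lr": f"{current_lr:.6f}"})

Training techniques used in the loop:

Gradient accumulation: when GPU memory is too small for a large batch_size, gradients from 2 consecutive batches are summed before the parameters are updated, which is equivalent to training with batch_size×2. For this to work the gradients must be cleared only after an update, not at every step, and each loss is divided by 2 so that the sum is an average;
Mixed-precision forward pass: the loss is computed inside autocast_context, which runs selected operations in FP16;
Gradient scaling: scaler multiplies the loss by a scale factor before backpropagation so that small FP16 gradients do not underflow to 0, and unscales them again before the optimizer step.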
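The claim that two accumulated micro-batches are equivalent to one batch of twice the size can be verified on a one-parameter model in plain Python. The data and the model y ≈ w·x are toy assumptions made for the check.

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w * x - y)^2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# One batch of 4 samples
full_batch_grad = grad_mse(w, xs, ys)

# Two micro-batches of 2 samples; each loss is divided by accum_steps
# before its gradient is added, exactly like loss / 2 in the loop
accum_steps = 2
accumulated = 0.0
for i in range(0, len(xs), 2):
    accumulated += grad_mse(w, xs[i:i + 2], ys[i:i + 2]) / accum_steps
print(full_batch_grad, accumulated)
```

The two gradients agree only because nothing clears the accumulator between the micro-batches; zeroing it at every step would leave just the last micro-batch's half-weighted gradient.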

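3. Wrap-up: per-epoch and final steps

(1) After each epoch

# Average loss of this epoch
avg_loss = (total_loss / len(train_dataloader)) * 2
print(f"Epoch {epoch + 1} 完成 | 平均损失：{avg_loss:.4f} | 当前学习率：{current_lr:.6f}")
# Draw the plots (loss curve, learning-rate curve)
visualizer.plot_train_loss_curve(epoch)
visualizer.plot_learning_rate_curve(epoch)
# Save a checkpoint (model and optimizer state, for resuming)
torch.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    "loss": avg_loss
}, os.path.join(output_dir, f"checkpoint_epoch_{epoch + 1}.pt"))

Loss monitoring: the average loss per epoch shows whether the model is converging (loss keeps falling and then levels off). total_loss accumulates loss / 2 per batch, so the factor 2 restores the mean unscaled loss;
Visualization: the loss and learning-rate curves are regenerated each epoch, which makes problems such as a noisy loss or an unexpected learning-rate schedule easy to spot;
Checkpointing: the model parameters and optimizer state are saved so that training can continue from a given epoch after an interruption. The scheduler and gradient-scaler states are not saved, so they would have to be rebuilt when resuming.

(2) After all epochs

# Draw the final summary plot (loss and learning rate over the whole run)
visualizer.plot_final_summary(num_train_epochs)
# Save the final model (parameters only, to keep the file small)
torch.save(model.state_dict(), os.path.join(output_dir, "final_model.pt"))

Final plot: summarizes the metrics of all epochs in one figure;
Model saving: writes the final parameters, which downstream fine-tuning can load.

train_bert is the single entry point for pre-training, covering everything from environment setup to saving the model. It combines mixed precision, gradient accumulation and learning-rate scheduling with monitoring, so that the model can be trained stably and the result reused for fine-tuning.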

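Step 2: Implementing the `BertTrainVisualizer` class

BertTrainVisualizer is the visualization helper for pre-training. It records, plots and saves the key training metrics (loss and learning rate) and takes care of rendering Chinese text in the plots, so that the training state (convergence, learning-rate decay and so on) can be monitored visually.

Plots are an important way to judge whether pre-training is behaving (is the loss falling, is the learning rate decaying as planned). The class:

records the loss and learning rate of every training step;
draws a loss curve, a learning-rate curve and a combined summary plot;
works around matplotlib's broken rendering of Chinese text (boxes or garbled characters), so that Chinese labels display correctly.

With these plots, problems such as a strongly fluctuating loss or an unexpected learning-rate schedule can be spotted early and the training settings adjusted.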

class BertTrainVisualizer:
    def __init__(self, output_dir="./bert_visualizations"):
        self.output_dir = output_dir
        os.makedirs(self.output_dir, exist_ok=True)
        self.train_losses = []
        self.learning_rates = []
        # 关键修改：获取并强制使用Windows系统中文字体（直接指定字体文件路径）
        self.chinese_font_path = self._get_windows_chinese_font()
        # 生成matplotlib可用的字体属性（直接基于文件路径，不依赖字体名称）
        self.chinese_font = fm.FontProperties(fname=self.chinese_font_path)
        plt.rcParams["axes.unicode_minus"] = False  # 解决负号显示异常
        sns.set_style("whitegrid")
    def _get_windows_chinese_font(self):
        """
        强制获取Windows系统中的中文字体文件路径（优先级：黑体 > 微软雅黑 > 宋体）
        直接返回字体文件路径，避免字体名称匹配问题
        """
        # Windows系统默认字体路径（固定）
        font_candidates = [
            "C:/Windows/Fonts/simhei.ttf",  # 黑体（优先）
            "C:/Windows/Fonts/msyh.ttc",  # 微软雅黑（备选1）
            "C:/Windows/Fonts/msyhbd.ttc",  # 微软雅黑Bold（备选2）
            "C:/Windows/Fonts/simsun.ttc",  # 宋体（备选3）
            "C:/Windows/Fonts/simkai.ttf"  # 楷体（备选4）
        ]
        # 检查字体文件是否存在，返回第一个可用的
        for font_path in font_candidates:
            if os.path.exists(font_path):
                print(f"成功找到中文字体文件：{os.path.basename(font_path)}")
                print(f"字体路径：{font_path}")
                return font_path
        # 若所有字体都缺失，提示用户手动安装
        raise FileNotFoundError("""
        严重错误：未找到Windows系统默认中文字体！
        """)
    def record_step(self, loss, lr):
        self.train_losses.append(loss)
        self.learning_rates.append(lr)
    def plot_train_loss_curve(self, current_epoch):
        plt.figure(figsize=(10, 6))
        sns.lineplot(
            x=range(1, len(self.train_losses) + 1),
            y=self.train_losses,
            color="#2E86AB",
            linewidth=1.5,
            label=f"训练损失 (Epoch {current_epoch + 1})"
        )
        # 每个文本元素显式指定中文字体（避免全局设置失效）
        plt.xlabel("训练步数", fontsize=12, fontproperties=self.chinese_font)
        plt.ylabel("损失值", fontsize=12, fontproperties=self.chinese_font)
        plt.title(
            f"BERT预训练损失曲线 (截至Epoch {current_epoch + 1})",
            fontsize=14, pad=20,
            fontproperties=self.chinese_font
        )
        plt.legend(fontsize=10, prop=self.chinese_font)  # 图例也指定字体
        plt.grid(True, alpha=0.3)
        save_path = os.path.join(self.output_dir, f"train_loss_curve_epoch_{current_epoch + 1}.png")
        plt.tight_layout()
        plt.savefig(save_path, dpi=300)
        plt.close()
    def plot_learning_rate_curve(self, current_epoch):
        plt.figure(figsize=(10, 6))
        sns.lineplot(
            x=range(1, len(self.learning_rates) + 1),
            y=self.learning_rates,
            color="#A23B72",
            linewidth=1.5,
            label=f"学习率 (Epoch {current_epoch + 1})"
        )
        # 显式指定字体
        plt.xlabel("训练步数", fontsize=12, fontproperties=self.chinese_font)
        plt.ylabel("学习率", fontsize=12, fontproperties=self.chinese_font)
        plt.title(
            f"BERT预训练学习率变化曲线 (截至Epoch {current_epoch + 1})",
            fontsize=14, pad=20,
            fontproperties=self.chinese_font
        )
        plt.legend(fontsize=10, prop=self.chinese_font)
        plt.grid(True, alpha=0.3)
        save_path = os.path.join(self.output_dir, f"lr_curve_epoch_{current_epoch + 1}.png")
        plt.tight_layout()
        plt.savefig(save_path, dpi=300)
        plt.close()
    def plot_final_summary(self, num_total_epochs):
        fig, ax1 = plt.subplots(figsize=(12, 7))
        # 左轴：损失
        color1 = "#2E86AB"
        ax1.set_xlabel("训练步数", fontsize=12, fontproperties=self.chinese_font)
        ax1.set_ylabel("损失值", color=color1, fontsize=12, fontproperties=self.chinese_font)
        ax1.plot(
            range(1, len(self.train_losses) + 1),
            self.train_losses,
            color1,
            linewidth=1.5,
            label="训练损失"
        )
        ax1.tick_params(axis="y", labelcolor=color1)
        ax1.grid(True, alpha=0.3)
        # 右轴：学习率
        ax2 = ax1.twinx()
        color2 = "#A23B72"
        ax2.set_ylabel("学习率", color=color2, fontsize=12, fontproperties=self.chinese_font)
        ax2.plot(
            range(1, len(self.learning_rates) + 1),
            self.learning_rates,
            color2,
            linewidth=1.5,
            linestyle="--",
            label="学习率"
        )
        ax2.tick_params(axis="y", labelcolor=color2)
        ax2.set_yscale("log")
        # 合并图例（显式指定字体）
        lines1, labels1 = ax1.get_legend_handles_labels()
        lines2, labels2 = ax2.get_legend_handles_labels()
        ax1.legend(
            lines1 + lines2, labels1 + labels2,
            loc="upper right", fontsize=10,
            prop=self.chinese_font
        )
        plt.title(
            f"BERT预训练综合指标图（总Epoch：{num_total_epochs}）",
            fontsize=14, pad=20,
            fontproperties=self.chinese_font
        )
        plt.tight_layout()
        save_path = os.path.join(self.output_dir, "final_training_summary.png")
        plt.savefig(save_path, dpi=300)
        plt.close()

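1. Initialization method `__init__`: setting up the plotting environment

Initialization steps:

Directory creation: makes sure the output directory exists, so saving a plot cannot fail on a missing path;
Data storage: creates the lists that accumulate loss and learning-rate values during training;
Chinese font handling: _get_windows_chinese_font returns the path of a Chinese font file on the system, which solves the problem that matplotlib's default font has no Chinese glyphs;
Plot style: enables correct rendering of the minus sign and applies a seaborn style.

2. Chinese font helper `_get_windows_chinese_font`

What it does:

Fixes the classic garbled-Chinese problem in matplotlib: the default font has no Chinese glyphs, so Chinese text is drawn as boxes. The method points matplotlib at the file path of a Chinese font shipped with Windows (rather than a font name), which avoids name-matching problems;
Priority order: SimHei first, then other common Chinese fonts in turn, to cover more machines;
Error handling: if none of the fonts exists, it raises FileNotFoundError instead of silently producing broken plots. Because only Windows paths are checked, constructing the visualizer on Linux or macOS raises this error, so the candidate list has to be extended there.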
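For machines that are not running Windows, one option is to extend the candidate list and return None instead of raising, letting the caller decide on a fallback. The sketch below is a hypothetical variant, not the class's method: `find_cjk_font` is an illustrative name, and the macOS and Linux paths are assumptions that differ between systems and distributions.

```python
import os

# Candidate font files; the non-Windows paths are assumptions and may not
# exist on a given machine
FONT_CANDIDATES = [
    "C:/Windows/Fonts/simhei.ttf",                              # Windows: SimHei
    "C:/Windows/Fonts/msyh.ttc",                                # Windows: Microsoft YaHei
    "/System/Library/Fonts/PingFang.ttc",                       # macOS (assumed location)
    "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",   # Linux with Noto CJK (assumed)
]

def find_cjk_font(candidates=FONT_CANDIDATES):
    """Return the first existing path from candidates, or None if none exists."""
    for path in candidates:
        if os.path.exists(path):
            return path
    return None

font_path = find_cjk_font()
print(font_path if font_path else "No CJK font found; plots would need English labels")
```

A caller could pass the returned path to `fm.FontProperties(fname=...)` as the class does, and skip the font properties when the result is None.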

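3. Recording method `record_step`

Purpose:

Called once per training step (one batch) to record the current loss (loss) and learning rate (lr). The values accumulate in lists that the plotting methods read;
It is the basis of monitoring: a complete training curve can only be drawn if every step is recorded.

4. Single-metric plots (loss and learning-rate curves)

(1) `plot_train_loss_curve`: plotting the training loss

Purpose:

After each epoch, draws the loss curve up to that point, showing how the loss changes with the training step;
Main use: judging convergence. A loss that keeps falling and then levels off indicates normal convergence; strong fluctuation or a rising loss points to unstable training, for example a learning rate that is too high.

(2) `plot_learning_rate_curve`: plotting the learning rate

Purpose:

After each epoch, draws the learning rate against the training step, to verify that it decays as intended (here, linearly);
Main use: confirming that the scheduler works. A learning rate that does not decay, or that jumps, can make training fail.

5. Summary plot `plot_final_summary`

Purpose:

After training has finished, draws one figure with two y-axes: the loss on the left axis and the learning rate on the right axis (log scale);
Main value: it shows how loss and learning rate relate (for example, whether the loss keeps falling as the learning rate decays) and gives a one-figure summary of the whole run.

BertTrainVisualizer is the monitoring component of pre-training. By recording metrics, drawing plots and handling Chinese text, it makes the otherwise opaque training process observable and easier to analyze.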

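(V) Test Function

Step 1: implementing the main function

The main function is the entry point of BERT pre-training. It ties together the whole flow, from tokenizer initialization and model configuration to dataset loading and the start of training. It acts as a coordinator that makes the components (tokenizer, model, dataset, training function) run in order, so that the BERT model is pre-trained on the given dataset.

As the starting point of the program, main has four jobs:

Initialize the core components needed for pre-training (tokenizer, model, dataset);
Configure the training parameters (model size, batch size, number of epochs and so on);
Check that key components work (for example, that the tokenizer behaves as expected);
Call the training function (train_bert) to start pre-training.

In this way the separate steps "data preparation → model construction → training" are combined into one flow that can be run directly.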

def main():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("=" * 50)
    print("初始化BERT分词器...")
    tokenizer = BertManualTokenizer()
    # Validate the tokenizer by counting tokens that map to the [UNK] ID
    # (convert_tokens_to_ids is where out-of-vocabulary tokens become [UNK])
    test_text = "Hello, world! This is a test of the Bert tokenizer."
    test_tokens = tokenizer.tokenize(test_text)
    test_ids = tokenizer.convert_tokens_to_ids(test_tokens)
    unk_count = test_ids.count(tokenizer.unk_token_id)
    print(f"测试文本分词结果：{test_tokens}")
    print(f"分词结果中[UNK]数量：{unk_count}")
    if unk_count > 0:
        print("警告：仍存在[UNK]，可能影响训练效果！")
    print("=" * 50)
    # Model configuration (lightweight, suitable for CPU or GPU)
    config = BertConfig(
        vocab_size=tokenizer.vocab_size,
        hidden_size=256,  # smaller hidden size for faster training
        max_position_embeddings=128,  # shorter maximum sequence length
        num_hidden_layers=2,  # fewer layers
        num_heads=4,  # fewer attention heads
        intermediate_size=1024,  # smaller feed-forward layer
        dropout=0.1
    )
    print(f"模型配置：隐藏层={config.hidden_size}, 层数={config.num_hidden_layers}, 头数={config.num_heads}")
    # Build the dataset
    try:
        train_dataset = TextDataset(
            txt_file_path="Economic Globalization.txt",
            tokenizer=tokenizer,
            max_seq_len=128
        )
        print(f"数据集加载成功：共{len(train_dataset)}个样本")
    except Exception as e:
        print(f"数据集加载失败：{str(e)}")
        return
    # Build the model and train it
    print("\n初始化BERT预训练模型...")
    bert_model = BertModel(config)
    model = BertForPretraining(bert_model)
    print(f"模型可训练参数：{sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
    # Pick batch_size by device (4 on CPU, 8 on GPU)
    batch_size = 4 if device.type == "cpu" else 8
    print(f"\n训练参数：batch_size={batch_size}, epochs={3}, lr={2e-5}")
    train_bert(
        model=model,
        train_dataset=train_dataset,
        batch_size=batch_size,
        num_train_epochs=3,
        learning_rate=2e-5,
        output_dir="./custom_bert",
        vis_output_dir="./bert_visualizations"
    )

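1. Device initialization (automatic CPU/GPU selection)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Role: detects whether a GPU is available in the current environment and prefers CUDA for training; without a GPU it falls back to the CPU (slower, suitable for debugging).
Why it matters: deep learning training depends on hardware acceleration, and this step determines the device that later decisions, such as the batch size, are based on.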

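2. Tokenizer initialization and validation

# Initialize the custom BERT tokenizer
print("初始化BERT分词器...")
tokenizer = BertManualTokenizer()
# Validate the tokenizer by counting tokens that map to the [UNK] ID
# (convert_tokens_to_ids is where out-of-vocabulary tokens become [UNK])
test_text = "Hello, world! This is a test of the Bert tokenizer."
test_tokens = tokenizer.tokenize(test_text)
test_ids = tokenizer.convert_tokens_to_ids(test_tokens)
unk_count = test_ids.count(tokenizer.unk_token_id)
print(f"测试文本分词结果：{test_tokens}")
print(f"分词结果中[UNK]数量：{unk_count}")
if unk_count > 0:
    print("警告：仍存在[UNK]，可能影响训练效果！")

Role:
- Initializes BertManualTokenizer (the custom tokenizer that turns text into token IDs the model can consume);
- Validates the tokenizer on a test sentence: every [UNK] (out-of-vocabulary token) hides the original word from the model, so a large number of them degrades training, and the check warns about this up front. With the vocabulary shown in this article, which contains no punctuation entries, the punctuation marks in the test sentence map to [UNK].
Why it matters: tokenization is the first step of any NLP pipeline, and the quality of the tokenizer determines the quality of the features the model receives. This step confirms that the tokenizer handles ordinary text.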
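The [UNK] check can be tried on a toy vocabulary without the full tokenizer. The following is a minimal sketch under that assumption: the vocabulary, the ID values and the helper name count_unknown are hypothetical, and the lookup mirrors the `vocab.get(token, unk_id)` pattern used by convert_tokens_to_ids.

```python
def count_unknown(tokens, vocab, unk_id):
    """Return how many tokens map to unk_id, plus the tokens themselves."""
    unknown_tokens = [t for t in tokens if vocab.get(t, unk_id) == unk_id]
    return len(unknown_tokens), unknown_tokens


# Toy vocabulary (hypothetical IDs): words only, no punctuation entries
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 5, "world": 6}
count, missing = count_unknown(["hello", ",", "world", "!"], toy_vocab, unk_id=1)
```

Listing the offending tokens, and not only their count, shows directly which entries would have to be added to the vocabulary.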

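3. Lightweight model configuration (matched to the available hardware)

config = BertConfig(
    vocab_size=tokenizer.vocab_size,  # must match the tokenizer vocabulary size
    hidden_size=256,  # hidden dimension (default 768; reduced to speed up training)
    max_position_embeddings=128,  # maximum sequence length (default 512; reduced for short inputs)
    num_hidden_layers=2,  # number of Transformer layers (default 12; fewer layers, lower cost)
    num_heads=4,  # number of attention heads (default 12; fewer heads, less computation)
    intermediate_size=1024,  # feed-forward dimension (default 3072; reduced to keep the model light)
    dropout=0.1  # dropout probability (regularization against overfitting)
)
print(f"模型配置：隐藏层={config.hidden_size}, 层数={config.num_hidden_layers}, 头数={config.num_heads}")

Role: creates a BertConfig object that defines the core model parameters, using a lightweight setting in which every dimension is much smaller than in standard BERT.
Why it matters:
- It fits limited hardware (a CPU or a low-end GPU in particular), reducing memory use and computation time so that training is feasible on an ordinary machine;
- It keeps vocab_size equal to the tokenizer vocabulary size, which avoids shape mismatches in the embedding and output layers.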

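4. Loading the training dataset

try:
    train_dataset = TextDataset(
        txt_file_path="Economic Globalization.txt",  # path of the training text file
        tokenizer=tokenizer,  # tokenizer used to split the text
        max_seq_len=128  # maximum sequence length (same as in the model configuration)
    )
    print(f"数据集加载成功：共{len(train_dataset)}个样本")
except Exception as e:
    print(f"数据集加载失败：{str(e)}")
    return

Role: loads the data from the given text file (Economic Globalization.txt) and uses TextDataset to turn it into training samples the model can consume directly (with fields such as input_ids and token_type_ids).
Error handling: a missing file does not raise here, because TextDataset catches FileNotFoundError itself and falls back to a built-in sample text. Other failures, such as a text that is too short to form sentence pairs, are caught by this try/except, which prints the error and stops before training starts.
Why it matters: data is the basis of training, and this step turns raw text into structured model input.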

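5. Model creation and parameter count

# Create the BERT backbone and the pre-training model
bert_model = BertModel(config)
model = BertForPretraining(bert_model)
# Count the trainable parameters (a quick view of the model size)
print(f"模型可训练参数：{sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

Role:
- Creates BertModel (the feature-extraction backbone) and BertForPretraining (the full model with the MLM and NSP heads) from the configuration;
- Computes and prints the number of trainable parameters (with thousands separators), which gives a direct sense of the model size (more parameters generally mean more capacity, but also a higher training cost).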
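The printed parameter count can be cross-checked by hand. The sketch below derives the count from the layer definitions used in this article (three embedding tables plus LayerNorm, four hidden-to-hidden linear layers and two LayerNorms per encoder layer, a two-layer feed-forward block, the pooler, and the MLM and NSP heads). It is an illustration that assumes those layer shapes; the helper names are hypothetical, and the number of attention heads does not appear because it does not change the parameter count.

```python
def linear_params(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm_params(size):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * size


def count_pretraining_params(vocab_size, hidden_size, max_position_embeddings,
                             num_hidden_layers, intermediate_size, type_vocab_size=2):
    h = hidden_size
    # Word, position and segment embeddings, followed by one LayerNorm
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * h
    embeddings += layer_norm_params(h)
    # One encoder layer: query, key, value, attention output, then the feed-forward block
    layer = 4 * linear_params(h, h) + layer_norm_params(h)
    layer += linear_params(h, intermediate_size) + linear_params(intermediate_size, h)
    layer += layer_norm_params(h)
    pooler = linear_params(h, h)
    # Pre-training heads: transform + LayerNorm + MLM projection + NSP classifier
    heads = linear_params(h, h) + layer_norm_params(h)
    heads += linear_params(h, vocab_size) + linear_params(h, 2)
    return embeddings + num_hidden_layers * layer + pooler + heads
```

Calling `count_pretraining_params(tokenizer.vocab_size, 256, 128, 2, 1024)` should agree with the number printed by main for the lightweight configuration, provided the model classes match the layer shapes assumed here.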

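6. Training parameters and starting the run

# Choose a batch_size that fits the device (smaller on CPU, where memory is the limit)
batch_size = 4 if device.type == "cpu" else 8
print(f"\n训练参数：batch_size={batch_size}, epochs={3}, lr={2e-5}")
# Start pre-training
train_bert(
    model=model,
    train_dataset=train_dataset,
    batch_size=batch_size,
    num_train_epochs=3,
    learning_rate=2e-5,
    output_dir="./custom_bert",
    vis_output_dir="./bert_visualizations"
)

Role:
- Adapts batch_size to the device: batch_size=4 on CPU and batch_size=8 on GPU, as a trade-off between training speed and memory use;
- Sets the number of epochs (3), the learning rate (2e-5, a value commonly used when fine-tuning BERT; the original BERT pre-training used 1e-4), the model output directory and the chart output directory;
- Calls train_bert with all components (model, dataset, parameters) to start pre-training.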
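The batch size and epoch count above, together with the gradient accumulation over 2 batches used inside train_bert, determine how many optimizer updates a run performs. The sketch below works this out; it assumes the DataLoader keeps the last incomplete batch (the PyTorch default) and that a leftover batch at the end of an epoch still triggers an update, as in train_bert. The function name optimizer_updates is hypothetical.

```python
import math


def optimizer_updates(num_samples, batch_size, accumulation_steps, num_epochs):
    """Return (batches per epoch, total optimizer updates) for one training run."""
    # The last batch of an epoch may be smaller, hence the ceiling
    batches_per_epoch = math.ceil(num_samples / batch_size)
    # One update per accumulation_steps batches, plus one for any leftover batches
    updates_per_epoch = math.ceil(batches_per_epoch / accumulation_steps)
    return batches_per_epoch, updates_per_epoch * num_epochs
```

The total number of updates is also the number of times the learning-rate scheduler is stepped, so it is the natural value for the schedule length; the effective batch size per update is batch_size × accumulation_steps.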

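The main function is the scheduling center of BERT pre-training: by coordinating the tokenizer, model, dataset and training function, it implements the end-to-end path from raw text to a pre-trained model. It combines ease of use (one call starts everything), adaptability (settings follow the device) and robustness (component checks and exception handling).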

(VI) Running the Code

if __name__ == "__main__":
    main()

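if __name__ == "__main__": main() is the standard Python idiom for controlling the entry point of a program. It makes sure that main() runs only when the script is executed directly, and not when the file is imported as a module by another file. In the context of the code above, it is the start switch of the whole BERT pre-training program.

What it does in detail

Meaning of the __name__ variable: in Python, every module (.py file) has a built-in variable __name__:
- When the module is run directly (for example with python script.py), __name__ is set to "__main__";
- When the module is imported by another file (for example import script), __name__ is set to the module name, that is, the file name without the .py extension ("script").
Role of the condition: if __name__ == "__main__": checks the value of __name__ to decide whether the code below it runs:
- If the script is run directly (__name__ == "__main__"), main() is executed and the full BERT pre-training flow starts (tokenizer initialization, model configuration, data loading, training);
- If the script is imported as a module (__name__ != "__main__"), main() is not executed, so an import never triggers pre-training by accident (another file may only want to reuse the classes or functions defined here).
Meaning in this code: combined with the main() function above, which chains the whole pre-training flow, the statement has two effects:
- It is the entry point of the program: when the user runs the script directly (for example python bert_pretrain.py), main() is called and the flow from data preparation to model training starts;
- It keeps the module safe to reuse: when another script imports BertModel, train_bert or other components from this file, the import does not start pre-training (no wasted resources, no conflicting logic).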
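The two cases can be observed directly with the standard library. The following is a minimal sketch: it writes a small throwaway script to a temporary directory (the script content and the names demo_script and demo_name_guard are hypothetical), imports it as a module, then executes it as if it had been run from the command line.

```python
import importlib.util
import os
import runpy
import tempfile

# A tiny stand-in for a training script with a guarded entry point
SCRIPT = '''
ran_main = False

def main():
    global ran_main
    ran_main = True

if __name__ == "__main__":
    main()
'''


def demo_name_guard():
    """Return (main ran on import?, main ran on direct execution?)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_script.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(SCRIPT)
        # Case 1: import as a module, so __name__ is "demo_script"
        spec = importlib.util.spec_from_file_location("demo_script", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        ran_on_import = module.ran_main
        # Case 2: execute like `python demo_script.py`, so __name__ is "__main__"
        script_globals = runpy.run_path(path, run_name="__main__")
        ran_as_script = script_globals["ran_main"]
    return ran_on_import, ran_as_script
```

`demo_name_guard()` reports that main() is skipped on import and executed on direct execution, which is the behavior the guard is meant to provide.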

In short, the guard lets the user start training with a single command while keeping the file safe to import as a module.

IV. Complete Python Implementation of the BERT Model

import torch
import torch.nn as nn
import torch.utils.data as data
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR
import random
import os
import re
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import matplotlib.font_manager as fm  # font management utilities
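def create_mlm_mask(input_ids, vocab_size, mask_token_id=103, pad_token_id=0):
    """Apply BERT-style 80/10/10 masking; modifies input_ids in place and returns (input_ids, labels)."""
    labels = input_ids.clone()
    # Select 15% of the positions as prediction targets. Only [PAD] is excluded,
    # so [CLS] and [SEP] can be selected as well.
    probability_matrix = torch.full(labels.shape, 0.15)
    special_tokens_mask = (input_ids == pad_token_id)
    probability_matrix.masked_fill_(special_tokens_mask, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    # -100 is the default ignore_index of nn.CrossEntropyLoss, so unselected positions add no loss
    labels[~masked_indices] = -100
    # 80% of the selected positions are replaced with [MASK]
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    input_ids[indices_replaced] = mask_token_id
    # Half of the remaining 20% (10% overall) are replaced with a random token ID;
    # the last 10% keep their original token
    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    random_words = torch.randint(vocab_size, labels.shape, dtype=torch.long)
    input_ids[indices_random] = random_words[indices_random]
    return input_ids, labels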
class TextProcessor:
    def __init__(self, tokenizer, max_seq_len=512):
        self.tokenizer = tokenizer
        self.max_seq_len = max_seq_len
        self.cls_token = tokenizer.cls_token
        self.sep_token = tokenizer.sep_token
        self.pad_token = tokenizer.pad_token
    def split_long_text(self, text, chunk_size=256, overlap=50):
        tokens = self.tokenizer.tokenize(text)
        chunks = []
        start = 0
        while start < len(tokens):
            end = start + chunk_size
            chunk_tokens = tokens[start:end]
            chunks.append(chunk_tokens)
            start = end - overlap
        return chunks
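    def create_sentence_pairs(self, chunks, prob_next=0.5):
        # Build NSP pairs: label 0 = sentence2 really follows sentence1, label 1 = random chunk
        pairs = []
        for i in range(len(chunks) - 1):
            sentence1 = chunks[i]
            # Chunks that can serve as a negative example (neither sentence1 nor its true successor)
            candidates = [j for j in range(len(chunks)) if j not in (i, i + 1)]
            if random.random() < prob_next or not candidates:
                # With only two chunks there is no negative candidate, so use the positive pair
                # (the original retry loop would never terminate in that case)
                sentence2 = chunks[i + 1]
                label = 0
            else:
                sentence2 = chunks[random.choice(candidates)]
                label = 1
            pairs.append((sentence1, sentence2, label))
        return pairs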
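    def tokenize_pair(self, sentence1, sentence2):
        # Copy the token lists, then trim the longer one until the pair fits together with
        # [CLS] and two [SEP] tokens. Cutting the joined sequence at max_seq_len instead
        # would drop sentence2 and the final [SEP] whenever sentence1 alone is too long.
        sentence1, sentence2 = list(sentence1), list(sentence2)
        while len(sentence1) + len(sentence2) > self.max_seq_len - 3:
            longer = sentence1 if len(sentence1) >= len(sentence2) else sentence2
            longer.pop()
        tokens = [self.cls_token] + sentence1 + [self.sep_token] + sentence2 + [self.sep_token]
        # Segment 0 covers [CLS] + sentence1 + [SEP]; segment 1 covers sentence2 + [SEP]
        token_type_ids = [0] * (len(sentence1) + 2) + [1] * (len(sentence2) + 1)
        input_ids = self.tokenizer.convert_tokens_to_ids(tokens)
        attention_mask = [1] * len(input_ids)
        # Pad up to max_seq_len; padded positions get attention_mask 0
        padding_length = self.max_seq_len - len(input_ids)
        input_ids += [self.tokenizer.pad_token_id] * padding_length
        attention_mask += [0] * padding_length
        token_type_ids += [0] * padding_length
        return {
            "input_ids": torch.tensor(input_ids, dtype=torch.long),
            "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
            "token_type_ids": torch.tensor(token_type_ids, dtype=torch.long)
        }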
class TextDataset(data.Dataset):
    def __init__(self, txt_file_path, tokenizer, max_seq_len=512, min_chunk_len=50):
        self.txt_file_path = txt_file_path
        self.tokenizer = tokenizer
        self.processor = TextProcessor(tokenizer, max_seq_len)
        self.min_chunk_len = min_chunk_len
        self.data = self._load_and_preprocess()
    def _load_and_preprocess(self):
        try:
            with open(self.txt_file_path, 'r', encoding='utf-8') as f:
                text = f.read().replace('\n', ' ').strip()
        except FileNotFoundError:
            print(f"未找到 {self.txt_file_path}，使用内置示例文本进行训练")
            text = """
            Economic globalization refers to the increasing interdependence of world economies
            as a result of the growing scale of cross-border trade of commodities and services,
            flow of international capital and wide and rapid spread of technologies.
            It reflects the continuing expansion and mutual integration of market frontiers,
            and is an irreversible trend for the economic development in the whole world at the turn of the millennium.
            The rapid growing significance of information in all types of productive activities
            and marketization are the two major driving forces for economic globalization.
            In other words, the fast globalization of the world’s economies in recent years
            is largely based on the rapid development of science and technologies,
            has resulted from the environment in which market economic system has been fast spreading throughout the world.
            """
        chunks = self.processor.split_long_text(text)
        chunks = [c for c in chunks if len(c) >= self.min_chunk_len]
        if len(chunks) < 2:
            raise ValueError("文本过短，无法生成足够的句子对")
        pairs = self.processor.create_sentence_pairs(chunks)
        samples = []
        for s1, s2, nsp_label in pairs:
            features = self.processor.tokenize_pair(s1, s2)
            features["nsp_label"] = torch.tensor(nsp_label, dtype=torch.long)
            samples.append(features)
        return samples
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        sample = self.data[idx]
        input_ids = sample["input_ids"].clone()
        input_ids, mlm_labels = create_mlm_mask(
            input_ids,
            vocab_size=self.tokenizer.vocab_size,
            mask_token_id=self.tokenizer.mask_token_id,
            pad_token_id=self.tokenizer.pad_token_id
        )
        return {
            "input_ids": input_ids,
            "attention_mask": sample["attention_mask"],
            "token_type_ids": sample["token_type_ids"],
            "mlm_labels": mlm_labels,
            "nsp_labels": sample["nsp_label"]
        }
class BertManualTokenizer:
    def __init__(self):
        self.vocab = self.create_basic_vocab()
        self.ids_to_tokens = {v: k for k, v in self.vocab.items()}
        # Special tokens
        self.unk_token = "[UNK]"
        self.cls_token = "[CLS]"
        self.sep_token = "[SEP]"
        self.mask_token = "[MASK]"
        self.pad_token = "[PAD]"
        # Special token IDs
        self.unk_token_id = self.vocab[self.unk_token]
        self.cls_token_id = self.vocab[self.cls_token]
        self.sep_token_id = self.vocab[self.sep_token]
        self.mask_token_id = self.vocab[self.mask_token]
        self.pad_token_id = self.vocab[self.pad_token]
        self.basic_tokenizer = self._create_basic_tokenizer()
    def create_basic_vocab(self):
        """Build a small vocabulary that covers the test sentence, the sample text and common suffixes."""
        vocab = {}
        # 1. Special tokens
        special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
        for token in special_tokens:
            vocab[token] = len(vocab)
        # 2. Single lowercase letters (fallback for the subword split)
        for c in "abcdefghijklmnopqrstuvwxyz":
            vocab[c] = len(vocab)
        # 3. Common words (covering the test sentence and the sample text)
        common_words = [
            "hello", "world", "this", "is", "a", "test", "of", "the", "bert", "tokenizer",
            "economic", "globalization", "refers", "to", "increasing", "interdependence",
            "world", "economies", "result", "growing", "scale", "cross-border", "trade",
            "commodities", "services", "flow", "international", "capital", "wide", "rapid",
            "spread", "technologies", "reflects", "continuing", "expansion", "mutual",
            "integration", "market", "frontiers", "irreversible", "trend", "development",
            "millennium", "rapid", "growing", "significance", "information", "productive",
            "activities", "marketization", "major", "driving", "forces", "other", "words",
            "fast", "largely", "based", "science", "technologies", "resulted", "environment",
            "system", "spreading", "throughout"
        ]
        # dict.fromkeys drops duplicates but keeps list order, so token IDs are identical on
        # every run (iterating over a set of strings can change order between Python processes,
        # which would make a saved checkpoint incompatible with a rebuilt vocabulary)
        for word in dict.fromkeys(common_words):
            if word not in vocab:
                vocab[word] = len(vocab)
        # 4. Common subwords (with the ## prefix, covering word suffixes)
        subwords = ["##s", "##ing", "##ed", "##ly", "##er", "##est", "##o", "##r", "##ld",
                    "##tion", "##ment", "##ity", "##ive", "##ize", "##al", "##able", "##ible"]
        for subword in subwords:
            if subword not in vocab:
                vocab[subword] = len(vocab)
        return vocab
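    def _create_basic_tokenizer(self):
        """Compile a simple regex that matches a run of word characters or a single punctuation mark."""
        pattern = r"(\w+|[^\w\s])"
        return re.compile(pattern)
    def basic_tokenize(self, text):
        """Basic tokenization: lowercase the text, then split it into words and punctuation."""
        text = text.lower().strip()
        return self.basic_tokenizer.findall(text)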
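    def wordpiece_tokenize(self, token):
        """Greedy longest-match WordPiece split with a single-character fallback."""
        sub_tokens = []
        start = 0
        n = len(token)
        while start < n:
            end = n
            found = False
            # Try the longest remaining substring first and shrink it from the right
            while end > start:
                substr = token[start:end]
                if start > 0:
                    substr = f"##{substr}"
                if substr in self.vocab:
                    sub_tokens.append(substr)
                    start = end
                    found = True
                    break
                end -= 1
            if not found:
                # Fallback: emit the single character if the vocabulary has it (letters a-z do);
                # anything else, such as punctuation or digits, becomes [UNK]
                char = token[start]
                sub_tokens.append(char if char in self.vocab else self.unk_token)
                start += 1
        return sub_tokens
    def tokenize(self, text):
        """Full pipeline: basic tokenization followed by the WordPiece split of each token."""
        tokens = []
        for basic_token in self.basic_tokenize(text):
            tokens.extend(self.wordpiece_tokenize(basic_token))
        return tokens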
    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.get(token, self.unk_token_id) for token in tokens]
    def convert_ids_to_tokens(self, ids):
        return [self.ids_to_tokens.get(id, self.unk_token) for id in ids]
    @property
    def vocab_size(self):
        return len(self.vocab)
class BertConfig:
    def __init__(self, vocab_size=30522, hidden_size=768, max_position_embeddings=512,
                 type_vocab_size=2, num_heads=12, intermediate_size=3072,
                 num_hidden_layers=12, dropout=0.1):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.num_heads = num_heads
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.dropout = dropout
class BertEmbeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.dropout)
        self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
    def forward(self, input_ids, token_type_ids=None):
        position_ids = self.position_ids[:, :input_ids.size(1)]
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)
        embeddings = self.word_embeddings(input_ids) + self.position_embeddings(
            position_ids) + self.token_type_embeddings(token_type_ids)
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings
class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_heads = config.num_heads
        self.hidden_size = config.hidden_size
        self.head_size = config.hidden_size // config.num_heads
        assert self.head_size * self.num_heads == self.hidden_size, "hidden_size must be divisible by num_heads"
        self.query = nn.Linear(config.hidden_size, config.hidden_size)
        self.key = nn.Linear(config.hidden_size, config.hidden_size)
        self.value = nn.Linear(config.hidden_size, config.hidden_size)
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.dropout)
    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_heads, self.head_size)
        x = x.view(new_x_shape)
        return x.permute(0, 2, 1, 3)
    def forward(self, hidden_states, attention_mask=None):
        mixed_query_layer = self.query(hidden_states)
        mixed_key_layer = self.key(hidden_states)
        mixed_value_layer = self.value(hidden_states)
        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)
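        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        # Scale by sqrt(head_size), as in scaled dot-product attention
        attention_scores = attention_scores / (self.head_size ** 0.5)
        if attention_mask is not None:
            # attention_mask is 1 for real tokens and 0 for padding. Broadcast it to
            # (batch, 1, 1, seq_len) and add a large negative bias to the padded keys only;
            # multiplying the mask itself by -10000 would suppress the real tokens instead.
            extended_mask = attention_mask.unsqueeze(1).unsqueeze(2).to(attention_scores.dtype)
            attention_scores = attention_scores + (1.0 - extended_mask) * -10000.0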
        attention_probs = nn.Softmax(dim=-1)(attention_scores)
        attention_probs = self.dropout(attention_probs)
        context_layer = torch.matmul(attention_probs, value_layer)
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.hidden_size,)
        context_layer = context_layer.view(new_context_layer_shape)
        output = self.dense(context_layer)
        return output
class BertLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attention = BertSelfAttention(config)
        self.attention_output_layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.attention_dropout = nn.Dropout(config.dropout)
        self.intermediate = nn.Linear(config.hidden_size, config.intermediate_size)
        self.intermediate_act_fn = nn.GELU()
        self.output = nn.Linear(config.intermediate_size, config.hidden_size)
        self.output_layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.output_dropout = nn.Dropout(config.dropout)
    def forward(self, hidden_states, attention_mask=None):
        attention_output = self.attention(hidden_states, attention_mask)
        attention_output = self.attention_dropout(attention_output)
        attention_output = self.attention_output_layer_norm(hidden_states + attention_output)
        intermediate_output = self.intermediate(attention_output)
        intermediate_output = self.intermediate_act_fn(intermediate_output)
        layer_output = self.output(intermediate_output)
        layer_output = self.output_dropout(layer_output)
        layer_output = self.output_layer_norm(attention_output + layer_output)
        return layer_output
class BertEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])
    def forward(self, hidden_states, attention_mask=None):
        for layer_module in self.layer:
            hidden_states = layer_module(hidden_states, attention_mask)
        return hidden_states
class BertModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = nn.Linear(config.hidden_size, config.hidden_size)
        self.pooler_activation = nn.Tanh()
    def forward(self, input_ids, token_type_ids=None, attention_mask=None):
        embedding_output = self.embeddings(input_ids, token_type_ids)
        sequence_output = self.encoder(embedding_output, attention_mask)
        pooled_output = self.pooler(sequence_output[:, 0, :])
        pooled_output = self.pooler_activation(pooled_output)
        return sequence_output, pooled_output
class BertPretrainingHeads(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.GELU()
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.mlm_head = nn.Linear(config.hidden_size, config.vocab_size)
        self.nsp_head = nn.Linear(config.hidden_size, 2)
    def forward(self, sequence_output, pooled_output):
        mlm_hidden = self.dense(sequence_output)
        mlm_hidden = self.activation(mlm_hidden)
        mlm_hidden = self.LayerNorm(mlm_hidden)
        mlm_logits = self.mlm_head(mlm_hidden)
        nsp_logits = self.nsp_head(pooled_output)
        return mlm_logits, nsp_logits
class BertForPretraining(nn.Module):
    def __init__(self, bert_model):
        super().__init__()
        self.bert = bert_model
        self.cls = BertPretrainingHeads(bert_model.config)
    def forward(self, input_ids, token_type_ids=None, attention_mask=None, mlm_labels=None, nsp_labels=None):
        sequence_output, pooled_output = self.bert(input_ids, token_type_ids, attention_mask)
        mlm_logits, nsp_logits = self.cls(sequence_output, pooled_output)
        total_loss = None
        if mlm_labels is not None and nsp_labels is not None:
            mlm_loss_fct = nn.CrossEntropyLoss()
            nsp_loss_fct = nn.CrossEntropyLoss()
            mlm_loss = mlm_loss_fct(mlm_logits.view(-1, self.bert.config.vocab_size), mlm_labels.view(-1))
            nsp_loss = nsp_loss_fct(nsp_logits.view(-1, 2), nsp_labels.view(-1))
            total_loss = mlm_loss + nsp_loss
        return total_loss, mlm_logits, nsp_logits
def train_bert(model, train_dataset, batch_size=16, num_train_epochs=4, learning_rate=2e-5,
               output_dir="./bert_pretrained", vis_output_dir="./bert_visualizations"):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"使用设备：{device}")
    model.to(device)
    # Visualization helper (it also takes care of the Chinese font)
    visualizer = BertTrainVisualizer(vis_output_dir)
    # Data loader
    train_dataloader = data.DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=0
    )
    print(f"数据加载器：共{len(train_dataloader)}个batch")
    # Optimizer and scheduler. scheduler.step() runs once per optimizer update, and
    # gradients are accumulated over 2 batches, so an epoch has ceil(batches / 2) updates.
    optimizer = AdamW(model.parameters(), lr=learning_rate, eps=1e-8)
    updates_per_epoch = (len(train_dataloader) + 1) // 2
    total_steps = updates_per_epoch * num_train_epochs
    scheduler = LinearLR(optimizer, start_factor=1.0, end_factor=0.0, total_iters=total_steps)
    # Mixed precision is enabled on GPU only
    use_amp = device.type == "cuda"
    scaler, autocast_context = None, None
    if use_amp:
        # Check for the API directly instead of parsing the version string
        if hasattr(torch, "amp") and hasattr(torch.amp, "GradScaler"):
            # Newer PyTorch releases: device-agnostic API in torch.amp
            scaler = torch.amp.GradScaler(device.type)
            autocast_context = torch.amp.autocast(device_type=device.type)
            print("启用混合精度训练（torch.amp）")
        else:
            # Older releases: CUDA-specific API in torch.cuda.amp
            scaler = torch.cuda.amp.GradScaler()
            autocast_context = torch.cuda.amp.autocast()
            print("启用混合精度训练（torch.cuda.amp）")
    else:
        print("CPU环境不支持混合精度，禁用")
    # Create the model output directory
    os.makedirs(output_dir, exist_ok=True)
    model.train()
    for epoch in range(num_train_epochs):
        total_loss = 0.0
        progress_bar = tqdm(train_dataloader, desc=f"Epoch {epoch + 1}/{num_train_epochs}")
        for step, batch in enumerate(progress_bar):
            batch = {k: v.to(device) for k, v in batch.items()}
            # Forward pass: mixed precision on GPU, full precision on CPU
            if use_amp:
                with autocast_context:
                    loss, _, _ = model(
                        input_ids=batch["input_ids"],
                        token_type_ids=batch["token_type_ids"],
                        attention_mask=batch["attention_mask"],
                        mlm_labels=batch["mlm_labels"],
                        nsp_labels=batch["nsp_labels"]
                    )
            else:
                loss, _, _ = model(
                    input_ids=batch["input_ids"],
                    token_type_ids=batch["token_type_ids"],
                    attention_mask=batch["attention_mask"],
                    mlm_labels=batch["mlm_labels"],
                    nsp_labels=batch["nsp_labels"]
                )
            # Gradient accumulation over 2 batches: backpropagate half of the loss,
            # but log the unscaled per-batch loss
            loss_accum = loss / 2
            loss_actual = loss.item()
            total_loss += loss_accum.item()
            # Record the metrics
            current_lr = optimizer.param_groups[0]["lr"]
            visualizer.record_step(loss_actual, current_lr)
            # Update every 2 batches, and on the last batch of the epoch
            is_update_step = (step + 1) % 2 == 0 or step == len(train_dataloader) - 1
            if use_amp:
                scaler.scale(loss_accum).backward()
                if is_update_step:
                    scaler.step(optimizer)
                    scaler.update()
                    scheduler.step()
                    # Clear gradients only after an update, so the two batches really accumulate
                    optimizer.zero_grad()
            else:
                loss_accum.backward()
                if is_update_step:
                    optimizer.step()
                    scheduler.step()
                    optimizer.zero_grad()
            # Update the progress bar
            progress_bar.set_postfix({"step_loss": f"{loss_actual:.4f}", "lr": f"{current_lr:.6f}"})
        # Average loss of the epoch (total_loss holds halved losses, hence the factor 2)
        avg_loss = (total_loss / len(train_dataloader)) * 2
        print(f"Epoch {epoch + 1} 完成 | 平均损失：{avg_loss:.4f} | 当前学习率：{current_lr:.6f}")
        # Generate the per-epoch charts
        visualizer.plot_train_loss_curve(epoch)
        visualizer.plot_learning_rate_curve(epoch)
        # Save a checkpoint
        torch.save({
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "loss": avg_loss
        }, os.path.join(output_dir, f"checkpoint_epoch_{epoch + 1}.pt"))
    # Generate the final summary chart
    visualizer.plot_final_summary(num_train_epochs)
    # Save the final model
    torch.save(model.state_dict(), os.path.join(output_dir, "final_model.pt"))
    print(f"训练完成！")
    print(f"模型保存路径：{output_dir}")
    print(f"可视化图表保存路径：{vis_output_dir}")
class BertTrainVisualizer:
    def __init__(self, output_dir="./bert_visualizations"):
        self.output_dir = output_dir
        os.makedirs(self.output_dir, exist_ok=True)
        self.train_losses = []
        self.learning_rates = []
        # Locate a Chinese font file on Windows and use it directly by file path
        self.chinese_font_path = self._get_windows_chinese_font()
        # Build matplotlib font properties from the file path (no dependency on the font name)
        self.chinese_font = fm.FontProperties(fname=self.chinese_font_path)
        plt.rcParams["axes.unicode_minus"] = False  # render minus signs correctly
        sns.set_style("whitegrid")
    def _get_windows_chinese_font(self):
        """
        Return the path of a Chinese font file on Windows
        (priority: SimHei > Microsoft YaHei > SimSun > KaiTi).
        Returning the file path avoids font-name matching problems.
        """
        # Default Windows font locations
        font_candidates = [
            "C:/Windows/Fonts/simhei.ttf",  # SimHei (preferred)
            "C:/Windows/Fonts/msyh.ttc",  # Microsoft YaHei (fallback 1)
            "C:/Windows/Fonts/msyhbd.ttc",  # Microsoft YaHei Bold (fallback 2)
            "C:/Windows/Fonts/simsun.ttc",  # SimSun (fallback 3)
            "C:/Windows/Fonts/simkai.ttf"  # KaiTi (fallback 4)
        ]
        # Return the first font file that exists
        for font_path in font_candidates:
            if os.path.exists(font_path):
                print(f"成功找到中文字体文件：{os.path.basename(font_path)}")
                print(f"字体路径：{font_path}")
                return font_path
        # No candidate exists (always the case outside Windows): fail loudly
        raise FileNotFoundError("""
        严重错误：未找到Windows系统默认中文字体！
        """)
    def record_step(self, loss, lr):
        self.train_losses.append(loss)
        self.learning_rates.append(lr)
    def plot_train_loss_curve(self, current_epoch):
        plt.figure(figsize=(10, 6))
        sns.lineplot(
            x=range(1, len(self.train_losses) + 1),
            y=self.train_losses,
            color="#2E86AB",
            linewidth=1.5,
            label=f"训练损失 (Epoch {current_epoch + 1})"
        )
        # Set the Chinese font explicitly on every text element (global settings may not take effect)
        plt.xlabel("训练步数", fontsize=12, fontproperties=self.chinese_font)
        plt.ylabel("损失值", fontsize=12, fontproperties=self.chinese_font)
        plt.title(
            f"BERT预训练损失曲线 (截至Epoch {current_epoch + 1})",
            fontsize=14, pad=20,
            fontproperties=self.chinese_font
        )
        plt.legend(fontsize=10, prop=self.chinese_font)  # the legend uses the same font
        plt.grid(True, alpha=0.3)
        save_path = os.path.join(self.output_dir, f"train_loss_curve_epoch_{current_epoch + 1}.png")
        plt.tight_layout()
        plt.savefig(save_path, dpi=300)
        plt.close()
    def plot_learning_rate_curve(self, current_epoch):
        plt.figure(figsize=(10, 6))
        sns.lineplot(
            x=range(1, len(self.learning_rates) + 1),
            y=self.learning_rates,
            color="#A23B72",
            linewidth=1.5,
            label=f"学习率 (Epoch {current_epoch + 1})"
        )
        # Set the font explicitly
        plt.xlabel("训练步数", fontsize=12, fontproperties=self.chinese_font)
        plt.ylabel("学习率", fontsize=12, fontproperties=self.chinese_font)
        plt.title(
            f"BERT预训练学习率变化曲线 (截至Epoch {current_epoch + 1})",
            fontsize=14, pad=20,
            fontproperties=self.chinese_font
        )
        plt.legend(fontsize=10, prop=self.chinese_font)
        plt.grid(True, alpha=0.3)
        save_path = os.path.join(self.output_dir, f"lr_curve_epoch_{current_epoch + 1}.png")
        plt.tight_layout()
        plt.savefig(save_path, dpi=300)
        plt.close()
    def plot_final_summary(self, num_total_epochs):
        fig, ax1 = plt.subplots(figsize=(12, 7))
        # Left axis: loss
        color1 = "#2E86AB"
        ax1.set_xlabel("训练步数", fontsize=12, fontproperties=self.chinese_font)
        ax1.set_ylabel("损失值", color=color1, fontsize=12, fontproperties=self.chinese_font)
        ax1.plot(
            range(1, len(self.train_losses) + 1),
            self.train_losses,
            color=color1,  # pass the color by keyword rather than as a positional format string
            linewidth=1.5,
            label="训练损失"
        )
        ax1.tick_params(axis="y", labelcolor=color1)
        ax1.grid(True, alpha=0.3)
        # Right axis: learning rate
        ax2 = ax1.twinx()
        color2 = "#A23B72"
        ax2.set_ylabel("学习率", color=color2, fontsize=12, fontproperties=self.chinese_font)
        ax2.plot(
            range(1, len(self.learning_rates) + 1),
            self.learning_rates,
            color=color2,  # pass the color by keyword rather than as a positional format string
            linewidth=1.5,
            linestyle="--",
            label="学习率"
        )
        ax2.tick_params(axis="y", labelcolor=color2)
        ax2.set_yscale("log")
        # Merge the legends of both axes (font set explicitly)
        lines1, labels1 = ax1.get_legend_handles_labels()
        lines2, labels2 = ax2.get_legend_handles_labels()
        ax1.legend(
            lines1 + lines2, labels1 + labels2,
            loc="upper right", fontsize=10,
            prop=self.chinese_font
        )
        plt.title(
            f"BERT预训练综合指标图（总Epoch：{num_total_epochs}）",
            fontsize=14, pad=20,
            fontproperties=self.chinese_font
        )
        plt.tight_layout()
        save_path = os.path.join(self.output_dir, "final_training_summary.png")
        plt.savefig(save_path, dpi=300)
        plt.close()
def main():
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("=" * 50)
    print("Initializing BERT tokenizer...")
    tokenizer = BertManualTokenizer()
    # Sanity-check the tokenizer (the result should contain no [UNK] tokens)
    test_text = "Hello, world! This is a test of the Bert tokenizer."
    test_tokens = tokenizer.tokenize(test_text)
    unk_count = test_tokens.count(tokenizer.unk_token)
    print(f"Tokens for the test text: {test_tokens}")
    print(f"Number of [UNK] tokens: {unk_count}")
    if unk_count > 0:
        print("Warning: [UNK] tokens are still present and may hurt training quality!")
    print("=" * 50)
    # Configure the model (lightweight, suitable for CPU or GPU)
    config = BertConfig(
        vocab_size=tokenizer.vocab_size,
        hidden_size=256,  # smaller hidden size for faster training
        max_position_embeddings=128,  # shorter maximum sequence length
        num_hidden_layers=2,  # fewer layers
        num_heads=4,  # fewer attention heads
        intermediate_size=1024,  # smaller feed-forward layer
        dropout=0.1
    )
    print(f"Model config: hidden_size={config.hidden_size}, layers={config.num_hidden_layers}, heads={config.num_heads}")
    # Build the dataset
    try:
        train_dataset = TextDataset(
            txt_file_path="Economic Globalization.txt",
            tokenizer=tokenizer,
            max_seq_len=128
        )
        print(f"Dataset loaded: {len(train_dataset)} samples")
    except Exception as e:
        print(f"Failed to load dataset: {e}")
        return
    # Build the model and train it
    print("\nInitializing BERT pretraining model...")
    bert_model = BertModel(config)
    model = BertForPretraining(bert_model)
    print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
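    # Pick batch_size by device (4 on CPU, 8 on GPU)
    batch_size = 4 if device.type == "cpu" else 8
    # Keep the hyperparameters in one place so the log line and the call cannot drift apart
    num_train_epochs = 3
    learning_rate = 2e-5
    print(f"\nTraining parameters: batch_size={batch_size}, epochs={num_train_epochs}, lr={learning_rate}")
    train_bert(
        model=model,
        train_dataset=train_dataset,
        batch_size=batch_size,
        num_train_epochs=num_train_epochs,
        learning_rate=learning_rate,
        output_dir="./custom_bert",
        vis_output_dir="./bert_visualizations"
    )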
if __name__ == "__main__":
    main()
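
The font lookup in the visualizer above only checks Windows paths and raises an error when none exists. For readers on other systems, the same first-match idea could be extended roughly as sketched below. This is only an illustration: `FONT_CANDIDATES` and `find_font_path` are hypothetical names that do not appear in the original code, and the macOS and Linux paths are assumptions that vary with OS version and installed font packages.

```python
import os
import sys

# Hypothetical per-platform candidates. The two Windows entries come from the
# list above; the macOS and Linux paths are assumptions and may not exist.
FONT_CANDIDATES = {
    "win32": [
        "C:/Windows/Fonts/simsun.ttc",
        "C:/Windows/Fonts/simkai.ttf",
    ],
    "darwin": [
        "/System/Library/Fonts/PingFang.ttc",
    ],
    "linux": [
        "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",
    ],
}


def find_font_path(candidates=None, platform=None):
    """Return the first candidate path that exists, or None if none does."""
    platform = platform if platform is not None else sys.platform
    if candidates is None:
        # Match the platform string against the known keys by prefix.
        key = next((k for k in FONT_CANDIDATES if platform.startswith(k)), None)
        candidates = FONT_CANDIDATES.get(key, [])
    for path in candidates:
        if os.path.exists(path):
            return path
    return None
```

Returning `None` instead of raising lets the caller decide what to do, for example falling back to English axis labels. A path that is found can be handed to `matplotlib.font_manager.FontProperties(fname=...)`.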

V. Sample Training Corpus

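The sample corpus used in this article is shown below; `main()` above reads it from `Economic Globalization.txt`. Any other English text can be substituted.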

Economic globalization refers to the increasing interdependence of world economies through
the cross-border flow of goods, services, technology, capital, and labor. It is not a new phenomenon but has accelerated dramatically
over the past century, reshaping societies, economies, and cultures across the globe. This process has been driven
by a complex interplay of technological advancements, policy shifts, and evolving economic systems, each contributing
to the interconnected world we live in today. To understand economic globalization fully, we must examine its
historical roots, key drivers, multifaceted impacts, and the challenges it presents to nations and communities
worldwide.
The origins of economic globalization can be traced back to ancient trade routes, such as the Silk Road, which
connected distant civilizations through the exchange of spices, textiles, and ideas. However, the modern form of
globalization began to take shape during the 19th century, fueled by the Industrial Revolution. Innovations in
transportation—including steamships and railroads—reduced the cost of moving goods across long distances, while
advancements in communication, such as the telegraph, enabled faster exchange of information. During this era,
European powers expanded their colonial empires, creating global networks of resource extraction and trade that laid
the groundwork for future economic integration. By the late 19th century, the world had seen a surge in international
trade, with goods like cotton, rubber, and metals flowing across continents to feed industrial demand in Europe and
North America.
The early 20th century brought significant disruptions to globalization, including two world wars and the Great
Depression. These crises led to a rise in protectionist policies, as nations imposed high tariffs and trade barriers
to shield their economies from external shocks. For much of the mid-20th century, the world remained divided by
geopolitical tensions, particularly during the Cold War, which created separate economic blocs in the East and West.
However, the end of World War II also sowed the seeds for a new era of globalization. In 1944, representatives from
44 nations gathered in Bretton Woods, New Hampshire, to establish a framework for post-war economic cooperation.
This meeting resulted in the creation of institutions like the International Monetary Fund (IMF) and the World Bank,
designed to stabilize global financial markets and provide loans for reconstruction and development. The General
Agreement on Tariffs and Trade (GATT), established in 1947, further promoted free trade by reducing tariffs through
multilateral negotiations.
The collapse of the Soviet Union in 1991 marked a turning point in economic globalization, as former communist
countries began to integrate into the global economy. This period saw a wave of liberalization, with nations across
Asia, Africa, and Latin America adopting market-oriented reforms, privatizing state-owned enterprises, and opening
their borders to foreign investment. Concurrently, rapid advancements in technology—particularly the internet and
digital communication—revolutionized how businesses operate. The internet enabled instant communication across
borders, allowing companies to manage global supply chains more efficiently and reach customers worldwide. Meanwhile,
breakthroughs in transportation, such as containerization, reduced shipping costs and made it feasible to produce
goods in one country and sell them in another halfway across the world.
One of the most significant drivers of economic globalization has been the rise of multinational corporations (MNCs).
These large enterprises operate in multiple countries, with production facilities, offices, and markets spread across
continents. MNCs seek to maximize profits by leveraging differences in labor costs, resource availability, and
regulatory environments. For example, a company might design a product in the United States, source raw materials
from Africa, assemble components in China, and sell the final product in Europe. This global division of labor allows
firms to reduce costs and increase efficiency, but it also ties economies together, making them vulnerable to
disruptions in any part of the supply chain. Today, MNCs play a dominant role in the global economy, with many
generating revenues larger than the GDP of small nations.
International trade has been a cornerstone of economic globalization, with the volume of global trade growing
exponentially since the 1990s. The World Trade Organization (WTO), established in 1995 to replace GATT, has played a
key role in this expansion by enforcing trade rules, resolving disputes, and negotiating new agreements to reduce
barriers. Regional trade blocs, such as the European Union (EU), the North American Free Trade Agreement (NAFTA,
later replaced by USMCA), and the Association of Southeast Asian Nations (ASEAN), have further integrated markets by
eliminating tariffs and harmonizing regulations among member states. These agreements have facilitated the flow of
goods and services, allowing countries to specialize in the production of goods they can produce most efficiently—a
concept known as comparative advantage. For instance, countries with abundant agricultural land focus on farming,
while those with skilled labor forces specialize in technology and manufacturing.
Financial globalization has also accelerated in recent decades, with capital flowing more freely across borders than
ever before. Advances in financial technology have made it easier for investors to buy stocks, bonds, and other
assets in foreign markets, while multinational banks provide loans and financial services to clients worldwide.
This integration of financial markets has helped channel investment to developing countries, supporting economic
growth and infrastructure development. However, it has also increased the risk of financial contagion, where a crisis
in one country can quickly spread to others. The 2008 global financial crisis, which began with the collapse of the
US housing market, demonstrated this vulnerability, as banks and economies around the world faced severe losses due
to their interconnected financial ties.
Technological diffusion is another critical aspect of economic globalization. Innovations developed in one country
quickly spread to others, driven by trade, foreign investment, and the movement of skilled workers. For example,
advancements in renewable energy technology, such as solar panels and wind turbines, have been adopted globally,
helping nations transition to cleaner energy sources. Similarly, digital technologies like mobile payment systems
and e-commerce platforms have transformed how businesses operate and how consumers interact, even in remote regions.
This spread of technology has the potential to reduce the gap between developed and developing countries, but it also
raises concerns about intellectual property rights and the concentration of technological power in the hands of a few
large corporations.
Economic globalization has brought significant benefits to many countries and communities. For developed nations, it
has provided access to cheaper goods, new markets for exports, and opportunities for investment. Consumers in wealthy
countries can purchase products from around the world at lower prices, increasing their standard of living. For
developing countries, globalization has offered a path to economic growth through export-led industrialization.
Nations like China, South Korea, and Vietnam have lifted millions of people out of poverty by integrating into global
supply chains and attracting foreign investment. These countries have seen rapid industrialization, improved
infrastructure, and rising incomes as they become key players in global trade.
However, the benefits of globalization have not been distributed equally. While some countries and individuals have
thrived, others have been left behind. In developed nations, deindustrialization has occurred as manufacturing jobs
move to countries with lower labor costs, leading to job losses and economic decline in traditional industrial
regions. This has contributed to rising inequality, as workers in low-skill jobs face stagnant wages, while those in
high-skill, knowledge-based industries see their incomes rise. In developing countries, the benefits of globalization
have often been concentrated in urban areas and among educated elites, while rural communities and marginalized groups
remain trapped in poverty. Additionally, some countries have become overly dependent on exports, making their
economies vulnerable to fluctuations in global demand.
Cultural globalization is another byproduct of economic integration, as the flow of goods, media, and people across
borders spreads ideas, values, and cultural practices. Western brands, music, movies, and fast-food chains have
become ubiquitous in many parts of the world, leading to concerns about cultural homogenization. Critics argue that
local traditions, languages, and cuisines are being eroded as global culture dominates. Proponents, however, view
cultural exchange as a positive force, fostering greater understanding and tolerance among diverse societies. The
spread of social media has further accelerated cultural globalization, allowing people to connect with others around
the world and share ideas instantaneously.
Environmental impacts are a growing concern in the era of economic globalization. The increased movement of goods
has led to a surge in carbon emissions from transportation, contributing to climate change. Industrial production,
often concentrated in countries with lax environmental regulations, has caused pollution and deforestation,
affecting local ecosystems and public health. For example, manufacturing hubs in Asia have faced severe air and
water pollution as they produce goods for global markets. On the other hand, globalization has also enabled
international cooperation on environmental issues. Agreements like the Paris Agreement on climate change and the
Montreal Protocol on ozone-depleting substances demonstrate how nations can work together to address global
environmental challenges. Technological innovations for clean energy and sustainable practices are also being
shared globally, offering hope for a more environmentally friendly form of globalization.
Labor markets have been profoundly affected by economic globalization, with both positive and negative consequences.
Workers in developing countries often find new employment opportunities in export-oriented industries, but these
jobs may come with low wages, poor working conditions, and limited labor rights. In contrast, skilled workers in
high-tech and professional fields have benefited from globalization, as their skills are in demand worldwide,
leading to higher salaries and greater mobility. The rise of the gig economy, enabled by digital platforms, has
created new forms of work that transcend national borders, allowing freelancers to offer services to clients around
the globe. However, this has also raised questions about job security, benefits, and labor protections in an
increasingly globalized workforce.
Globalization has also presented challenges to national sovereignty, as countries must often align their policies
with international agreements and global market forces. Governments may feel pressured to reduce regulations, lower
taxes, and cut social spending to attract foreign investment, a phenomenon known as the "race to the bottom." This
can limit a nation’s ability to implement policies that protect workers, the environment, or public health.
International institutions like the WTO and IMF have faced criticism for imposing austerity measures and neoliberal
policies on developing countries as conditions for loans or membership, undermining national autonomy.
The rise of populism and anti-globalization movements in recent years reflects growing discontent with the effects
of economic globalization. In many countries, voters have supported political leaders who promise to protect
national industries, restrict immigration, and renegotiate trade agreements. Examples include the United Kingdom’s
decision to leave the EU (Brexit) and the election of leaders advocating protectionist policies in the United States
and elsewhere. These movements argue that globalization has benefited elites at the expense of ordinary citizens,
eroded national identity, and contributed to social and economic instability. They call for a more inward-looking
approach to economic policy, prioritizing national interests over global integration.
Despite these challenges, economic globalization is likely to remain a defining feature of the global economy,
albeit in a more nuanced form. The COVID-19 pandemic highlighted both the vulnerabilities and resilience of global
supply chains, as disruptions caused by lockdowns led to shortages of essential goods. In response, some countries
and companies have begun to adopt "reshoring" or "nearshoring" strategies, bringing production closer to home to
reduce dependence on distant suppliers. However, the benefits of global trade and cooperation—such as access to
diverse resources, technological innovation, and economic growth—remain too significant to abandon entirely.
The future of economic globalization will depend on efforts to address its shortcomings and create a more inclusive
and sustainable system. This will require stronger global governance to ensure that trade agreements protect workers’
rights, environmental standards, and public health. Investments in education and skills training can help workers
adapt to the changing demands of the global economy, reducing inequality and ensuring that the benefits of
globalization are shared more widely. Promoting fair trade practices, supporting small and medium-sized enterprises,
and providing aid to vulnerable countries can also help create a more balanced global economy.
In conclusion, economic globalization is a complex and multifaceted process that has transformed the world economy
in profound ways. It has driven economic growth, lifted millions out of poverty, and fostered cultural exchange,
but it has also exacerbated inequality, environmental degradation, and social tensions. As we move forward, it is
essential to recognize both the opportunities and challenges of globalization and work together to build a system
that promotes prosperity, equity, and sustainability for all nations and peoples. By addressing its flaws and
harnessing its potential, we can create a more interconnected world that benefits everyone, not just a privileged
few. Economic globalization is not an inevitable force but a human-made system that can be shaped and improved
through cooperation, innovation, and a commitment to shared prosperity.
The role of technology will continue to be central to the evolution of economic globalization. Artificial
intelligence, automation, and the Internet of Things (IoT) are already revolutionizing production processes, making
global supply chains more efficient and responsive. These technologies have the potential to create new industries
and jobs, but they also raise concerns about job displacement and the concentration of power in the hands of tech
giants. Ensuring that technological progress benefits all segments of society will require investments in education,
retraining programs, and policies that promote inclusive growth.
International migration is another key dimension of economic globalization, as workers move across borders in search
of better opportunities. Migration can fill labor shortages in destination countries, boost economic growth, and
create remittance flows that support families and communities in origin countries. However, it also raises issues of
cultural integration, labor exploitation, and political tensions. Developing policies that manage migration humanely,
protect the rights of migrant workers, and address the concerns of host communities is essential for maximizing the
benefits of labor mobility.
Global health crises, such as the COVID-19 pandemic, have underscored the importance of global cooperation in
addressing shared challenges. The rapid spread of the virus across borders demonstrated how interconnected the world
is and how no country can isolate itself from global threats. Vaccines developed in one country were distributed
worldwide, highlighting both the potential of global collaboration and the inequities in access to essential
resources. Strengthening global health systems, improving pandemic preparedness, and ensuring equitable access to
medical technologies will be critical for addressing future global health emergencies.
Education and knowledge sharing are vital for ensuring that all countries can participate fully in the global
economy. Developing countries need access to quality education and technical training to build the skilled
workforces required to compete in global markets. International collaborations in research and development can
accelerate innovation and address global challenges, from climate change to public health. Scholarships, exchange
programs, and partnerships between universities and institutions in different countries can help spread knowledge
and build capacity in developing nations.
Gender equality is an often-overlooked aspect of economic globalization, but it is essential for inclusive growth.
Women have historically been underrepresented in the global workforce, particularly in high-skill and leadership
roles. Promoting gender equality in education, employment, and entrepreneurship can unlock significant economic
potential, as studies have shown that gender-diverse economies are more productive and resilient. Policies that
address gender-based discrimination, provide access to childcare and family-friendly workplace practices, and
support women-owned businesses can help ensure that globalization benefits both men and women.
The role of civil society and non-governmental organizations (NGOs) in shaping globalization is also important.
NGOs advocate for human rights, environmental protection, and social justice, holding governments and corporations
accountable for their actions. They provide essential services to vulnerable communities, raise awareness about the
impacts of globalization, and push for policy reforms that promote sustainability and equity. By amplifying the
voices of marginalized groups, civil society helps ensure that globalization is not driven solely by economic
interests but also by ethical considerations.
In the realm of finance, reforming the global financial system to make it more stable and equitable is crucial.
The 2008 financial crisis exposed weaknesses in global financial regulation, leading to efforts to strengthen
oversight and prevent excessive risk-taking. However, more work is needed to address issues such as tax havens,
capital flight, and the unequal distribution of financial resources. Creating a more transparent and accountable
financial system can reduce the risk of future crises and ensure that capital flows support sustainable development.
Cultural preservation is an important counterbalance to cultural globalization. While cultural exchange enriches
societies, it is also essential to protect and promote local cultures, languages, and traditions. Governments,
communities, and individuals can support cultural preservation through education, funding for cultural institutions,
and policies that promote local art, music, and literature. Celebrating cultural diversity can foster a sense of
identity and belonging, even as societies become more interconnected.
Finally, ethical considerations must guide the future of economic globalization. As nations and corporations pursue
economic growth, they must also consider the long-term impacts of their actions on people and the planet. This
includes adopting sustainable business practices, respecting human rights, and ensuring that economic development
does not come at the expense of future generations. By prioritizing ethics and sustainability, we can create a form
of globalization that is not only economically prosperous but also socially just and environmentally responsible.
In summary, economic globalization is a dynamic and evolving process that presents both opportunities and challenges.
Its future will be shaped by how we address issues of inequality, environmental sustainability, and social justice.
By working together across national borders, embracing innovation, and prioritizing inclusive growth, we can build a
global economy that benefits all people and preserves the planet for future generations. Economic globalization is
not an end in itself but a means to create a more prosperous, peaceful, and interconnected world. With thoughtful
policies, international cooperation, and a commitment to shared values, we can harness the power of globalization to
build a better future for everyone.
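
To make concrete how a hard-wrapped corpus like the one above could feed the next sentence prediction (NSP) task, here is a minimal sketch. It is not the article's `TextDataset` implementation: `split_sentences` and `make_nsp_pairs` are hypothetical helpers, the sentence splitter is deliberately naive (it will also split after abbreviations), and the 50/50 sampling follows the usual description of NSP rather than anything shown in this section.

```python
import random
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    text = " ".join(text.split())  # undo hard line wraps and extra spaces
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def make_nsp_pairs(sentences, rng=None):
    """Build (sentence_a, sentence_b, is_next) triples.

    About half of the pairs keep the true next sentence (is_next=1); the
    others use a randomly chosen different sentence (is_next=0).
    """
    rng = rng if rng is not None else random.Random(0)  # seeded for repeatability
    pairs = []
    for i in range(len(sentences) - 1):
        if len(sentences) < 3 or rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # Exclude the sentence itself and its true successor.
            choices = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            pairs.append((sentences[i], sentences[rng.choice(choices)], 0))
    return pairs


if __name__ == "__main__":
    sample = (
        "Economic globalization is not a new phenomenon.\n"
        "It has accelerated over the past century. It reshapes economies."
    )
    for a, b, label in make_nsp_pairs(split_sentences(sample)):
        print(label, "|", a, "->", b)
```

A real pipeline would tokenize each pair, join it as `[CLS] A [SEP] B [SEP]`, and truncate it to the maximum sequence length (128 in the configuration above).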

VI. Program Run Screenshots

VII. Summary

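BERT (Bidirectional Encoder Representations from Transformers) is a pretrained language model built on the Transformer architecture. It learns deep bidirectional semantic representations through two pretraining tasks: masked language modeling (MLM) and next sentence prediction (NSP). Its core innovation is bidirectional context modeling, which removes the limitation of traditional unidirectional language models. This article covered BERT's algorithmic principles, implementation steps, and training workflow, including text preprocessing, model architecture, training strategy, and visual monitoring. It implemented a lightweight BERT model in PyTorch and demonstrated pretraining on a text about economic globalization. The "pretraining + fine-tuning" paradigm has let BERT perform strongly across many NLP tasks and has made it a major milestone in modern natural language processing.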

posted on 2025-11-05 15:33 blfbuaa