Hugging Face Core Library Components
Hugging Face
Hugging Face is an open-source machine learning platform and community.
| Library | Main Function | Official Docs |
|---|---|---|
| Transformers 🤗 | Thousands of pretrained models for natural language processing (NLP), speech recognition, computer vision, and other tasks. | https://huggingface.co/docs/transformers/ |
| Datasets | Fast, efficient loading and sharing of datasets and evaluation metrics; supports many data formats. | https://huggingface.co/docs/datasets/ |
| Tokenizers | Fast tokenizers supporting multiple tokenization algorithms, used for text preprocessing. | https://huggingface.co/docs/tokenizers/ |
| Accelerate | Simplifies distributed and mixed-precision training across CPUs, GPUs, TPUs, and other hardware with minimal code changes. | https://huggingface.co/docs/accelerate/ |
| PEFT (Parameter-Efficient Fine-Tuning) | Parameter-efficient fine-tuning methods (e.g. LoRA) that adapt large models to downstream tasks by training only a small fraction of parameters, greatly reducing compute cost. | https://huggingface.co/docs/peft/ |
| TRL (Transformer Reinforcement Learning) | A toolkit for training transformer language models with reinforcement learning (e.g. RLHF). | https://huggingface.co/docs/trl/ |
| Diffusers | A library for diffusion models, mainly for image, audio, and 3D-structure generation. | https://huggingface.co/docs/diffusers/ |
Typical workflow: load data with datasets, preprocess it with tokenizers, load a model with transformers, then combine accelerate and peft for efficient fine-tuning.
1. 🤗 Transformers - Core Model Library
Purpose: provides thousands of pretrained models for NLP, speech, vision, and other multimodal tasks
Core features:
- A unified API covering 100+ architectures
- Works with PyTorch, TensorFlow, and JAX
- Includes pipelines to simplify inference
Commonly used classes:
AutoModelForCausalLM                 # auto-loads causal language models (e.g. the GPT family)
AutoModelForSequenceClassification   # auto-loads sequence-classification models
GPT2LMHeadModel                      # GPT-2 model with a language-modeling head
GPT2Config                           # GPT-2 configuration class
Trainer                              # main trainer class
TrainingArguments                    # training-argument configuration
DataCollatorForLanguageModeling      # data collator for language modeling
EarlyStoppingCallback                # early-stopping callback
TrainerCallback                      # base class for trainer callbacks
PreTrainedTokenizerFast              # fast tokenizer
BitsAndBytesConfig                   # quantization config (4-bit/8-bit)
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments
model = GPT2LMHeadModel.from_pretrained("gpt2")
training_args = TrainingArguments(
output_dir="./results",
per_device_train_batch_size=4,
num_train_epochs=3,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
"model_name",
quantization_config=bnb_config,
)
2. 📊 Datasets - Data Processing Library
Purpose: efficient dataset loading and processing
Highlights:
- 1000+ ready-to-use datasets
- Memory mapping for handling very large datasets
- Streaming data processing
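A minimal sketch of loading and streaming (the dataset name here is only an example):
from datasets import load_dataset

# Load a dataset from the Hub (memory-mapped on disk by default)
dataset = load_dataset("imdb", split="train")
print(dataset[0]["text"][:100])

# Stream a large dataset instead of downloading it entirely
streamed = load_dataset("imdb", split="train", streaming=True)
for example in streamed:
    print(example["label"])
    break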
3. 🔠 Tokenizers - Tokenization Library
Purpose: fast, efficient tokenizers
Supported algorithms:
- BPE (Byte-Pair Encoding)
- WordPiece
- SentencePiece
- Custom tokenizers
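A minimal sketch of training a BPE tokenizer from raw text with the tokenizers library (the file path and special tokens are illustrative):
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build a BPE tokenizer and train it on plain-text files (path is a placeholder)
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)
print(tokenizer.encode("Hello, Hugging Face!").tokens)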
4. ⚡ Accelerate - Distributed Training
Purpose: simplifies distributed and mixed-precision training
Advantages:
- One interface for single-node and multi-node multi-GPU training
- Automatic device placement
- TPU training support
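A minimal sketch of wrapping an existing PyTorch training loop (model, optimizer, and dataloader are assumed to be defined elsewhere):
from accelerate import Accelerator

accelerator = Accelerator()  # handles device placement, DDP, and mixed precision
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(**batch).loss
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
The same script can then be started with accelerate launch train.py and runs unchanged on a single GPU, multiple GPUs, or a TPU.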
5. 🎯 PEFT - Parameter-Efficient Fine-Tuning
Purpose: lightweight fine-tuning of large models
Supported methods:
- LoRA (Low-Rank Adaptation)
- Prefix Tuning
- Prompt Tuning
- AdaLoRA
PeftModel  # wrapper class for parameter-efficiently fine-tuned models
from peft import PeftModel
# Load a LoRA (or other PEFT) adapter on top of an already-loaded base model
model = PeftModel.from_pretrained(model, "./lora-adapter")
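The corresponding creation step, wrapping a base model with a LoRA adapter before training, might look roughly like this (the rank and alpha values are illustrative; a full LoraConfig appears in the fine-tuning example further below):
from peft import LoraConfig, get_peft_model, TaskType

# Attach a LoRA adapter; only the adapter weights are trained
lora_config = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.1)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()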
6. 🎮 TRL - Transformer Reinforcement Learning
Purpose: train language models with reinforcement learning
Core algorithms:
- PPO (Proximal Policy Optimization)
- DPO (Direct Preference Optimization)
- GRPO (Group Relative Policy Optimization)
GRPOTrainer  # GRPO trainer (a PPO-derived method)
GRPOConfig   # GRPO training configuration
from trl import GRPOTrainer, GRPOConfig
# GRPOConfig extends TrainingArguments and does not take a model name
grpo_config = GRPOConfig(
    output_dir="./grpo-results",
    learning_rate=1.41e-5,
)
# GRPO also needs at least one reward function that scores generated completions
def reward_len(completions, **kwargs):
    return [float(len(c)) for c in completions]  # toy reward: favor longer completions
trainer = GRPOTrainer(
    model="gpt2",              # a model id or an already-loaded model object
    reward_funcs=reward_len,
    args=grpo_config,
    train_dataset=dataset,
)
7. 🎨 Diffusers - Diffusion Model Library
Purpose: focused on diffusion models for image, audio, and 3D generation
Features:
- The Stable Diffusion family
- Controlled generation (ControlNet)
- Custom samplers
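A minimal text-to-image sketch (the checkpoint id is only an example; any Stable Diffusion-style checkpoint on the Hub works the same way):
import torch
from diffusers import DiffusionPipeline

# Load an example Stable Diffusion checkpoint in half precision
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe.to("cuda")
image = pipe("a photo of an astronaut riding a horse").images[0]
image.save("astronaut.png")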
8. 🌐 Hugging Face Hub - Model Hub
Purpose: a sharing platform for models, datasets, and applications
Features:
- Model hosting and version control
- Dataset sharing
- Spaces (app demos)
- Model cards and evaluations
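A minimal sketch using the huggingface_hub client library (the repo ids are examples; logging in is only needed for private or gated repos and for uploads):
from huggingface_hub import hf_hub_download, snapshot_download, login

login()  # prompts for an access token
config_path = hf_hub_download(repo_id="gpt2", filename="config.json")  # fetch a single file
local_dir = snapshot_download(repo_id="gpt2")                          # fetch a whole repository
The remainder of this section combines the libraries above in a complete sentiment-analysis fine-tuning example.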
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding,
    pipeline,
    EarlyStoppingCallback
)
from datasets import load_dataset, DatasetDict
from peft import LoraConfig, get_peft_model, TaskType, PeftModel
import evaluate
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, classification_report
import pandas as pd
# 1. Data loading
def load_and_preprocess_data():
    """Load and preprocess the sentiment-analysis dataset"""
    # Load the IMDb movie-review dataset
    dataset = load_dataset("imdb")
    # Inspect the dataset structure
    print("Dataset structure:", dataset)
    print("Training samples:", len(dataset['train']))
    print("Test samples:", len(dataset['test']))
    # Look at an example
    print("\nTraining example:")
    print("Text:", dataset['train'][0]['text'][:200])
    print("Label:", dataset['train'][0]['label'])
    return dataset

# Load the data
dataset = load_and_preprocess_data()
# 2. Tokenizer and model initialization
def initialize_tokenizer_and_model():
    """Initialize the tokenizer and the model"""
    # Model name
    model_name = "distilbert-base-uncased"
    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Load the model (2 classes: positive / negative)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=2,
        id2label={0: "negative", 1: "positive"},
        label2id={"negative": 0, "positive": 1}
    )
    return tokenizer, model

tokenizer, model = initialize_tokenizer_and_model()
# 3. Data preprocessing
def tokenize_function(examples):
    """Tokenization function"""
    return tokenizer(
        examples["text"],
        truncation=True,
        padding=False,
        max_length=256,
        return_tensors=None
    )

def preprocess_data(dataset, tokenizer):
    """Data preprocessing pipeline"""
    # Apply tokenization
    tokenized_datasets = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=["text"]  # drop the raw text column
    )
    # Rename the label column to match what Trainer expects
    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
    # Split off a validation set from the training data
    if "validation" not in tokenized_datasets:
        train_test_split = tokenized_datasets["train"].train_test_split(
            test_size=0.1, seed=42
        )
        tokenized_datasets = DatasetDict({
            "train": train_test_split["train"],
            "validation": train_test_split["test"],
            "test": tokenized_datasets["test"]
        })
    print("Preprocessed datasets:", tokenized_datasets)
    return tokenized_datasets

# Preprocess the data
tokenized_datasets = preprocess_data(dataset, tokenizer)
# 4. Evaluation metrics
def compute_metrics(eval_pred):
    """Compute evaluation metrics"""
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average="weighted")
    return {
        "accuracy": accuracy,
        "f1": f1,
    }

# Load the accuracy metric via the evaluate library
accuracy_metric = evaluate.load("accuracy")
# 5. Base model training
def setup_training_args():
    """Set up the training arguments"""
    training_args = TrainingArguments(
        output_dir="./sentiment-analysis-results",
        overwrite_output_dir=True,
        # Training parameters
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        learning_rate=2e-5,
        weight_decay=0.01,
        # Evaluation parameters
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        # Logging parameters
        logging_dir="./logs",
        logging_steps=500,
        report_to="none",  # disable wandb and other external loggers
        # Optimization parameters
        warmup_steps=500,
        fp16=torch.cuda.is_available(),  # enable mixed precision when a GPU is available
    )
    return training_args
def train_model(model, tokenized_datasets, training_args):
    """Train the model"""
    # Data collator (dynamic padding per batch)
    data_collator = DataCollatorWithPadding(
        tokenizer=tokenizer,
        padding=True
    )
    # Create the trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
    )
    # Start training
    print("Starting training...")
    train_result = trainer.train()
    # Save the model
    trainer.save_model()
    tokenizer.save_pretrained(training_args.output_dir)
    # Evaluate on the test set
    eval_results = trainer.evaluate(tokenized_datasets["test"])
    print("Test-set evaluation results:", eval_results)
    return trainer, train_result

# Set up training arguments and start training
training_args = setup_training_args()
trainer, train_result = train_model(model, tokenized_datasets, training_args)
# 6. Fine-tuning with PEFT
def setup_peft_training():
    """Set up PEFT (LoRA) fine-tuning"""
    # LoRA configuration
    lora_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        inference_mode=False,
        r=16,  # rank
        lora_alpha=32,
        lora_dropout=0.1,
        target_modules=["q_lin", "k_lin", "v_lin", "out_lin"]  # DistilBERT attention modules
    )
    # Apply PEFT
    peft_model = get_peft_model(model, lora_config)
    # Print the number of trainable parameters
    peft_model.print_trainable_parameters()
    return peft_model, lora_config

def fine_tune_with_peft(peft_model, tokenized_datasets):
    """Fine-tune with PEFT"""
    # Fine-tuning arguments (fewer epochs; LoRA typically uses a higher learning rate than full fine-tuning)
    peft_training_args = TrainingArguments(
        output_dir="./peft-sentiment-analysis",
        overwrite_output_dir=True,
        num_train_epochs=2,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        learning_rate=1e-4,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        logging_steps=100,
        report_to="none",
    )
    # Data collator
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    # PEFT trainer
    peft_trainer = Trainer(
        model=peft_model,
        args=peft_training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )
    # Start fine-tuning
    print("Starting PEFT fine-tuning...")
    peft_trainer.train()
    # Save the PEFT adapter
    peft_trainer.save_model()
    return peft_trainer

# Set up and run PEFT fine-tuning
peft_model, lora_config = setup_peft_training()
peft_trainer = fine_tune_with_peft(peft_model, tokenized_datasets)
# 7. Model inference and prediction
class SentimentAnalyzer:
    """Sentiment-analysis inference wrapper"""
    def __init__(self, model_path, use_peft=False):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        if use_peft:
            # Load the base model
            base_model = AutoModelForSequenceClassification.from_pretrained(
                "distilbert-base-uncased",
                num_labels=2
            )
            # Load the PEFT adapter on top of the base model
            self.model = PeftModel.from_pretrained(base_model, model_path)
        else:
            self.model = AutoModelForSequenceClassification.from_pretrained(model_path)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        self.model.eval()
        # Create a pipeline
        self.pipeline = pipeline(
            "text-classification",
            model=self.model,
            tokenizer=self.tokenizer,
            device=0 if torch.cuda.is_available() else -1
        )

    def predict_single(self, text):
        """Predict a single text"""
        result = self.pipeline(text)
        return result[0]

    def predict_batch(self, texts):
        """Batch prediction"""
        results = self.pipeline(texts, batch_size=8)
        return results

    def detailed_prediction(self, text):
        """Detailed prediction returning per-class probabilities"""
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            padding=True,
            max_length=256
        ).to(self.device)
        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        confidence = predictions.max().item()
        predicted_class = predictions.argmax().item()
        predicted_label = self.model.config.id2label[predicted_class]
        return {
            "text": text,
            "predicted_label": predicted_label,
            "confidence": confidence,
            "probabilities": {
                self.model.config.id2label[i]: prob.item()
                for i, prob in enumerate(predictions[0])
            }
        }
# 8. Test inference
def test_inference():
    """Test model inference"""
    # Use the model trained above
    analyzer = SentimentAnalyzer("./sentiment-analysis-results")
    # Test texts
    test_texts = [
        "This movie is absolutely fantastic! The acting was brilliant.",
        "Terrible film, wasted my time and money.",
        "It's an okay movie, nothing special but not bad either.",
        "The plot was confusing and the characters were poorly developed."