Notes on LLM Quantization (AWQ)

Why quantize

Earlier I trained a model based on Qwen3:30b. Its full-precision weights come to nearly 60 GB, which will not run properly on a single 4090/5090 card; after q4 quantization, inference fits on a single card.
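A rough sanity check on those sizes (back-of-envelope only, not counting the per-group scale/zero-point metadata that 4-bit formats carry):

# Qwen3-30B-A3B has roughly 30.5B parameters.
params = 30.5e9
print(f"bf16/fp16 weights: {params * 2 / 1e9:.0f} GB")    # ~61 GB
print(f"4-bit weights:     {params * 0.5 / 1e9:.0f} GB")   # ~15 GB, plus quantization metadata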

Choosing a quantization approach

  1. Quantize at load time
    The inference engine converts the model weights to lower precision on the fly while loading them, e.g.
quantization='awq'

Problems with this approach:

  • Model loading is slow
  • Accuracy loss: besides the inherent 16-bit -> 4-bit precision gap, the AWQ algorithm itself relies on a calibration dataset to preserve accuracy. As shown in the figure below, the mapping from high-precision parameter values to low-precision values is not necessarily perfectly symmetric (a toy round-trip example follows this list)
    (figure: mapping from high-precision parameter values to low-precision values)
  2. Pre-quantization
    Quantize the model weights ahead of time with a tool such as autoawq, then load the already-quantized model directly at inference time
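To make the precision-loss point above concrete, here is a toy round trip through a symmetric 4-bit grid. This is a self-contained sketch, not the AWQ algorithm itself: what AWQ adds is using calibration data to rescale channels so that the weights that matter most land closer to the grid points.

import torch

# Pretend this is one channel of fp16 weights.
w = torch.randn(4096) * 0.02
# Naive symmetric 4-bit quantization: integer levels in [-8, 7].
scale = w.abs().max() / 7
q = torch.clamp(torch.round(w / scale), -8, 7)
w_hat = q * scale  # dequantize
print("mean abs round-trip error:", (w - w_hat).abs().mean().item())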

Quantizing with LLM-Compressor

LLM-Compressor is the model-quantization tool in the vLLM ecosystem, and it has absorbed the usual quantization algorithms such as autoawq. The notes below record how I used it.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.utils import dispatch_for_generation


# Select model and load it.
MODEL_ID = '/root/private_data/SothisAI/model/Aihub/Qwen3-30B-A3B/main/Qwen3-30B-A3B/'
SAVE_DIR = "/root/private_data/autoawq/output/qwen3-30b-a3b-awq"
# MODEL_ID = '/root/private_data/LLaMA-Factory-main-qwen3coder/saves/Qwen3-30B-A3B-Instruct-2507/full/train_2025-08-02-16-22-26/'
# SAVE_DIR = "/root/private_data/autoawq/output/afsim-coder-30b-awq"

# Select calibration dataset.
DATASET_ID = "/root/private_data/autoawq/calib_data/" # "mit-han-lab/pile-val-backup"
DATASET_SPLIT = "validation"

# Select number of samples. 256 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
# NUM_CALIBRATION_SAMPLES = 256
# MAX_SEQUENCE_LENGTH = 512
# Increase the number of calibration samples and the sequence length.
NUM_CALIBRATION_SAMPLES = 512  # raised from the default 128 to 512
MAX_SEQUENCE_LENGTH = 4096     # longer sequences

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            [{"role": "user", "content": example["text"]}],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm to run.
# NOTE: vllm currently does not support asym MoE, using symmetric here
# recipe = [
#     AWQModifier(
#         ignore=["lm_head", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
#         scheme="W4A16",
#         targets=["Linear"],
#     ),
# ]
recipe = [
    AWQModifier(
        # ignore=["lm_head", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
        scheme="W4A16",
        targets=["Linear"],
        
        ignore=[
            "lm_head", 
            "embed_tokens",  # 添加embedding层
            "re:.*norm.*",   # 忽略所有norm层
            "re:.*mlp.gate$", 
            "re:.*mlp.shared_expert_gate$",
            "re:.*attention.*output.*",  # 考虑忽略注意力输出层
        ],
        # 添加这些参数来改善量化质量
        # group_size=128,        # 减小group size提高精度
        # dampening_frac=0.01,   # 调整dampening factor
        # max_damp_count=200,    # 增加最大dampening次数
    ),
]

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

One thing to note: if DATASET_ID is not set explicitly, the script will download the default calibration dataset ("mit-han-lab/pile-val-backup"). On a cloud server where setting up a network proxy is inconvenient, you have to download the data manually and put it in the corresponding directory.
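One way to do that manual download (on any machine that does have network access) is to pull the dataset repo with huggingface_hub and then copy it to the directory that DATASET_ID points to; the local path below is just illustrative:

from huggingface_hub import snapshot_download

# Fetch the calibration dataset files locally, then copy them to the server.
snapshot_download(
    repo_id="mit-han-lab/pile-val-backup",
    repo_type="dataset",
    local_dir="./calib_data",
)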
The other piece to configure is the quantization recipe:

recipe = [
    AWQModifier(
        # ignore=["lm_head", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
        scheme="W4A16",
        targets=["Linear"],
        
        ignore=[
            "lm_head", 
            "embed_tokens",  # 添加embedding层
            "re:.*norm.*",   # 忽略所有norm层
            "re:.*mlp.gate$", 
            "re:.*mlp.shared_expert_gate$",
            "re:.*attention.*output.*",  # 考虑忽略注意力输出层
        ],
        # 添加这些参数来改善量化质量
        # group_size=128,        # 减小group size提高精度
        # dampening_frac=0.01,   # 调整dampening factor
        # max_damp_count=200,    # 增加最大dampening次数
    ),
]

The main field to pay attention to is ignore, which lists the layers that should be skipped during quantization. The official demo uses ["lm_head", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"]; I added a few more entries to keep the performance loss of the quantized model as small as possible.
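Because a pattern that matches nothing silently does nothing, it is worth checking the ignore entries against the model's actual module names before kicking off the long oneshot run. The snippet below is only my rough approximation of how the "re:" patterns get resolved, not LLM-Compressor's real matching code; as far as I can tell, Qwen3's attention output projection is named self_attn.o_proj, so a pattern built around "attention...output" may not match anything at all.

import re

ignore = [
    "lm_head",
    "embed_tokens",
    "re:.*norm.*",
    "re:.*mlp.gate$",
    "re:.*mlp.shared_expert_gate$",
    "re:.*attention.*output.*",
]

def is_ignored(name: str) -> bool:
    # Rough approximation: "re:" entries are regexes, plain entries match the
    # last component of the module name.
    for pat in ignore:
        if pat.startswith("re:"):
            if re.match(pat[3:], name):
                return True
        elif name == pat or name.endswith("." + pat):
            return True
    return False

# `model` is the AutoModelForCausalLM loaded in the script above.
ignored = [name for name, _ in model.named_modules() if is_ignored(name)]
print(f"{len(ignored)} modules would be ignored")
print("\n".join(ignored[:20]))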

Model size after quantization

The safetensors files total 15.6 GB, far smaller than the original Qwen3-30B-A3B's 56 GB. Plugged into vLLM for inference, the entity-extraction and task-decomposition features in the existing product all still work correctly, with no obvious drop in quality; a proper evaluation of the quantized model will wait until the product features are more complete and I can run it against a dedicated test set.
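For reference, serving the compressed checkpoint is just a matter of pointing vLLM at SAVE_DIR. A minimal sketch (vLLM picks up the compressed-tensors quantization config saved alongside the weights):

from vllm import LLM, SamplingParams

llm = LLM(model="/root/private_data/autoawq/output/qwen3-30b-a3b-awq")
outputs = llm.generate(["Hello my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)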
