Notes on LLM Quantization (the AWQ Algorithm)
Why quantize
The model trained earlier on top of Qwen3:30b has close to 60 GB of full-precision weights, which cannot run on a single 4090/5090; after q4 quantization it can be served for inference on one card.
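As a rough back-of-envelope check: ~30B parameters × 2 bytes (bf16) ≈ 60 GB of weights, while 4-bit weights take about 0.5 byte per parameter, i.e. roughly 15 GB plus per-group scales and zero points, which fits in the 24 GB of a 4090 or the 32 GB of a 5090.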
Choosing a quantization approach
- Quantize at load time
The inference engine converts the precision of the weights on the fly as it loads the model, e.g. by passing quantization='awq' (a minimal vLLM-style sketch follows this list).
Problems with this approach:
- Model loading becomes slow
- Precision loss: besides the inherent 16-bit -> 4-bit gap, the AWQ algorithm itself relies on a calibration dataset to preserve accuracy, and the mapping from high-precision values to low-precision values is not necessarily a perfectly symmetric one.
- Pre-quantization
Quantize the model weights ahead of time with a tool such as AutoAWQ, then load the already-quantized model directly at inference time (a rough AutoAWQ sketch also follows this list; the LLM-Compressor flow I actually used is in the next section).
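For option 1, a minimal sketch of what the quantization='awq' flag looks like with a vLLM-style engine. The model path and sampling settings here are placeholders, and whether a given engine can apply a method to unquantized weights at load time depends on the engine and the method:

```python
# Sketch only: select the quantization method when constructing the engine.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-30B-A3B",  # placeholder model path
    quantization="awq",          # quantization method chosen at load time
    dtype="auto",
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```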
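For option 2, a rough sketch of pre-quantizing with AutoAWQ. The paths and quant_config values are illustrative defaults from AutoAWQ's documentation, not the settings used in this note, and whether AutoAWQ supports a given architecture (especially MoE models) needs checking:

```python
# Sketch only: quantize once with AutoAWQ, save, and reuse the quantized checkpoint later.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen3-30B-A3B"         # placeholder source model
quant_path = "./qwen3-30b-a3b-awq-draft"  # placeholder output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # runs AWQ calibration internally
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```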
Quantizing with LLM-Compressor
LLM-Compressor is the model quantization tool in the vLLM ecosystem; it has absorbed the usual quantization algorithms such as AutoAWQ's, so I'm recording the flow here:
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.utils import dispatch_for_generation

# Select model and load it.
MODEL_ID = '/root/private_data/SothisAI/model/Aihub/Qwen3-30B-A3B/main/Qwen3-30B-A3B/'
SAVE_DIR = "/root/private_data/autoawq/output/qwen3-30b-a3b-awq"
# MODEL_ID = '/root/private_data/LLaMA-Factory-main-qwen3coder/saves/Qwen3-30B-A3B-Instruct-2507/full/train_2025-08-02-16-22-26/'
# SAVE_DIR = "/root/private_data/autoawq/output/afsim-coder-30b-awq"

# Select calibration dataset.
DATASET_ID = "/root/private_data/autoawq/calib_data/"  # "mit-han-lab/pile-val-backup"
DATASET_SPLIT = "validation"

# Select number of samples. 256 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
# NUM_CALIBRATION_SAMPLES = 256
# MAX_SEQUENCE_LENGTH = 512
# Increase the number of calibration samples and the sequence length.
NUM_CALIBRATION_SAMPLES = 512  # increased from the default 128 to 512
MAX_SEQUENCE_LENGTH = 4096  # longer sequences

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)


# Wrap each calibration sample in the model's chat template.
def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            [{"role": "user", "content": example["text"]}],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm to run.
# NOTE: vllm currently does not support asym MoE, using symmetric here
# recipe = [
#     AWQModifier(
#         ignore=["lm_head", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
#         scheme="W4A16",
#         targets=["Linear"],
#     ),
# ]
recipe = [
    AWQModifier(
        # ignore=["lm_head", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
        scheme="W4A16",
        targets=["Linear"],
        ignore=[
            "lm_head",
            "embed_tokens",  # also skip the embedding layer
            "re:.*norm.*",  # skip all norm layers
            "re:.*mlp.gate$",
            "re:.*mlp.shared_expert_gate$",
            "re:.*attention.*output.*",  # consider skipping attention output projections
        ],
        # Parameters that could further improve quantization quality
        # group_size=128,  # smaller group size for higher accuracy
        # dampening_frac=0.01,  # adjust the dampening factor
        # max_damp_count=200,  # raise the maximum number of dampening steps
    ),
]

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
One thing to note: if DATASET_ID is not specified explicitly, the script will try to download the default calibration dataset ("mit-han-lab/pile-val-backup"). Setting up a network proxy on the cloud server is inconvenient, so the data has to be downloaded manually in advance and placed in the corresponding directory (a sketch with huggingface_hub follows).
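A sketch of fetching that dataset ahead of time on a machine that does have network access; the target directory mirrors DATASET_ID above, and depending on the repo's file layout, load_dataset may need the data files pointed at explicitly:

```python
# Sketch: download the default calibration dataset once and place it where DATASET_ID points.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mit-han-lab/pile-val-backup",
    repo_type="dataset",
    local_dir="/root/private_data/autoawq/calib_data/",  # same path as DATASET_ID
)
```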
The other piece of the quantization configuration is the recipe:
```python
recipe = [
    AWQModifier(
        # ignore=["lm_head", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
        scheme="W4A16",
        targets=["Linear"],
        ignore=[
            "lm_head",
            "embed_tokens",  # also skip the embedding layer
            "re:.*norm.*",  # skip all norm layers
            "re:.*mlp.gate$",
            "re:.*mlp.shared_expert_gate$",
            "re:.*attention.*output.*",  # consider skipping attention output projections
        ],
        # Parameters that could further improve quantization quality
        # group_size=128,  # smaller group size for higher accuracy
        # dampening_frac=0.01,  # adjust the dampening factor
        # max_damp_count=200,  # raise the maximum number of dampening steps
    ),
]
```
The main field to pay attention to is ignore, which configures the layers excluded from quantization. The official demo lists ["lm_head", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"]; I added a few more patterns to keep the model's performance loss as small as possible (a sketch for checking what the patterns actually match follows).
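To verify what the ignore regexes match, it helps to print the model's real module names first; a small sketch using plain PyTorch (nothing LLM-Compressor-specific), reusing the model object already loaded in the script above:

```python
# Sketch: list the Linear modules so ignore patterns can be checked against real layer names.
import torch.nn as nn

for name, module in model.named_modules():  # `model` as loaded in the script above
    if isinstance(module, nn.Linear):
        print(name, tuple(module.weight.shape))
```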
Model size after quantization
The safetensors files total 15.6 GB, far smaller than the 56 GB of the original Qwen3-30B-A3B. Serving it with vLLM, the entity-extraction, task-decomposition and other features of the existing product all still complete normally, with no obvious quality drop; a proper evaluation of the quantized model on a dedicated test set will come later, once the product features are more complete.
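For reference, a minimal sketch of offline inference on the compressed checkpoint with vLLM; the path is the SAVE_DIR from the script above:

```python
# Sketch: vLLM detects the compressed-tensors quantization config saved by LLM-Compressor.
from vllm import LLM, SamplingParams

llm = LLM(model="/root/private_data/autoawq/output/qwen3-30b-a3b-awq")
outputs = llm.generate(["Hello my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```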

