How do you prune the vocab (vocabulary) of an LLM (large language model)?
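Part1 Introduction
Multilingual large language models often have a very large vocabulary. When using such a model downstream, we may not need most of its languages, for example when only Chinese and English are required. In that case we can prune its vocab, which greatly reduces the parameter count while preserving the model's performance. Below, the Bloom model is used as an example to show how this is done.
The code comes from: https://github.com/yangjianxin1/LLMPruner
Part2 Pruning Bloom's vocab
First, a small text generation example with Bloom: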
from transformers import BloomTokenizerFast, BloomForCausalLM
model_name = "bigscience/bloom-560m"
tokenizer = BloomTokenizerFast.from_pretrained(model_name)
model = BloomForCausalLM.from_pretrained(model_name)
print(tokenizer.batch_decode(model.generate(tokenizer.encode('长风破浪会有时', return_tensors='pt'), max_length=64)))
# ['长风破浪会有时,直挂云帆济沧海。 愿你,在人生的旅途中,能遇见最美的风景,遇见最美的自己。</s>']
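Before pruning, open the tokenizer.json file of bloom-560m on Hugging Face: the tokens in it look like garbled text, yet the model clearly generates Chinese. What is going on? The tokenizer applies a further encoding to these tokens: it is a byte-level BPE, so each token is stored as the UTF-8 bytes of the text, with every byte displayed as one printable character. Interested readers can look into the details; a quick example shows the effect: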
print(tokenizer("长风破浪会有时"))
# {'input_ids': [2523, 6295, 8238, 19490, 954, 39509], 'attention_mask': [1, 1, 1, 1, 1, 1]}
for i in [2523, 6295, 8238, 19490, 954, 39509]:
    print(tokenizer.decode([i]), tokenizer.convert_ids_to_tokens(i))
"""
长 éķ¿
风 é£İ
破 çł´
浪 æµª
会 ä¼ļ
有时 æľīæĹ¶
"""
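The "garbled" token strings can be reproduced by hand. The sketch below is a minimal illustration, assuming Bloom's tokenizer uses the same byte-to-character table as GPT-2's byte-level BPE (the token strings printed above are consistent with that assumption). The helper names `to_token_string` and `from_token_string` are made up for this example and are not part of any library.

```python
def bytes_to_unicode():
    """Byte -> printable character table in the style of GPT-2's byte-level BPE."""
    # Bytes that are already printable keep their own code point.
    printable = (list(range(ord("!"), ord("~") + 1))
                 + list(range(0xA1, 0xAC + 1))
                 + list(range(0xAE, 0xFF + 1)))
    mapping = {b: chr(b) for b in printable}
    # All remaining bytes are shifted to code points starting at 256.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_token_string(text):
    # Hypothetical helper: text -> the string form stored in tokenizer.json.
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_token_string(token):
    # Hypothetical helper: stored string form -> readable text.
    return bytes(CHAR_TO_BYTE[c] for c in token).decode("utf-8")


print(to_token_string("长"))      # éķ¿
print(from_token_string("é£İ"))  # 风
```

This is why a pruned vocabulary cannot be chosen by simply reading tokenizer.json: each entry has to be mapped back to bytes before you can tell which language it belongs to.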
Next, prepare your own pruned tokenizer.json in the same format as the original tokenizer.json; a ready-made one can be found here: bloom-396m-zh. Then comes the conversion code:
import os.path
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from tqdm import tqdm
class VocabularyPruner(object):

    def check(self, old_model_name_or_path, new_model_name_or_path, text):
        # Check that the generated text is identical before and after pruning
        max_length = 20

        # Encode the text and generate with the original model
        old_model = AutoModelForCausalLM.from_pretrained(old_model_name_or_path)
        old_tokenizer = AutoTokenizer.from_pretrained(old_model_name_or_path)
        old_input_ids = old_tokenizer(text, return_tensors='pt').input_ids
        old_output = old_model.generate(old_input_ids, max_length=max_length)
        old_output_text = old_tokenizer.batch_decode(old_output)
        print('old_output:{}'.format(old_output_text))

        # Encode the text and generate with the pruned model
        new_model = AutoModelForCausalLM.from_pretrained(new_model_name_or_path)
        new_tokenizer = AutoTokenizer.from_pretrained(new_model_name_or_path)
        new_input_ids = new_tokenizer(text, return_tensors='pt').input_ids
        new_output = new_model.generate(new_input_ids, max_length=max_length)
        new_output_text = new_tokenizer.batch_decode(new_output)
        print('new_output:{}'.format(new_output_text))

        if old_output_text == new_output_text:
            print('output is same, succeed to prune.')
        else:
            print('output is not same, fail to prune.')

    def update_ebeddings(self, model, new2old_token_id, new_embeds, new_lm_head):
        # Model-specific step; subclasses copy the kept rows into the model
        raise NotImplementedError
    def prune(self, model_name_or_path, new_tokenizer_name_or_path, save_path, new_name_or_path=None):
        # Create the output directory
        os.makedirs(save_path, exist_ok=True)

        # Load the new vocab, e.g. a Chinese-only vocab
        new_tokenizer = AutoTokenizer.from_pretrained(new_tokenizer_name_or_path)
        # Load the original vocab, usually that of a multilingual model
        old_tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

        # Check that the new vocab is a subset of the original vocab
        old_vocab = old_tokenizer.vocab
        new_vocab = new_tokenizer.vocab
        for token in tqdm(new_vocab.keys()):
            if token not in old_vocab:
                raise Exception('{} not exist'.format(token))
        print('new_tokenizer is subset of old_tokenizer')

        # Map every token_id in the new vocab to its token_id in the original vocab
        new2old_token_id = {}
        for token, token_id in tqdm(new_vocab.items()):
            old_token_id = old_vocab[token]
            new2old_token_id[token_id] = old_token_id

        # Load the multilingual model
        model = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype='auto')
        # Count the parameters of the original model
        old_params = sum(p.numel() for p in model.parameters())
        print("Total params of original model: %.2fM" % (old_params / 1e6))

        # For every token in the new vocab, its weights are copied into these new layers
        vocab_size = len(new_tokenizer)
        hidden_size = model.config.hidden_size
        new_embeds = torch.nn.Embedding(vocab_size, hidden_size, dtype=model.dtype)
        new_lm_head = torch.nn.Linear(in_features=hidden_size, out_features=vocab_size, bias=False, dtype=model.dtype)
        # Update the vocab-dependent weights
        self.update_ebeddings(model, new2old_token_id, new_embeds, new_lm_head)

        model.config.__dict__['vocab_size'] = vocab_size
        if new_name_or_path is not None:
            model.config.__dict__['_name_or_path'] = new_name_or_path

        # Count the parameters of the pruned model
        new_params = sum(p.numel() for p in model.parameters())
        print("Total params of new model : %.2fM" % (new_params / 1e6))

        # Prints "the vocab shrinks to X% of the original"
        print('词表缩小为原来的:{}%'.format(round(len(new_tokenizer) / len(old_tokenizer), 4)*100))
        # Prints "the parameter count shrinks to X% of the original"
        print('模型参数量缩小为原来的:{}%'.format(round(new_params / old_params, 4)*100))
        model.save_pretrained(save_path)
        new_tokenizer.save_pretrained(save_path)
class BloomVocabularyPruner(VocabularyPruner):

    def update_ebeddings(self, model, new2old_token_id, new_embeds, new_lm_head):
        for token_id, old_token_id in tqdm(new2old_token_id.items()):
            new_embeds.weight.data[token_id] = model.transformer.word_embeddings.weight.data[old_token_id]
            new_lm_head.weight.data[token_id] = model.lm_head.weight.data[old_token_id]
        model.transformer.word_embeddings.weight = new_embeds.weight
        model.lm_head.weight = new_lm_head.weight
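To make the row copying in `update_ebeddings` concrete, here is a toy sketch in plain Python with no model involved. The vocabularies, the 2-dimensional "embeddings" and the helper names `build_new2old` and `gather_rows` are all invented for illustration, and the sketch assumes the new token ids are contiguous from 0.

```python
def build_new2old(old_vocab, new_vocab):
    """Map each new token id to the id of the same token in the old vocab."""
    missing = [tok for tok in new_vocab if tok not in old_vocab]
    if missing:
        raise ValueError("tokens not in the original vocab: {}".format(missing))
    return {new_id: old_vocab[tok] for tok, new_id in new_vocab.items()}


def gather_rows(old_matrix, new2old):
    """Build the pruned matrix: row i is a copy of old row new2old[i]."""
    return [list(old_matrix[new2old[new_id]]) for new_id in range(len(new2old))]


# Toy data: 5 tokens in the original vocab, 3 of them kept.
old_vocab = {"<s>": 0, "a": 1, "b": 2, "c": 3, "d": 4}
old_embeddings = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
new_vocab = {"<s>": 0, "b": 1, "d": 2}

new2old = build_new2old(old_vocab, new_vocab)
new_embeddings = gather_rows(old_embeddings, new2old)
print(new2old)         # {0: 0, 1: 2, 2: 4}
print(new_embeddings)  # [[0.0, 0.1], [2.0, 2.1], [4.0, 4.1]]
```

Because every kept token receives exactly the vector it had before, the pruned model should produce the same output as the original for any text that only uses kept tokens.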
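In short, for every token in the new vocab we look up the parameters of the same token in the original model, and then replace the parameters of the input embedding layer and of the final lm_head layer with them. Now run the conversion: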
# Path of the model to prune
model_name_or_path = 'bigscience/bloom-560m'
# Path of the vocab you prepared yourself
new_tokenizer_name_or_path = 'YeungNLP/bloom-396m-zh'
save_path = 'path-to-save'
pruner = BloomVocabularyPruner()
# Prune
pruner.prune(model_name_or_path, new_tokenizer_name_or_path, save_path)
# Check that the pruned model behaves the same as the original model
pruner.check(model_name_or_path, save_path, text='长风破浪会有时')
Result:
100%|██████████| 46145/46145 [00:00<00:00, 1309531.65it/s]
new_tokenizer is subset of old_tokenizer
100%|██████████| 46145/46145 [00:00<00:00, 1120687.88it/s]
Total params of original model: 559.21M
100%|██████████| 46145/46145 [00:01<00:00, 41641.55it/s]
Total params of new model : 396.82M
词表缩小为原来的:18.41%
模型参数量缩小为原来的:70.96000000000001%
old_output:['长风破浪会有时,直挂云帆济沧海。 愿你,在人生的旅途中,能遇见最美的风景,遇见最美的自己。</s>']
new_output:['长风破浪会有时,直挂云帆济沧海。 愿你,在人生的旅途中,能遇见最美的风景,遇见最美的自己。</s>']
output is same, succeed to prune.
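The vocab shrinks to 18.41% of its original size, but the parameter count only drops to about 71%. The back-of-the-envelope check below suggests why. It rests on assumptions that are not stated in the log: that bloom-560m has hidden size 1024, 250880 embedding rows and 559,214,592 parameters in total, that the original model shares one matrix between the input embedding and lm_head, and that the pruned model is counted with two separate matrices. Under these assumptions the numbers match the log.

```python
# All constants below are assumptions about bloom-560m, not values read from the log.
hidden_size = 1024
old_vocab_rows = 250880      # rows of the original embedding matrix
new_vocab_rows = 46145       # tokens kept in the pruned vocab (from the progress bars)
old_params = 559_214_592     # total parameters of the original model

old_embedding = old_vocab_rows * hidden_size
new_embedding = new_vocab_rows * hidden_size

# Original model: embedding and lm_head assumed to share one matrix (counted once).
non_vocab_params = old_params - old_embedding
# Pruned model: embedding and lm_head assumed to be two separate matrices.
new_params = non_vocab_params + 2 * new_embedding

print("%.2fM" % (new_params / 1e6))                 # 396.82M
print("%.2f%%" % (100 * new_params / old_params))   # 70.96%
```

If this reading is right, most of the remaining parameters sit in the transformer layers, which pruning the vocab does not touch.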
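Part3 Additional notes
Other multilingual models can be pruned in the same way. One point that may need attention:
Keep the indices of the special tokens the same as in the original model wherever possible.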
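The check on special tokens can be sketched as a small comparison of two token-to-id dictionaries. The function name `special_token_mismatches` and the toy vocabularies are hypothetical. With Hugging Face tokenizers, the dictionaries could come from `get_vocab()` and the token list from `all_special_tokens`.

```python
def special_token_mismatches(old_vocab, new_vocab, special_tokens):
    """Return {token: (old_id, new_id)} for special tokens whose id changed."""
    mismatches = {}
    for token in special_tokens:
        old_id = old_vocab.get(token)   # None if the token is absent
        new_id = new_vocab.get(token)
        if old_id != new_id:
            mismatches[token] = (old_id, new_id)
    return mismatches


# Toy vocabularies: "<pad>" moved from id 3 to id 4 in the pruned vocab.
old_vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "<pad>": 3, "a": 4, "b": 5}
new_vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "b": 3, "<pad>": 4}

print(special_token_mismatches(old_vocab, new_vocab, ["<unk>", "<s>", "</s>", "<pad>"]))
# {'<pad>': (3, 4)}
```

An empty result means the ids stored in the model config (such as `bos_token_id`, `eos_token_id` and `pad_token_id`) still point at the right tokens. Otherwise those config fields would need to be updated as well.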