【happy-llm】How to build your own tokenizer from scratch
https://huggingface.co/learn/llm-course/chapter6/8
Acquiring your corpus
To train our new tokenizer, we will use a small text corpus (so the examples run quickly). The steps for acquiring the corpus are similar to those at the beginning of this chapter, but this time we will use the WikiText-2 dataset:
from datasets import load_dataset
dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")
def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]
The function get_training_corpus() is a generator that yields batches of 1,000 texts, which we will use to train the tokenizer. Tokenizers can also be trained directly on text files. Here is how we can generate a text file containing all the texts/inputs from WikiText-2 for local use:
with open("wikitext-2.txt", "w", encoding="utf-8") as f:
    for i in range(len(dataset)):
        f.write(dataset[i]["text"] + "\n")
Building a WordPiece tokenizer from scratch
To build a tokenizer with the 🤗 Tokenizers library, we start by instantiating a Tokenizer object with a model, then set its normalizer, pre_tokenizer, post_processor, and decoder attributes to the values we want. For this example, we will create a Tokenizer with a WordPiece model:
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
We have to specify the unk_token so the model knows what to return when it encounters characters it has not seen before. Other arguments we could set here include the vocab of our model (we are going to train the model, so we do not need to set this) and max_input_chars_per_word, which specifies a maximum length for each word (words longer than the value passed will be split).
step1 Normalization
The first step of tokenization is normalization, so let's start with that. Since BERT is so widely used, there is a BertNormalizer with the classic options we can set for BERT:
lowercase: required for uncased models; reduces the vocabulary size and improves generalization
strip_accents: recommended; strips accents and diacritics so that character representations are unified
clean_text: cleans control characters and extra whitespace; removes control characters (except \t, \n, \r), merges multiple spaces into one, and drops invalid Unicode characters
handle_chinese_chars: places spaces around Chinese characters
For reference, the bert-base-uncased configuration uses:
# 1. All text converted to lowercase
# 2. Vocabulary size: 30,522 tokens
# 3. Tokenization method: WordPiece
# 4. Maximum sequence length: 512
We can set it as follows:
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
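If you want to spell out the other options from the list above, a sketch might look like this (the values shown are just a typical uncased setup, given as an illustration rather than something you must copy):
tokenizer.normalizer = normalizers.BertNormalizer(
    lowercase=True,             # uncased setup: lowercase everything
    strip_accents=True,         # drop accents and diacritics
    clean_text=True,            # strip control chars, collapse repeated whitespace
    handle_chinese_chars=True,  # put spaces around CJK characters
)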
Generally speaking, however, when building a new tokenizer you won’t have access to such a handy normalizer already implemented in the 🤗 Tokenizers library, so let's see how to build the BERT normalizer by hand:
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
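We can sanity-check the hand-built normalizer with normalize_str (the input string below is just an illustrative example):
print(tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))
# "hello how are u?"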
step2 pre-tokenization
Note that the Whitespace pre-tokenizer splits on whitespace and on any character that is not a letter, digit, or underscore, so in effect it splits on whitespace and punctuation:
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace() # build from scratch
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")
# output
# [('Let', (0, 3)),("'", (3, 4)),('s', (4, 5)),('test', (6, 10)),('my', (11, 13)),
# ('pre', (14, 17)),('-', (17, 18)),('tokenizer', (18, 27)),('.', (27, 28))]
If you only want to split on whitespace (without splitting on punctuation), use the WhitespaceSplit pre-tokenizer instead:
pre_tokenizer = pre_tokenizers.WhitespaceSplit()
pre_tokenizer.pre_tokenize_str("Let's test my pre-tokenizer.")
# output: [("Let's", (0, 5)), ('test', (6, 10)), ('my', (11, 13)), ('pre-tokenizer.', (14, 28))]
step3 running the inputs through the model
We already specified our model at initialization, but it still needs to be trained, which requires a trainer, here a WordPieceTrainer. The main thing to remember when instantiating a trainer in 🤗 Tokenizers is that you need to pass it all the special tokens you intend to use; otherwise it will not add them to the vocabulary, since they do not appear in the training corpus.
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
Train the model with the iterator we defined earlier:
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
The model can also be trained on text files; we reinitialize it with an empty WordPiece first:
tokenizer.model = models.WordPiece(unk_token="[UNK]")
tokenizer.train(["wikitext-2.txt"], trainer=trainer)
Test the tokenizer by calling encode():
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)
# ['let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.']
step4 post-processing
We need to add the [CLS] token at the beginning and the [SEP] token at the end (or after each sentence, if we have a pair of sentences). First, let's look up the IDs of these tokens:
cls_token_id = tokenizer.token_to_id("[CLS]")
sep_token_id = tokenizer.token_to_id("[SEP]")
print(cls_token_id, sep_token_id)
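Since the special tokens were passed to the trainer in the order [UNK], [PAD], [CLS], [SEP], [MASK], they occupy the first IDs of the vocabulary, so this should print:
# 2 3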
We then write a template for the TemplateProcessing: we must specify how to treat a single sentence and a pair of sentences. For both, we write out the special tokens we want to use; the first (or single) sentence is represented by $A, while the second sentence (when encoding a pair) is represented by $B. For each of these (special tokens and sentences), we also specify the corresponding token type ID after a colon.
tokenizer.post_processor = processors.TemplateProcessing(
single=f"[CLS]:0 $A:0 [SEP]:0",
pair=f"[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)
Note that we need to pass along the IDs of the special tokens, so the tokenizer can properly convert them to their IDs. Once this is added, going back to our previous example will give:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)
# ['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer', '.', '[SEP]']
encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences.")
print(encoding.tokens)
print(encoding.type_ids)
# ['[CLS]', 'let', "'", 's', 'test', 'this', 'tok', '##eni', '##zer',
# '...', '[SEP]', 'on', 'a', 'pair', 'of', 'sentences', '.', '[SEP]']
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
step5 decoder
tokenizer.decoder = decoders.WordPiece(prefix="##")
tokenizer.decode(encoding.ids)  # reconstruct the text from the encoded IDs
# "let's test this tokenizer... on a pair of sentences."
Saving and reusing the tokenizer
tokenizer.save("tokenizer.json")
new_tokenizer = Tokenizer.from_file("tokenizer.json")
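As a quick, optional sanity check, the reloaded tokenizer should behave exactly like the original:
# the reloaded tokenizer should produce the same tokens as the original one
sample = "Let's test this tokenizer."
assert new_tokenizer.encode(sample).tokens == tokenizer.encode(sample).tokens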
To use this tokenizer in 🤗 Transformers, we have to wrap it in a PreTrainedTokenizerFast.
- We can use the generic class, or, if our tokenizer corresponds to an existing model, use that model's class (here, BertTokenizerFast).
- If you apply this lesson to build a brand-new tokenizer, you have to use the first option.
To wrap the tokenizer in a PreTrainedTokenizerFast, we can either pass the tokenizer we built as tokenizer_object or pass the saved tokenizer file as tokenizer_file. The key thing to remember is that we have to set all the special tokens manually, since the class cannot infer from the tokenizer object which token is the mask token, the [CLS] token, and so on:
from transformers import PreTrainedTokenizerFast
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    # tokenizer_file="tokenizer.json", # You can load from the tokenizer file, alternatively
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
Or, using the model-specific class:
from transformers import BertTokenizerFast
wrapped_tokenizer = BertTokenizerFast(tokenizer_object=tokenizer)
Building a BPE tokenizer from scratch
Let’s now build a GPT-2 tokenizer. Like for the BERT tokenizer, we start by initializing a Tokenizer with a BPE model:
tokenizer = Tokenizer(models.BPE())
Also like for BERT, we could initialize this model with a vocabulary if we had one (we would need to pass the vocab and merges in this case), but since we will train from scratch, we don’t need to do that. We also don’t need to specify an unk_token because GPT-2 uses byte-level BPE, which doesn’t require it.
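For reference, if you did have a trained vocabulary and merge rules saved to disk, initialization could look roughly like this (the file names and variable name are hypothetical):
# hypothetical files produced by a previous BPE training run
pretrained_tokenizer = Tokenizer(models.BPE.from_file("vocab.json", "merges.txt"))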
GPT-2 does not use a normalizer, so we skip that step and go directly to the pre-tokenization:
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
The option we added to ByteLevel here is to not add a space at the beginning of a sentence (which is the default otherwise). We can have a look at the pre-tokenization of an example text like before:
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test pre-tokenization!")
# [('Let', (0, 3)), ("'s", (3, 5)), ('Ġtest', (5, 10)), ('Ġpre', (10, 14)), ('-', (14, 15)),
# ('tokenization', (15, 27)), ('!', (27, 28))]
Next is the model, which needs training. For GPT-2, the only special token is the end-of-text token:
trainer = trainers.BpeTrainer(vocab_size=25000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
Like with the WordPieceTrainer, as well as the vocab_size and special_tokens, we can specify the min_frequency if we want to, or if we have an end-of-word suffix (like </w>), we can set it with end_of_word_suffix.
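As an illustration of those extra options, a sketch with arbitrary values (and a separate variable name so it does not replace the trainer defined above):
custom_trainer = trainers.BpeTrainer(
    vocab_size=25000,
    min_frequency=2,                   # only merge pairs seen at least twice
    special_tokens=["<|endoftext|>"],
    end_of_word_suffix="</w>",         # suffix appended to word-final tokens
)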
This tokenizer can also be trained on text files:
tokenizer.model = models.BPE()
tokenizer.train(["wikitext-2.txt"], trainer=trainer)
Let’s have a look at the tokenization of a sample text:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)
# ['L', 'et', "'", 's', 'Ġtest', 'Ġthis', 'Ġto', 'ken', 'izer', '.']
We apply the byte-level post-processing for the GPT-2 tokenizer as follows:
tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
The trim_offsets = False option indicates to the post-processor that we should leave the offsets of tokens that begin with ‘Ġ’ as they are: this way the start of the offsets will point to the space before the word, not the first character of the word (since the space is technically part of the token). Let’s have a look at the result with the text we just encoded, where 'Ġtest' is the token at index 4:
sentence = "Let's test this tokenizer."
encoding = tokenizer.encode(sentence)
start, end = encoding.offsets[4]
sentence[start:end]
' test'
Finally, we add a byte-level decoder:
tokenizer.decoder = decoders.ByteLevel()
and we can double-check it works properly:
tokenizer.decode(encoding.ids)
"Let's test this tokenizer."
Great! Now that we’re done, we can save the tokenizer like before, and wrap it in a PreTrainedTokenizerFast or GPT2TokenizerFast if we want to use it in 🤗 Transformers:
from transformers import PreTrainedTokenizerFast
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
)
or:
from transformers import GPT2TokenizerFast
wrapped_tokenizer = GPT2TokenizerFast(tokenizer_object=tokenizer)
Building a Unigram tokenizer from scratch
Let’s now build an XLNet tokenizer. Like for the previous tokenizers, we start by initializing a Tokenizer with a Unigram model:
tokenizer = Tokenizer(models.Unigram())
Again, we could initialize this model with a vocabulary if we had one.
For the normalization, XLNet uses a few replacements (which come from SentencePiece):
from tokenizers import Regex
tokenizer.normalizer = normalizers.Sequence(
    [
        normalizers.Replace("``", '"'),
        normalizers.Replace("''", '"'),
        normalizers.NFKD(),
        normalizers.StripAccents(),
        normalizers.Replace(Regex(" {2,}"), " "),
    ]
)
This replaces `` and '' with " and any sequence of two or more spaces with a single space, as well as removing the accents in the texts to tokenize. The pre-tokenizer to use for any SentencePiece tokenizer is Metaspace:
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
We can have a look at the pre-tokenization of an example text like before:
tokenizer.pre_tokenizer.pre_tokenize_str("Let's test the pre-tokenizer!")
[("▁Let's", (0, 5)), ('▁test', (5, 10)), ('▁the', (10, 14)), ('▁pre-tokenizer!', (14, 29))]
Next is the model, which needs training. XLNet has quite a few special tokens:
special_tokens = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>", "<s>", "</s>"]
trainer = trainers.UnigramTrainer(
    vocab_size=25000, special_tokens=special_tokens, unk_token="<unk>"
)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
A very important argument not to forget for the UnigramTrainer is the unk_token. We can also pass along other arguments specific to the Unigram algorithm, such as the shrinking_factor for each step where we remove tokens (defaults to 0.75) or the max_piece_length to specify the maximum length of a given token (defaults to 16).
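As an illustration of those Unigram-specific options, a sketch with the default values and a separate variable name:
custom_trainer = trainers.UnigramTrainer(
    vocab_size=25000,
    special_tokens=special_tokens,
    unk_token="<unk>",
    shrinking_factor=0.75,   # controls how aggressively the vocabulary is pruned each step
    max_piece_length=16,     # maximum length of a single token
)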
This tokenizer can also be trained on text files:
tokenizer.model = models.Unigram()
tokenizer.train(["wikitext-2.txt"], trainer=trainer)
Let’s have a look at the tokenization of a sample text:
encoding = tokenizer.encode("Let's test this tokenizer.")
print(encoding.tokens)
['▁Let', "'", 's', '▁test', '▁this', '▁to', 'ken', 'izer', '.']
A peculiarity of XLNet is that it puts the <cls> token at the end of the sentence, with a token type ID of 2 (to distinguish it from the other tokens), and as a result it pads on the left. We can handle all the special tokens and token type IDs with a template, like for BERT, but first we have to look up the IDs of the <cls> and <sep> tokens:
cls_token_id = tokenizer.token_to_id("<cls>")
sep_token_id = tokenizer.token_to_id("<sep>")
print(cls_token_id, sep_token_id)
# 0 1
The template looks like this:
tokenizer.post_processor = processors.TemplateProcessing(
single="$A:0 <sep>:0 <cls>:2",
pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
special_tokens=[("<sep>", sep_token_id), ("<cls>", cls_token_id)],
)
And we can test it works by encoding a pair of sentences:
encoding = tokenizer.encode("Let's test this tokenizer...", "on a pair of sentences!")
print(encoding.tokens)
print(encoding.type_ids)
['▁Let', "'", 's', '▁test', '▁this', '▁to', 'ken', 'izer', '.', '.', '.', '<sep>', '▁', 'on', '▁', 'a', '▁pair',
'▁of', '▁sentence', 's', '!', '<sep>', '<cls>']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]
Finally, we add a Metaspace decoder:
tokenizer.decoder = decoders.Metaspace()
and we’re done with this tokenizer! We can save the tokenizer like before, and wrap it in a PreTrainedTokenizerFast or XLNetTokenizerFast if we want to use it in 🤗 Transformers. One thing to note when using PreTrainedTokenizerFast is that on top of the special tokens, we need to tell the 🤗 Transformers library to pad on the left:
from transformers import PreTrainedTokenizerFast
wrapped_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
    cls_token="<cls>",
    sep_token="<sep>",
    mask_token="<mask>",
    padding_side="left",
)
Or alternatively:
from transformers import XLNetTokenizerFast
wrapped_tokenizer = XLNetTokenizerFast(tokenizer_object=tokenizer)
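A quick usage sketch showing the effect of left padding (the example sentences are arbitrary):
batch = wrapped_tokenizer(["Hello", "A somewhat longer sentence"], padding=True)
print(batch["input_ids"])       # the shorter sequence is padded on the left
print(batch["attention_mask"])  # padded positions get attention mask 0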
Now that you have seen how the various building blocks are used to build existing tokenizers, you should be able to write any tokenizer you want with the 🤗 Tokenizers library and be able to use it in 🤗 Transformers.
