LLM from Scratch: nanoGPT Study Notes (1/2)

Project: nanoGPT
The author is Andrej Karpathy, one of OpenAI's founding members, who walks through LLM pre-training in a very approachable way. There is a matching video on YouTube: Let's build GPT: from scratch, in code, spelled out.
One of the top comments sums it up nicely:

just for fun, dropping on YouTube the best introduction to deep-learning and NLP from scratch so far, for free. Amazing people do amazing things even for a hobby.

A master is a master; great work seems to come to him effortlessly. What makes the project especially impressive is how concise the code is: the entire training logic lives in train.py at only 336 lines,
and the model definition lives in model.py at a bit over 300 lines, yet the model can still load OpenAI's GPT-2 weights.

These notes record my walkthrough of the nanoGPT project, along with some implementation details from its code.

A super-simplified training run

nanoGPT is already a heavily simplified version of the GPT-2 training code, but in case our machines are underpowered or we lack a deep-learning background, Karpathy also thoughtfully prepared a tiny character-level training demo. The dataset is a single 1.1 MB file of Shakespeare's works, and training finished in a few minutes on my RTX 4090. "Character-level" means the training text is split into tokens character by character, which simplifies the pipeline even further.

1. Preparing the data

1.1 Download the tiny Shakespeare dataset

import os
import requests

# download the tiny shakespeare dataset
input_file_path = os.path.join(os.path.dirname(__file__), 'input.txt')
if not os.path.exists(input_file_path):
    data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
    with open(input_file_path, 'w') as f:
        f.write(requests.get(data_url).text)

The downloaded input.txt contains 1,115,394 characters in total and looks like this:

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

...

1.2 Build the vocabulary

Brutally simple: just collect the unique characters...

# read the downloaded text
with open(input_file_path, 'r') as f:
    data = f.read()

# get all the unique characters that occur in this text
chars = sorted(list(set(data)))
vocab_size = len(chars)
print("all the unique characters:", ''.join(chars))
print(f"vocab size: {vocab_size:,}")

The resulting vocab size is 65. For comparison, the Qwen2.5 series has a vocab size of 152,064, so this really is a drastic simplification.

1.3 encode/decode

Each character is encoded as its index in the vocabulary:

# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
def encode(s):
    return [stoi[c] for c in s] # encoder: take a string, output a list of integers
def decode(l):
    return ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
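
As a quick sanity check (my own snippet, not part of prepare.py), any string built only from characters in the vocab should round-trip exactly:

sample = "Speak, speak."
ids = encode(sample)           # one integer per character
assert decode(ids) == sample   # character-level encoding round-trips losslessly
print(ids)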

1.4 Build the train/val splits

A conventional 90/10 split:

# create the train and test splits
n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]

Encode both splits:

train_ids = encode(train_data)
val_ids = encode(val_data)
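
In the actual prepare.py, the encoded ids are then dumped to train.bin / val.bin as uint16 arrays (uint16 is plenty for a vocab of 65), which train.py later memory-maps during training; it also pickles the stoi/itos mappings into meta.pkl so sample.py can decode later. Roughly:

import numpy as np

train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))
val_ids.tofile(os.path.join(os.path.dirname(__file__), 'val.bin'))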

At this point the training data and its tokenization are ready.

2. Training

2.1 The training configuration

Configuration related to input/output paths is omitted here.

eval_interval = 250 # how often to eval the loss and save checkpoints; kept frequent because we will overfit
eval_iters = 200 # number of batches averaged per loss evaluation
always_save_checkpoint = False # we overfit on this tiny dataset, so only save when val loss improves
gradient_accumulation_steps = 1 # gradient accumulation steps, to simulate a larger effective batch size
block_size = 256 # context of up to 256 previous characters

# model architecture
n_layer = 6 # number of transformer layers (blocks)
n_head = 6 # number of attention heads per transformer layer
n_embd = 384 # hidden size

# training hyperparameters
dropout = 0.2
learning_rate = 1e-3 # with a model this small we can afford a relatively large lr
max_iters = 5000
lr_decay_iters = 5000 # per the Chinchilla paper, should match max_iters
min_lr = 1e-4 # per the Chinchilla paper, should be learning_rate / 10
beta2 = 0.99 # AdamW beta2 (second-moment decay); set a bit larger because each iteration sees few tokens
warmup_iters = 100 # smooths the learning rate early on, though Karpathy notes in the repo that it is not really needed
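
For context, warmup_iters, lr_decay_iters and min_lr drive a warmup-then-cosine-decay learning-rate schedule in train.py. A condensed sketch of that get_lr logic (my paraphrase, not verbatim):

import math

def get_lr(it):
    # linear warmup for the first warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # past lr_decay_iters, stay at the minimum learning rate
    if it > lr_decay_iters:
        return min_lr
    # in between: cosine decay from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)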

2.2 The training run

Training is launched with python train.py config/train_shakespeare_char.py (per the nanoGPT README); the log is fairly unremarkable...

step 0: train loss 4.2874, val loss 4.2823
iter 0: loss 4.2654, time 14643.72ms, mfu -100.00%
iter 10: loss 3.2457, time 13.72ms, mfu 27.15%
iter 20: loss 2.7914, time 13.78ms, mfu 27.14%
...
iter 240: loss 2.0815, time 14.78ms, mfu 26.07%
step 250: train loss 1.9670, val loss 2.0605
saving checkpoint to out-shakespeare-char
iter 250: loss 2.0315, time 2256.13ms, mfu 23.48%
iter 260: loss 1.9781, time 13.77ms, mfu 23.84%
...
iter 4970: loss 0.7853, time 15.16ms, mfu 24.48%
iter 4980: loss 0.7933, time 14.08ms, mfu 24.68%
iter 4990: loss 0.8128, time 14.06ms, mfu 24.86%
step 5000: train loss 0.6208, val loss 1.7051
iter 5000: loss 0.8155, time 2388.84ms, mfu 22.39%

Training is done, and the fabled transformer model has been saved under out-shakespeare-char. The ckpt.pt file comes to 129.0 MB (128,986,325 bytes), a baby GPT model indeed 😃
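
A quick way to peek inside the checkpoint (my own snippet; it assumes the dict layout train.py saves, with the weights under a 'model' key):

import torch

ckpt = torch.load('out-shakespeare-char/ckpt.pt', map_location='cpu')
print(list(ckpt.keys()))  # expect keys like model, optimizer, model_args, iter_num, best_val_loss, config
n_elems = sum(t.numel() for t in ckpt['model'].values())
print(f"tensor elements in the saved model state dict: {n_elems:,}")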

3. How does it perform?

Run sample.py in the repo root. Skipping the model-loading code, the inference part looks like this:

start_ids = encode(start) # start was set to 'hi' for my run
x = (torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...])
with torch.no_grad():
    with ctx:
        for k in range(num_samples): # num_samples is 10 here
            y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
            print(decode(y[0].tolist()))
            print('---------------')
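
Before looking at the output, it helps to know what model.generate does: it repeatedly crops the context to block_size, runs a forward pass, scales the last-position logits by temperature, optionally keeps only the top_k logits, and samples the next token. A condensed sketch of that loop (my own paraphrase of GPT.generate in model.py, wrapped as a standalone function):

import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_sketch(model, idx, max_new_tokens, block_size, temperature=1.0, top_k=None):
    # idx: (B, T) tensor of token ids; one new token is appended per step
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]              # keep at most block_size tokens of context
        logits, _ = model(idx_cond)                  # nanoGPT's forward returns (logits, loss)
        logits = logits[:, -1, :] / temperature      # logits for the last position, temperature-scaled
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = float('-inf')  # mask everything outside the top k
        probs = F.softmax(logits, dim=-1)            # turn logits into a probability distribution
        idx_next = torch.multinomial(probs, num_samples=1)  # sample one token id per batch row
        idx = torch.cat((idx, idx_next), dim=1)      # append it and continue
    return idx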

Back to the actual run, the generated samples (quite long):

his friends, you will keep you the crown,
For we are not so dead, nor art thou usurp,
And from her afearful way. But, a wize of was
As man with come as well-a-ways it enter'd with it.

BENVOLIO:
Be it poor of traitor like down upon the correct:
As that she way have not been patiently so triumph,
Because your own driedings with this world eyes,
As if he were to her brands up your honour to shame.

ROMEO:
The same is gross not for your father's ardial thing:
For here come you to your true friar?

ME
---------------
his need and small there's his eye,
That he should not be sometime housed man
Where he were one with confessions.

Provost:
And as he lies as thou ever whether he is discover
As high he to time: in that he were a shrending loves
To the good desiring.

DUKE VINCENTIO:
This is a but weak of levy business,
Whose ladyship is a most dissembling in him.

DUKE VINCENTIO:
No, no, to make one of it.

DUKE VINCENTIO:
These eyes are here well all stamp of some action.

Provost:
Live me to him; and let him se
---------------
his sun, and both
That I can send my uncle my kin,
I cannot say I love thy hand, and myself,
And I will follow thy breath of his like a word,
And all the gods have no prefer'd till thou seest,
Before thy death was my breathed gonew man.

KING RICHARD III:
But doth admitted with me will make groan
All consul my brother's crown'd against the blood
To his princely brother's last beauty as there
As the precious to the land banish of my heed
Seen how my wife tender therein of some world,
Resolved him, 
---------------
hieve so singularity of peace,
Which something that virtue cannot term
At this desire
Of mine army even course to him. Let me win the people!
Where have you been like such gates, he could have been
To our suitted end; and her body-sharpen's blood
As I have crack'd it; to among to our gentleman.
Where's thy fool servant, what thou hast done for a punish
And sitter can of the queen and swords proud how your side
In this this comfort, a serious foreto swift
Both and wing me such a helm here.
If I am 
---------------
his; every well say he is here devil
so not rough removed. Here's the dead more has ever
than the lists is now pardon'd his counsel: I will add mine
ease, both not with mine own blood of night
tomb at my highness be many holy fit
no man. He was a little more pack'd down, let not this
way with death: but it is a comfort, a prince in
his something come a can cost a brother, that the
butcher of come and other, he would not make him good for a
sorrow: we are from the king so princes. If you have seen

---------------
him have plainted by his name,
Could not be tainted, in his appear.

KING RICHARD III:
Ay, by any matter for him.

QUEEN:
Boy, were you a friend of the word, who shall not be so.

KING RICHARD II:
Scarce you for Richmond, were he done your highness
To the courts of Buckingham Henry Lord Hastings' royalty.

KING RICHARD III:
This is a thing so might be it of little.

QUEEN ELIZABETH:
So shame I that like thee, had not thy suit of my woes?

KING RICHARD III:
Lest thou resolved the father's world's t
---------------
his force.

DUKE VINCENTIO:
Away, these heavens are full of sexact of you. I
shall be your cause: and your guests are against our
foot a mystery spirit, a rise in a pair rooties
ended more or wind, and your bones; when I know what you have
your grace were a hard of a pair of man, since
but full of spirit. These worse hours that was set a
doing to present or his sing home, and I will have faced
changed with a soon of canker, that his absolved folly child,
and a man in beasts and of brassing action,
---------------
hile nothing realm of the execution,
On the warged sound's bosometime with heaven,
No fairers of the beholder, blood my neck,
The world is behalf a distemper'd
With his dull danger resolved of the body
Of his head and the banish'd to see his cold sweet love,
And he was never been so dissolved for holish,
And post-hing and high copes his made him so evil.

WARWICK:
You know not seek the fairest nor here the friar soldiers.

WARWICK:
Nor never command the more first shall weep and me.

KING HENRY VI
---------------
his friend or for the better of my master,
Set them like the sea of thine own best what I,
Since they shall have a soldier spice.

Nurse:
And if thou comest the common be more one
To see than the city of the sentence, or know
I'll an our sworn: which is this news? why, then he was
My number sever bears and language from the adultersting
to die, when he was begun to the figure of our highness
And wooed shall be so grown brown, or in his father
In all supper high into the gracious shame of the beast
---------------
hinks the nuptial dream:
I am three and so so, I will have deliver'd thee.
Have with thee! thou gave the tongue of thy birth,
The child, with whether mains proper than thou wouldst be to virtue:
Thy wife can do we writt the course of our coin,
And with prove his liberous of prince-given fy the utmost virtuous
Which can we fight the earth o' the feast,--
And thou dost say thou the need of such intine such shape
That swears to thee, though I have not spoke to thy others,
That like the sea well-may n
---------------

Looking at the results: my prompt was hi (the greeting I usually use when trying out commercial LLMs). A typical large model would respond to the greeting with something like "Hey! How's it going?", but the model we just trained mostly completes the prompt into "his".
That is actually quite reasonable: "his" is very frequent in the training data; a quick search turns up 1,422 occurrences.
And of course our baby GPT has had no instruction tuning, so it cannot follow instructions of any kind.

Even though the generated text contains hardly any coherent sentences, it is still exciting to see a hand-trained LLM (well, perhaps only an LM) that looks the part. Thanks to Karpathy for his generous contribution.
