How to Calculate the Parameter Count of a Transformer Model

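Suppose we have a Transformer model from HuggingFace.

How do we determine:

its number of parameters?
its memory requirements?
its network structure?

We will use GPT2 as the running example.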

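from transformers import GPT2Model
# Local checkpoint directory; the raw string keeps the Windows backslashes literal.
model = GPT2Model.from_pretrained(r'D:\gitrepos\gpt2')

def count_params(model, is_human: bool = False):
    """Count trainable parameters; return e.g. '124.44M' if is_human, else an int."""
    params: int = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return f"{params / 1e6:.2f}M" if is_human else params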

print(model)
print("Total # of params:", count_params(model, is_human=True))

GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-11): 12 x GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2Attention(
        (c_attn): Conv1D(nf=2304, nx=768)
        (c_proj): Conv1D(nf=768, nx=768)
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D(nf=3072, nx=768)
        (c_proj): Conv1D(nf=768, nx=3072)
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
Total # of params: 124.44M
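To make clear what the sum over `numel()` does, here is a minimal sketch in plain Python that needs neither PyTorch nor the checkpoint. It multiplies out the shape of each tensor and adds the results. The helper name `count_from_shapes` and the tensor names in the dictionary are illustrative assumptions; the shapes themselves are read off the printed structure above, and only a small subset of the model is listed.

```python
from math import prod

# Shapes read off the printed GPT2 structure; a LayerNorm with
# elementwise_affine=True holds one weight vector and one bias vector.
# The key names are illustrative, not taken from the checkpoint.
shapes = {
    "wte.weight": (50257, 768),
    "wpe.weight": (1024, 768),
    "ln_f.weight": (768,),
    "ln_f.bias": (768,),
}

def count_from_shapes(shapes: dict) -> int:
    """Sum the element counts of all tensors, like sum(p.numel() for p in ...)."""
    return sum(prod(shape) for shape in shapes.values())

subset_total = count_from_shapes(shapes)
print("Params in this subset:", subset_total)
```

This covers only the two embeddings and the final LayerNorm, so the result is far below the full 124.44M.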

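So GPT2 has 124.44 million trainable parameters in total.
The number looks simple enough, but where does it come from?
Let's go through the model layer by layer and then derive a general formula for GPT2's parameter count.

Layer-by-Layer Parameter Analysis

First, we need to define a few variables:

\(V\) = vocabulary size, i.e. the number of distinct tokens (50257 for GPT2)
\(E\) = size of the embedding vector (768 for GPT2)
\(P\) = maximum sequence length the model can handle (1024 for GPT2)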

Embedding Layer

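We start with the first two layers of GPT2, wte and wpe.

wte is an Embedding layer that converts input tokens into embeddings. It is a matrix of shape \((V,E) = (50257, 768)\). In other words, there are 50257 tokens, and each token is represented by 768 floats.

Params = \(V\times E\) = 50257 * 768 = 38597376

wpe is also an Embedding layer. It converts the position of each input token into an embedding. It is a matrix of shape \((P, E) = (1024, 768)\), which means the model can handle at most 1024 tokens, and each position is likewise represented by 768 floats:

Params = \(P\times E\) = 1024 * 768 = 786432

The outputs of the two layers are added to produce the position-aware embedding of the input tokens. We can verify the calculation with a bit of code: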

V: int = model.config.vocab_size
E: int = model.config.n_embd
P: int = model.config.n_positions
expected_wte = V * E
expected_wpe: int = P * E
print(f"wte | Expected: {expected_wte}")
print(f"wte | True:     {count_params(model._modules['wte'])}")
print(f"wpe | Expected: {expected_wpe}")
print(f"wpe | True:     {count_params(model._modules['wpe'])}")

wte | Expected: 38597376
wte | True:     38597376
wpe | Expected: 786432
wpe | True:     786432
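The "lookup, then add" step can be illustrated with a toy sketch in plain Python. The sizes (`V_toy`, `P_toy`, `E_toy`), the table contents, and the `embed` helper are all made up for illustration; a real model learns the table values.

```python
# Toy sizes, assumptions for illustration only: 5 tokens, 4 positions, width 3.
V_toy, P_toy, E_toy = 5, 4, 3

# Hypothetical lookup tables standing in for wte and wpe.
wte_toy = [[float(t * 10 + d) for d in range(E_toy)] for t in range(V_toy)]
wpe_toy = [[0.1 * (p + 1)] * E_toy for p in range(P_toy)]

def embed(token_ids):
    """Look up the token row and the position row, then add them element-wise."""
    if len(token_ids) > P_toy:
        raise ValueError("sequence is longer than the position table")
    return [
        [tok + pos for tok, pos in zip(wte_toy[t], wpe_toy[i])]
        for i, t in enumerate(token_ids)
    ]

out = embed([3, 0, 3])                        # token 3 appears at positions 0 and 2
n_params = V_toy * E_toy + P_toy * E_toy      # same V*E + P*E rule as above
print(len(out), len(out[0]), n_params)
```

The same token at two different positions ends up with two different vectors, which is what "position-aware" means here.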

Transformer Layers

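Next comes the most important part, the transformer layers. They are labeled h in the printout, and each one is a GPT2Block. There are 12 of them with identical structure, so we only need to analyze one GPT2Block and multiply by 12.

\[y = \frac{x-\mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]+\epsilon}}\cdot\gamma+\beta \]

ln_1 is a LayerNorm layer, defined by the formula above. It normalizes the input before attention is computed: for each token, the values across the embedding dimension are rescaled to mean 0 and variance 1. eps=1e-5 is added to the variance in the denominator to avoid division by zero. elementwise_affine=True means the layer learns a scale \(\gamma\) and a bias \(\beta\) for every embedding dimension. \(\mathrm{E}[x]\) and \(\mathrm{Var}[x]\) are the mean and variance of the input (here \(\mathrm{E}[\cdot]\) denotes expectation, not the embedding size \(E\)). So the only learnable variables are \(\gamma\) and \(\beta\), each of size \(E=768\).

Params = 2 * E = 1536. We can verify this with the following code: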

expected_ln_1 = 2 * E
print(f"ln_1 | Expected: {expected_ln_1}")
print(f"ln_1 | True:     {count_params(model._modules['h'][0].ln_1)}")

ln_1 | Expected: 1536
ln_1 | True:     1536
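To make the LayerNorm formula concrete, here is a small plain-Python sketch of it for a single vector. It is not the PyTorch implementation; the toy width `E_toy` and the initial values of `gamma` and `beta` (ones and zeros) are assumptions for illustration.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one vector to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)   # biased (population) variance
    return [(v - mean) / math.sqrt(var + eps) * g + b
            for v, g, b in zip(x, gamma, beta)]

E_toy = 4                      # toy embedding size, illustration only
gamma = [1.0] * E_toy          # learnable scale, one value per dimension
beta = [0.0] * E_toy           # learnable bias, one value per dimension

y = layer_norm([1.0, 2.0, 3.0, 4.0], gamma, beta)
ln_params = len(gamma) + len(beta)   # 2 * E learnable values in total
print(y, ln_params)
```

The mean and variance are computed from the input on the fly, so they contribute no parameters; only `gamma` and `beta` are stored.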

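attn is a GPT2Attention layer, a self-attention module that computes the self-attention scores over the token sequence. It contains the following four sub-layers:

c_attn is a Conv1D layer. It behaves like a linear layer, except that its weight is stored transposed. It projects the input into the Q, K and V matrices needed to compute attention. It is a matrix of size \(E\) (768) by \(3 * E\) (2304), plus a bias of size \(3 * E\) (2304). The factor of 3 is there because attention takes three inputs, Q, K and V, each a vector of size \(E\) (768), so \(3 * E\) (2304) elements have to be produced in total.
c_proj is a Conv1D layer. It merges the outputs of the attention heads (GPT2 has \(12\) heads, so the embedding dimension \(768\) is split evenly into per-head outputs of \(64\) dimensions). It is a matrix of size \(E \times E = (768, 768)\), plus a bias vector of size \(E\) (768).
attn_dropout is a Dropout layer. It randomly zeroes post-attention activations (\(p=0.1\)). It has no trainable parameters.
resid_dropout is a Dropout layer. It randomly zeroes post-projection activations (\(p=0.1\)). It has no trainable parameters.

Params = c_attn + c_proj + attn_dropout + resid_dropout = \([E\times (3 \times E)+(3\times E)] + [E \times E + E] + 0 + 0 = 4E^2+4E=4*768^2+4*768=2362368\). We can verify this with the following code: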

expected_c_attn = E * (3 * E) + (3 * E)
expected_c_proj = E * E + E
expected_attn_dropout = 0
expected_resid_dropout = 0
expected_attn = expected_c_attn + expected_c_proj + expected_attn_dropout + expected_resid_dropout
print(f"c_attn | Expected: {expected_c_attn}")
print(f"c_attn | True:     {count_params(model._modules['h'][0].attn.c_attn)}")
print(f"c_proj | Expected: {expected_c_proj}")
print(f"c_proj | True:     {count_params(model._modules['h'][0].attn.c_proj)}")
print(f"attn_dropout | Expected: {expected_attn_dropout}")
print(f"attn_dropout | True:     {count_params(model._modules['h'][0].attn.attn_dropout)}")
print(f"resid_dropout | Expected: {expected_resid_dropout}")
print(f"resid_dropout | True:     {count_params(model._modules['h'][0].attn.resid_dropout)}")
print(f"attn | Expected: {expected_attn}")
print(f"attn | True:     {count_params(model._modules['h'][0].attn)}")

c_attn | Expected: 1771776
c_attn | True:     1771776
c_proj | Expected: 590592
c_proj | True:     590592
attn_dropout | Expected: 0
attn_dropout | True:     0
resid_dropout | Expected: 0
resid_dropout | True:     0
attn | Expected: 2362368
attn | True:     2362368
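Every Conv1D in the printout follows the same counting rule: an (nx, nf) weight plus nf biases. The sketch below captures that rule in a small helper and reuses it for the attention block. The helper name `conv1d_params` is an assumption introduced here, not part of the transformers library.

```python
def conv1d_params(nx: int, nf: int) -> int:
    """Parameters of a Conv1D(nf=..., nx=...) layer: (nx, nf) weight plus nf biases."""
    return nx * nf + nf

E = 768        # embedding size of GPT2
n_head = 12    # number of attention heads in GPT2

c_attn = conv1d_params(nx=E, nf=3 * E)   # produces Q, K and V in one projection
c_proj = conv1d_params(nx=E, nf=E)       # merges the heads back to width E
attn_total = c_attn + c_proj             # the two dropout layers add nothing

head_dim = E // n_head                   # width of each head's slice
print(c_attn, c_proj, attn_total, head_dim)
```

Note that the number of heads does not appear in the parameter count: splitting 768 dimensions into 12 heads of 64 only changes how the projected vectors are sliced, not how many weights there are.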

ln_2 is another LayerNorm layer. It does the same thing as ln_1.
Params = 2 * E = 2 * 768 = 1536

expected_ln_2 = 2 * E
print(f"ln_2 | Expected: {expected_ln_2}")
print(f"ln_2 | True:     {count_params(model._modules['h'][0].ln_2)}")

ln_2 | Expected: 1536
ln_2 | True:     1536

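We now define one more variable:

\(H\) = size of the hidden layer of the feed-forward network inside each transformer block (3072 for GPT2)

mlp is a GPT2MLP layer, also called the feed-forward layer. It is the most compute-heavy part of the transformer, and it likewise contains four sub-layers:

c_fc is a Conv1D that can be viewed as a Linear layer. It up-projects the output of the attention layer into a feature space of size \(H\). Many papers set \(H=4\times E\). It is a matrix of shape \((E, H)=(768, 3072)\) with a bias of size \(H = 3072\).
c_proj is also a Conv1D layer. It down-projects the hidden activations back to the original feature size. It is a matrix of shape \((H, E)=(3072, 768)\) with a bias of size \(E=768\).
act is a NewGELUActivation layer. It applies the GELU activation function to the output of c_fc, before c_proj. It has no trainable parameters.
dropout is a Dropout layer, same as before.

Params = c_fc + c_proj + act + dropout = \([E\times H + H]+[H\times E + E] + 0 + 0=2EH+E+H=8E^2+E+H=8*768^2+768+3072=4722432\) (using \(H=4E\)). We can verify this as well: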

H: int = 4 * E
expected_c_fc = E * H + H
expected_c_proj = H * E + E
expected_act = 0
expected_dropout = 0
expected_mlp = expected_c_fc + expected_c_proj + expected_act + expected_dropout
print(f"c_fc | Expected: {expected_c_fc}")
print(f"c_fc | True:     {count_params(model._modules['h'][0].mlp.c_fc)}")
print(f"c_proj | Expected: {expected_c_proj}")
print(f"c_proj | True:     {count_params(model._modules['h'][0].mlp.c_proj)}")
print(f"act | Expected: {expected_act}")
print(f"act | True:     {count_params(model._modules['h'][0].mlp.act)}")
print(f"dropout | Expected: {expected_dropout}")
print(f"dropout | True:     {count_params(model._modules['h'][0].mlp.dropout)}")
print(f"mlp | Expected: {expected_mlp}")
print(f"mlp | True:     {count_params(model._modules['h'][0].mlp)}")

c_fc | Expected: 2362368
c_fc | True:     2362368
c_proj | Expected: 2360064
c_proj | True:     2360064
act | Expected: 0
act | True:     0
dropout | Expected: 0
dropout | True:     0
mlp | Expected: 4722432
mlp | True:     4722432
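The step from \(2EH+E+H\) to \(8E^2+E+H\) relies on \(H=4E\). The sketch below keeps \(H\) as a free argument so the general and the simplified forms can be compared. The function name `mlp_params` and the \(H=2E\) variant are hypothetical, added only for illustration.

```python
def mlp_params(E: int, H: int) -> int:
    """Parameters of the feed-forward block: up-projection plus down-projection."""
    c_fc = E * H + H       # (E, H) weight and H biases
    c_proj = H * E + E     # (H, E) weight and E biases
    return c_fc + c_proj   # activation and dropout have no parameters

E = 768
general = mlp_params(E, 4 * E)        # GPT2 uses H = 4 * E
simplified = 8 * E * E + 5 * E        # 2EH + E + H with H = 4E substituted
print(general, simplified)

# A hypothetical model with a narrower hidden layer, H = 2 * E:
narrow = mlp_params(E, 2 * E)
print(narrow)
```

With \(H=4E\), the feed-forward block holds roughly twice as many parameters as the attention block, since \(8E^2\) compares against \(4E^2\).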

Other Layers

ln_f is also a LayerNorm layer and works the same way as the earlier ones.

Params = 2 * E = 2 * 768 = 1536

expected_ln_f = 2 * E
print(f"ln_f | Expected: {expected_ln_f}")
print(f"ln_f | True:     {count_params(model._modules['ln_f'])}")

ln_f | Expected: 1536
ln_f | True:     1536

Putting It Together: The Total Parameter Count of GPT2

\[\begin{aligned} C &= embed\_layers + transformer\_layers + other \\ &= (wte+wpe)+L*(ln_1+attn+ln_2+mlp)+ln_f \\ &= (VE + PE)+L(2E+(4E^2+4E)+2E+(2EH+E+H))+2E \\ &= E(V+P)+L(4E^2+2EH+9E+H)+2E \\ &= E(V+P)+L(12E^2 + 13E)+2E \qquad (H=4E) \end{aligned} \]

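\(V\) = vocabulary size, i.e. the number of distinct tokens (50257 for GPT2)
\(E\) = size of the embedding vector (768 for GPT2)
\(P\) = maximum sequence length the model can handle (1024 for GPT2)
\(H\) = size of the hidden layer of the feed-forward network inside each transformer block (3072 for GPT2)
\(L\) = number of transformer layers (12 for GPT2)

This gives us the formula for C, and we can use it to recompute the size of GPT2: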

\[\begin{aligned} C &= E(V+P)+L(12E^2 + 13E)+2E \\ &= 768(50257+1024)+12(12 *768^2+13*768)+768*2 \\ &= 124439808 \end{aligned} \]

Again, we can verify the result with code:

L: int = model.config.n_layer
expected_gpt2: int = E * (V + P) + L * (12 * E * E + 13 * E) + (2 * E)
print(f"gpt2 | Expected: {expected_gpt2}")
print(f"gpt2 | True:     {count_params(model)}")

gpt2 | Expected: 124439808
gpt2 | True:     124439808
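The same bookkeeping can be wrapped into a standalone function that needs no model object and keeps \(H\) explicit instead of assuming \(H=4E\). This is a sketch under the layer structure analyzed above; the function name `gpt2_param_count` is an assumption introduced here.

```python
from typing import Optional

def gpt2_param_count(V: int, E: int, P: int, L: int, H: Optional[int] = None) -> int:
    """Parameter count for the GPT2Model layout analyzed above."""
    H = 4 * E if H is None else H       # default to the common H = 4E choice
    embeddings = V * E + P * E          # wte + wpe
    layer_norm = 2 * E                  # gamma and beta
    attn = 4 * E * E + 4 * E            # c_attn + c_proj
    mlp = 2 * E * H + E + H             # c_fc + c_proj
    block = layer_norm + attn + layer_norm + mlp
    return embeddings + L * block + layer_norm   # trailing term is ln_f

total = gpt2_param_count(V=50257, E=768, P=1024, L=12)
print(f"{total} ({total / 1e6:.2f}M)")
```

Passing a different `H` covers configurations where the feed-forward width is not four times the embedding size, a case the closed form \(12E^2+13E\) does not handle.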

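And that is where the size of GPT2 comes from: 124.44M trainable parameters, together with a closed-form formula for estimating it (exact for GPT2, where \(H=4E\)):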

\[E(V+P)+L(12E^2 + 13E)+2E \]
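The parameter count also gives a first-order answer to the memory question raised at the start. The sketch below is a rough lower bound that covers the weights only, assuming 4 bytes per float32 parameter and 2 bytes per float16 parameter. It ignores activations, gradients, optimizer state and any non-parameter buffers. The helper name `weight_memory_mib` is hypothetical.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weight_memory_mib(n_params: int, dtype: str = "float32") -> float:
    """Memory needed to hold the weights alone, in MiB (2**20 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**20

n_params = 124_439_808                      # GPT2 parameter count derived above
fp32_mib = weight_memory_mib(n_params, "float32")
fp16_mib = weight_memory_mib(n_params, "float16")
print(f"float32: {fp32_mib:.1f} MiB, float16: {fp16_mib:.1f} MiB")
```

Training needs several times more than this, because gradients and optimizer state scale with the same parameter count.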

Posted 2025-03-10 13:58 by PowerZZJ
