Quantization: fp16, bf16, int8, fp4, nf4
1 GPU Memory Usage
1.1 How to Compute
How do we compute GPU memory usage? For full-precision (fp32) training with AdamW, the per-parameter cost breaks down as:
Model weights: 4 Bytes * num_param
Optimizer states: 4 Bytes * 2 * num_param (AdamW keeps two moment estimates per parameter)
Gradients: 4 Bytes * num_param
Forward activations: depend on batch size, sequence length, and model architecture, not only on num_param
Sum: roughly 16 Bytes * num_param, plus the activation memory
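The breakdown above is easy to turn into a quick estimator. A minimal sketch; the 7B parameter count is illustrative, and activations are left out because they depend on batch size and sequence length:

```python
def estimate_training_memory_gb(num_params: int, bytes_per_param: int = 4) -> float:
    """Rough fp32 + AdamW estimate: weights + optimizer states + gradients."""
    weights = bytes_per_param * num_params
    optimizer = bytes_per_param * 2 * num_params  # AdamW: two moments per parameter
    gradients = bytes_per_param * num_params
    return (weights + optimizer + gradients) / 1024**3  # activations not included

# Example: a 7-billion-parameter model trained in full fp32
print(f"{estimate_training_memory_gb(7_000_000_000):.0f} GB")  # ~104 GB + activations
```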
1.2 How to Reduce
Strategy 1: Reduce activation and optimizer memory with training-time techniques (summarized in the table below, with a configuration sketch after it)
| Optimization Strategy | What It Reduces | Description | Training Time |
|---|---|---|---|
| Baseline | - | plain full-precision fine-tuning | - |
| + Gradient Accumulation | Forward activations | run several small micro-batches and accumulate their gradients instead of one large batch | roughly unchanged |
| + Gradient Checkpointing (`gradient_checkpointing=True` in `TrainingArguments`) | Forward activations | do not save intermediate activations; recompute them during the backward pass | more time -> less memory |
| + Adafactor Optimizer | Optimizer states | keeps factored second-moment statistics instead of AdamW's full per-parameter states | roughly unchanged |
| + Freeze Model | Forward activations / Gradients | frozen layers need no gradients or optimizer states | less time |
| + Data Length | Forward activations | shorter input sequences produce smaller activations | less time |
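Several rows of the table map directly onto Hugging Face `TrainingArguments`. A minimal configuration sketch; the output directory and step counts are placeholders, and `model` / `train_dataset` are assumed to exist already:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",                # placeholder output path
    per_device_train_batch_size=1,   # small micro-batch...
    gradient_accumulation_steps=32,  # ...accumulated into an effective batch of 32
    gradient_checkpointing=True,     # recompute activations during the backward pass
    optim="adafactor",               # factored optimizer states instead of AdamW's
)

# Assumed to be defined elsewhere: model, train_dataset
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```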
Strategy 2: Reduce the number of trainable parameters
PEFT (Prompt Tuning, LoRA, ...) fine-tunes a small set of added parameters while the base model stays frozen.
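A minimal LoRA sketch with the peft library, using GPT-2 as a small stand-in model (target module names vary by architecture):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["c_attn"],  # GPT-2's attention projection; names vary by model
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable
```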
Strategy 3: Reduce the number of bytes each parameter occupies
The default precision is single precision, written fp32, which uses 32 bits to represent one number. Lower-precision formats shrink every parameter:
| Name | Abbreviation | Size (Bytes) | Size (bits) |
|---|---|---|---|
| Single-precision floating point | fp32 | 4 Bytes | 32 bits |
| Half-precision floating point | fp16 | 2 Bytes | 16 bits |
| Brain floating point (bfloat16) | bf16 | 2 Bytes | 16 bits |
| 8-bit integer | int8 | 1 Byte | 8 bits |
| 4-bit floating point | fp4 | 0.5 Bytes | 4 bits |
| 4-bit NormalFloat | nf4 | 0.5 Bytes | 4 bits |
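The per-element sizes in the table can be verified for the dtypes PyTorch exposes natively; fp4 and nf4 come from bitsandbytes and have no torch dtype of their own. A small check:

```python
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16, torch.int8):
    size = torch.tensor([], dtype=dtype).element_size()
    print(f"{dtype}: {size} Byte(s) per element")
# fp4 and nf4 are bitsandbytes quantization formats: two 4-bit values
# share one byte, so there is no corresponding native torch dtype.
```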
2 Precision
02 - Half precision & LLaMA 2
03 - Half precision & ChatGLM 3
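Sections 02 and 03 load models in half precision. A minimal sketch of the general pattern with transformers; the Hub ids below are assumptions standing in for whatever checkpoints the course actually uses:

```python
import torch
from transformers import AutoModelForCausalLM

# fp16 weights take 2 Bytes per parameter, half of fp32's 4 Bytes
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # assumed Hub id for LLaMA 2
    torch_dtype=torch.float16,   # or torch.bfloat16 for the bf16 format
    device_map="auto",
)

# ChatGLM 3 ships custom modeling code, so it additionally needs
# trust_remote_code=True when loading (assumed id: "THUDM/chatglm3-6b").
```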
04 - 8 Bit
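A minimal 8-bit loading sketch via bitsandbytes' `BitsAndBytesConfig`; the model id is a placeholder:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)  # LLM.int8() weight quantization

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",      # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",               # needs a CUDA GPU and the bitsandbytes package
)
```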
05 - 4 Bit & QLoRA
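QLoRA combines nf4 quantization of the frozen base weights with a trainable LoRA adapter. A minimal sketch; hyperparameters are common defaults rather than values from this document:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat from the table above
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

model = prepare_model_for_kbit_training(model)  # cast norms, enable input grads
model = get_peft_model(model, LoraConfig(task_type=TaskType.CAUSAL_LM, r=8))
model.print_trainable_parameters()  # only the LoRA matrices train, in higher precision
```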
