llama-factory fine-tuning 2 (concepts and technologies explained)

training methods

Reward Modeling

 train a reward model on human preference data (pairs of chosen and rejected responses) so that it learns to score model outputs; the reward model later supplies the training signal for PPO.

PPO training

 reinforcement learning with PPO (Proximal Policy Optimization): the supervised fine-tuned model is further optimized to maximize the reward model's scores, i.e. the classic RLHF stage.

DPO training

 DPO (Direct Preference Optimization) trains directly on preference pairs with a simple classification-style loss, so it needs no separate reward model and no PPO loop.
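 For reference, the objective from the original DPO paper looks as follows (the notation, including \beta, the reference policy \pi_{\mathrm{ref}}, and the preference dataset \mathcal{D}, is taken from that paper, not from this post):

 \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

 Here y_w is the preferred response, y_l is the rejected one, and \sigma is the logistic function.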

full-parameter

 fine-tune all of the model's weights; this is the most expressive option but also needs the most GPU memory.

partial-parameter

 freeze some weights and update the others; a layer is included in or excluded from training by marking it as trainable or not (e.g. layer.trainable = True/False in Keras-style APIs). A PyTorch sketch follows below.
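In PyTorch, which llama-factory builds on, the same idea is expressed through requires_grad instead of a trainable attribute. A minimal sketch, assuming a LLaMA-style Hugging Face model where only the last two decoder blocks should stay trainable (the checkpoint name and the number of unfrozen blocks are placeholders, not something prescribed by this post):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")   # placeholder checkpoint

# Freeze everything first ...
for param in model.parameters():
    param.requires_grad = False

# ... then unfreeze only the last two decoder blocks (partial-parameter / "freeze" tuning).
for block in model.model.layers[-2:]:
    for param in block.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")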

LoRA

 LoRA (Low-Rank Adaptation) keeps the pretrained weights frozen and learns a small low-rank update for selected weight matrices, which shrinks the number of trainable parameters by several orders of magnitude.
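 In the notation of the original LoRA paper (these symbols do not appear elsewhere in this post), a frozen weight matrix W_0 is augmented as

 h = W_0 x + \Delta W\, x = W_0 x + B A x, \qquad W_0 \in \mathbb{R}^{d \times k},\; B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

 and only A and B are trained while W_0 stays frozen.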

QLoRA

 QLoRA combines LoRA with a quantized base model: the frozen pretrained weights are stored in 4-bit (NF4) precision while the LoRA adapters are trained in higher precision, so much larger models fit on a single GPU.
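A minimal sketch of the usual Hugging Face setup (transformers + bitsandbytes + peft); the checkpoint name and hyperparameter values are placeholders, not values prescribed by llama-factory:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base model in 4-bit NF4, computing in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder checkpoint
    quantization_config=bnb_config,
)

# Attach trainable LoRA adapters on top of the quantized weights.
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()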

command parameters

fp16

Several numeric data types are used on NVIDIA GPUs, such as fp16, fp32, fp64, bf16, tf32, and INT8.

Most AI floating-point math uses 16-bit "half" precision (FP16) or 32-bit "single" precision (FP32); 64-bit "double" precision (FP64) is aimed at professional/HPC computations. FP32 has historically been the default for AI training, but it gets no Tensor Core acceleration. NVIDIA's Ampere architecture introduced TF32, which lets AI training use Tensor Cores by default: non-tensor operations continue to use the FP32 data path, while TF32 Tensor Cores read FP32 data, keep the same range as FP32 with reduced internal precision, and produce standard IEEE FP32 output.

Generally, fp32 takes more device memory than fp16: each fp32 value occupies 4 bytes while each fp16 value occupies 2 bytes.
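In practice, enabling fp16 switches training to mixed precision, which roughly corresponds to PyTorch automatic mixed precision. A minimal sketch of the idea with a placeholder model (not llama-factory's actual internals):

import torch
import torch.nn.functional as F

model = torch.nn.Linear(1024, 1024).cuda()                  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                        # rescales the loss so fp16 gradients do not underflow

x = torch.randn(8, 1024, device="cuda")
target = torch.randn(8, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = F.mse_loss(model(x), target)                     # forward pass runs in fp16 where it is safe

scaler.scale(loss).backward()                               # backward on the scaled loss
scaler.step(optimizer)                                      # unscale gradients, then apply the update
scaler.update()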

 

gradient_accumulation_steps

 accumulate the gradients over several small batches and only update the weights once per accumulation cycle; this simulates a larger effective batch size while keeping the per-step memory low.
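The pattern inside the training loop looks roughly like this (placeholder model and data; the trainer handles this internally, the sketch just illustrates the idea):

import torch

# Placeholder model, optimizer, and data; only the accumulation pattern matters here.
model = torch.nn.Linear(128, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = [(torch.randn(4, 128), torch.randn(4, 1)) for _ in range(8)]

accumulation_steps = 4   # plays the role of gradient_accumulation_steps

optimizer.zero_grad()
for step, (x, target) in enumerate(data):
    loss = torch.nn.functional.mse_loss(model(x), target)
    (loss / accumulation_steps).backward()       # gradients from each micro-batch add up
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                         # one weight update per accumulation cycle
        optimizer.zero_grad()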

lr_scheduler_type cosine

 the learning rate scheduler: it adjusts the learning rate over the course of training according to a chosen policy. Common choices are polynomial decay and piecewise-constant schedules; beyond that, cosine learning rate schedules have been found to work well empirically on some problems.

cosine scheduler

 For training steps t = 0, ..., T, the cosine schedule sets

 \eta_t = \eta_T + \frac{\eta_0 - \eta_T}{2} \left( 1 + \cos\!\left( \frac{\pi t}{T} \right) \right)

 where \eta_0 is the initial learning rate and \eta_T is the target rate at step T, so the learning rate decays smoothly from \eta_0 down to \eta_T.

look here for more details.
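To make the formula concrete, here is a tiny sketch of the schedule itself (the step count and learning-rate values are made-up):

import math

eta_0, eta_T, T = 1e-4, 1e-6, 100   # initial LR, final LR, total steps (placeholders)

def cosine_lr(t):
    # eta_t = eta_T + (eta_0 - eta_T) / 2 * (1 + cos(pi * t / T))
    return eta_T + (eta_0 - eta_T) / 2 * (1 + math.cos(math.pi * t / T))

for t in (0, 25, 50, 75, 100):
    print(t, cosine_lr(t))   # decays smoothly from 1e-4 down to 1e-6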

lora_target

 The specific options for lora_target vary with the architecture of the model; LoRA weights are usually added to transformer-based models. The common options are listed here (a short sketch after this list shows where these module names come from):

 q_proj Query Projection: targets the query projection in the attention mechanism;

 k_proj Key Projection: targets the key projection in the attention mechanism;

 o_proj Output Projection: targets the output projection in the attention mechanism;

 ff Feed-Forward Network Layers: targets the feed-forward network layers within the transformer block;

 all All Layers: applies LoRA to all applicable layers or components in the model. 
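To see where these module names come from, one can list the linear sub-layers inside one decoder block of a LLaMA-style model (the checkpoint below is only a placeholder; because module names differ between architectures, the valid lora_target values differ too):

import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")   # placeholder checkpoint

# Print the linear sub-layers of the first decoder block; lora_target picks from these names.
for name, module in model.model.layers[0].named_modules():
    if isinstance(module, nn.Linear):
        print(name)   # e.g. self_attn.q_proj, self_attn.k_proj, self_attn.v_proj, self_attn.o_proj, mlp.gate_proj, ...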

overwrite_cache

 whether to overwrite the locally cached preprocessed dataset; it only affects data preprocessing and has no impact on model training itself.

stage

the training method to use; a full example command follows the list below. The options are:

sft means supervised fine-tuning;

 pt means pre-training;

 rm means reward modeling;

 ppo means PPO training;

 dpo means DPO training.
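Putting the parameters above together, a typical LoRA SFT run with llama-factory looked roughly like this at the time of writing (the entry script, checkpoint, and dataset names are placeholders and may differ between versions of the project):

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en \
    --finetuning_type lora \
    --lora_target q_proj,k_proj,o_proj \
    --output_dir ./output/llama2-sft-lora \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --overwrite_cache \
    --fp16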

 
