llama-factory fine-tuning 2 (concepts and technologies explained)

training methods

Reward Modeling

 train a reward model on human preference data (pairs of chosen and rejected responses) so that it learns to score model outputs; the reward model later supplies the training signal for PPO.

PPO training

 reinforcement learning with PPO (Proximal Policy Optimization): the supervised fine-tuned model is further optimized to maximize the reward model's scores, i.e. the classic RLHF stage.

DPO training

 DPO (Direct Preference Optimization) trains directly on preference pairs with a simple classification-style loss, so it needs no separate reward model and no PPO loop.
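 For reference, the objective from the original DPO paper looks as follows (the notation, including \beta, the reference policy \pi_{\mathrm{ref}}, and the preference dataset \mathcal{D}, is taken from that paper, not from this post):

 \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

 Here y_w is the preferred response, y_l is the rejected one, and \sigma is the logistic function.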

full-parameter

 fine-tune all of the model's weights; this is the most expressive option but also needs the most GPU memory.

partial-parameter

 freeze some weights and update the others; a layer is included in or excluded from training by marking it as trainable or not (e.g. layer.trainable = True/False in Keras-style APIs). A PyTorch sketch follows below.
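In PyTorch, which llama-factory builds on, the same idea is expressed through requires_grad instead of a trainable attribute. A minimal sketch, assuming a LLaMA-style Hugging Face model where only the last two decoder blocks should stay trainable (the checkpoint name and the number of unfrozen blocks are placeholders, not something prescribed by this post):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")   # placeholder checkpoint

# Freeze everything first ...
for param in model.parameters():
    param.requires_grad = False

# ... then unfreeze only the last two decoder blocks (partial-parameter / "freeze" tuning).
for block in model.model.layers[-2:]:
    for param in block.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")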

LoRA

 LoRA (Low-Rank Adaptation) keeps the pretrained weights frozen and learns a small low-rank update for selected weight matrices, which shrinks the number of trainable parameters by several orders of magnitude.
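 In the notation of the original LoRA paper (these symbols do not appear elsewhere in this post), a frozen weight matrix W_0 is augmented as

 h = W_0 x + \Delta W\, x = W_0 x + B A x, \qquad W_0 \in \mathbb{R}^{d \times k},\; B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)

 and only A and B are trained while W_0 stays frozen.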

QLoRA

 QLoRA combines LoRA with a quantized base model: the frozen pretrained weights are stored in 4-bit (NF4) precision while the LoRA adapters are trained in higher precision, so much larger models fit on a single GPU.
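A minimal sketch of the usual Hugging Face setup (transformers + bitsandbytes + peft); the checkpoint name and hyperparameter values are placeholders, not values prescribed by llama-factory:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base model in 4-bit NF4, computing in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder checkpoint
    quantization_config=bnb_config,
)

# Attach trainable LoRA adapters on top of the quantized weights.
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()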

command parameters

fp16

Several numeric data types are used on NVIDIA GPUs, such as fp16, fp32, fp64, bf16, tf32, and INT8.

Most AI floating-point math uses 16-bit "half" precision (FP16) or 32-bit "single" precision (FP32); 64-bit "double" precision (FP64) is aimed at professional/HPC computations. FP32 has historically been the default for AI training, but it gets no Tensor Core acceleration. NVIDIA's Ampere architecture introduced TF32, which lets AI training use Tensor Cores by default: non-tensor operations continue to use the FP32 data path, while TF32 Tensor Cores read FP32 data, keep the same range as FP32 with reduced internal precision, and produce standard IEEE FP32 output.

Generally, fp32 takes more device memory than fp16: each fp32 value occupies 4 bytes while each fp16 value occupies 2 bytes.
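In practice, enabling fp16 switches training to mixed precision, which roughly corresponds to PyTorch automatic mixed precision. A minimal sketch of the idea with a placeholder model (not llama-factory's actual internals):

import torch
import torch.nn.functional as F

model = torch.nn.Linear(1024, 1024).cuda()                  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()                        # rescales the loss so fp16 gradients do not underflow

x = torch.randn(8, 1024, device="cuda")
target = torch.randn(8, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = F.mse_loss(model(x), target)                     # forward pass runs in fp16 where it is safe

scaler.scale(loss).backward()                               # backward on the scaled loss
scaler.step(optimizer)                                      # unscale gradients, then apply the update
scaler.update()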

 

gradient_accumulation_steps

 accumulate the gradients over several small batches and only update the weights once per accumulation cycle; this simulates a larger effective batch size while keeping the per-step memory low.
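The pattern inside the training loop looks roughly like this (placeholder model and data; the trainer handles this internally, the sketch just illustrates the idea):

import torch

# Placeholder model, optimizer, and data; only the accumulation pattern matters here.
model = torch.nn.Linear(128, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = [(torch.randn(4, 128), torch.randn(4, 1)) for _ in range(8)]

accumulation_steps = 4   # plays the role of gradient_accumulation_steps

optimizer.zero_grad()
for step, (x, target) in enumerate(data):
    loss = torch.nn.functional.mse_loss(model(x), target)
    (loss / accumulation_steps).backward()       # gradients from each micro-batch add up
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                         # one weight update per accumulation cycle
        optimizer.zero_grad()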

lr_scheduler_type cosine

 the learning rate scheduler: it adjusts the learning rate over the course of training according to a chosen policy. Common choices are polynomial decay and piecewise-constant schedules; beyond that, cosine learning rate schedules have been found to work well empirically on some problems.

cosine scheduler

 For training steps t = 0, ..., T, the cosine schedule sets

 \eta_t = \eta_T + \frac{\eta_0 - \eta_T}{2} \left( 1 + \cos\!\left( \frac{\pi t}{T} \right) \right)

 where \eta_0 is the initial learning rate and \eta_T is the target rate at step T, so the learning rate decays smoothly from \eta_0 down to \eta_T.

look here for more details.
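To make the formula concrete, here is a tiny sketch of the schedule itself (the step count and learning-rate values are made-up):

import math

eta_0, eta_T, T = 1e-4, 1e-6, 100   # initial LR, final LR, total steps (placeholders)

def cosine_lr(t):
    # eta_t = eta_T + (eta_0 - eta_T) / 2 * (1 + cos(pi * t / T))
    return eta_T + (eta_0 - eta_T) / 2 * (1 + math.cos(math.pi * t / T))

for t in (0, 25, 50, 75, 100):
    print(t, cosine_lr(t))   # decays smoothly from 1e-4 down to 1e-6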

lora_target

 The specific options for lora_target vary with the architecture of the model; LoRA weights are usually added to transformer-based models. The common options are listed here (a short sketch after this list shows where these module names come from):

 q_proj Query Projection: targets the query projection in the attention mechanism;

 k_proj Key Projection: targets the key projection in the attention mechanism;

 o_proj Output Projection: targets the output projection in the attention mechanism;

 ff Feed-Forward Network Layers: targets the feed-forward network layers within the transformer block;

 all All Layers: applies LoRA to all applicable layers or components in the model. 
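To see where these module names come from, one can list the linear sub-layers inside one decoder block of a LLaMA-style model (the checkpoint below is only a placeholder; because module names differ between architectures, the valid lora_target values differ too):

import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")   # placeholder checkpoint

# Print the linear sub-layers of the first decoder block; lora_target picks from these names.
for name, module in model.model.layers[0].named_modules():
    if isinstance(module, nn.Linear):
        print(name)   # e.g. self_attn.q_proj, self_attn.k_proj, self_attn.v_proj, self_attn.o_proj, mlp.gate_proj, ...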

overwrite_cache

 whether to overwrite the locally cached preprocessed dataset; it only affects data preprocessing and has no impact on model training itself.

stage

the training method to use; a full example command follows the list below. The options are:

sft means supervised fine-tuning;

 pt means pre-training;

 rm means reward modeling;

 ppo means PPO training;

 dpo means DPO training.
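Putting the parameters above together, a typical LoRA SFT run with llama-factory looked roughly like this at the time of writing (the entry script, checkpoint, and dataset names are placeholders and may differ between versions of the project):

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --dataset alpaca_gpt4_en \
    --finetuning_type lora \
    --lora_target q_proj,k_proj,o_proj \
    --output_dir ./output/llama2-sft-lora \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --overwrite_cache \
    --fp16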

 
