LLMs: LLaMA-Factory PPO fine-tuning with a custom reward function -- 79
1. References
| Method | How it works | Key advantage |
| --- | --- | --- |
| PPO | Optimizes the policy after training a reward model | Aligns with human preferences while avoiding drastic policy changes |
| DPO | Trains the language model directly on human preference data | No reward model needed; simpler, more stable, and more efficient |
https://blog.csdn.net/weixin_42479327/article/details/140000634
LLaMA-Factory configuration
Custom reward function
LLaMA-Factory does not support custom reward functions out of the box, but this can be worked around by modifying the source code. There are two approaches.
Approach 1: reuse the API reward service
LLaMA-Factory supports fetching rewards through an API (set finetuning_args.reward_model_type to api). You can embed a custom reward function by modifying the logic of get_rewards_from_server in src/llamafactory/train/ppo/ppo_utils.py; a sketch of this modification follows below.
One thing to note: in the create_reward_model method called from src/llamafactory/train/ppo/workflow.py, when finetuning_args.reward_model_type is api, the reward_model argument must be a str that starts with http.
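A minimal sketch of this approach, assuming the helper receives the already-decoded messages and returns one score per message; compute_my_reward is a hypothetical scoring function, and the body shown replaces (rather than reproduces) LLaMA-Factory's HTTP call:

# src/llamafactory/train/ppo/ppo_utils.py (sketch of the modified helper)
from typing import List

import torch


def get_rewards_from_server(server_url: str, messages: List[str]) -> List["torch.Tensor"]:
    # The original logic POSTs `messages` to `server_url` and reads back scores;
    # here we score locally instead. `server_url` is kept only for signature
    # compatibility, and `compute_my_reward` is a hypothetical helper that
    # returns a float for one message.
    scores = [compute_my_reward(msg) for msg in messages]
    return torch.Tensor(scores)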
Approach 2: add a new function reward type
Add a new reward_model dispatch rule: for example, if the value starts with function://, call the function named by the remainder of the string to obtain the reward.
First, add branching logic in the workflow to handle how this kind of reward_model is obtained.
Then, in CustomPPOTrainer.get_rewards in src/llamafactory/train/ppo/trainer.py, add the logic that uses the reward function.
Write the reward function somewhere and import it into trainer.py; note that its output must be a FloatTensor.
Also, the default way of obtaining rewards uses an LLM and does not need labels. If your custom reward function needs labels, you must extend the parameters of get_rewards and pass the corresponding data.
Hands-on steps
You need to install transformers==4.45.2.
llamafactory-cli arguments (a launch sketch follows the config):
### model
model_name_or_path: {model_name_or_path}
reward_model: {reward_model} # function://get_rewards_from_function
trust_remote_code: true
reward_model_type: function # this is the key setting
### method
stage: ppo
do_train: true
finetuning_type: lora
lora_rank: {lora_rank}
lora_target: {lora_target}
### dataset
dataset: {dataset}
template: {template}
cutoff_len: 4096
#max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: {output_path}
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: {per_device_train_batch_size}
gradient_accumulation_steps: 8
learning_rate: {learning_rate:.6f}
num_train_epochs: {num_train_epochs}
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
### generate
max_new_tokens: {max_new_tokens}
top_k: {top_k}
top_p: {top_p}
# gradient_checkpointing: true
# dataloader_num_workers: 0
# dataloader_pin_memory: false
report_to: wandb
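The values in curly braces above are placeholders. One way to render the template and launch training (a sketch; the file names and example values are assumptions, not from the original setup):

# fill_and_run_ppo.py -- render the YAML template above and start training
import subprocess

yaml_template = open("ppo_template.yaml").read()  # assumed filename holding the template above
config = yaml_template.format(
    model_name_or_path="Qwen/Qwen2.5-7B-Instruct",  # example values, replace with your own
    reward_model="function://get_rewards_from_function",
    lora_rank=8,
    lora_target="all",
    dataset="my_ppo_dataset",
    template="qwen",
    output_path="saves/ppo_custom_reward",
    per_device_train_batch_size=1,
    learning_rate=1e-5,
    num_train_epochs=1,
    max_new_tokens=512,
    top_k=0,
    top_p=0.9,
)
with open("ppo_config.yaml", "w") as f:
    f.write(config)

# llamafactory-cli train <config>.yaml is the standard way to launch a training run
subprocess.run(["llamafactory-cli", "train", "ppo_config.yaml"], check=True)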
In the reward-model loading logic (create_reward_model), add a branch for the function type:
elif finetuning_args.reward_model_type == "function":
    assert finetuning_args.reward_model.startswith("function://"), "Please provide full function name."
    logger.info_rank0(f"Use reward function {finetuning_args.reward_model}")
    return finetuning_args.reward_model
Pass mini_batch_labels into get_rewards in the training loop:
for idx in range(0, self.config.batch_size, self.config.mini_batch_size):
    mini_batch_queries, mini_batch_responses = self.get_inputs(
        batch[idx : idx + self.config.mini_batch_size]
    )
    mini_batch_labels = batch[idx : idx + self.config.mini_batch_size]["labels"]
    mini_batch_rewards = self.get_rewards(mini_batch_queries, mini_batch_responses, mini_batch_labels)
    queries.extend(mini_batch_queries)
    responses.extend(mini_batch_responses)
    rewards.extend(mini_batch_rewards)
Correspondingly, update the parameters of get_rewards:
@torch.no_grad()
def get_rewards(
    self,
    queries: List["torch.Tensor"],
    responses: List["torch.Tensor"],
    labels: List["torch.Tensor"],
) -> List["torch.Tensor"]:
The logic in get_rewards for obtaining rewards through the reward function. Note that the labels are decoded directly here; because they contain invalid token IDs (for example the -100 ignore index), they must be filtered before decoding.
from .ppo_utils import get_rewards_from_function

def _filter_valid_tokens(self, token_sequences: List["torch.Tensor"]) -> List["torch.Tensor"]:
    """Filter out invalid token IDs so that batch_decode does not fail."""
    filtered_sequences = []
    for seq in token_sequences:
        # drop negative IDs (e.g. the -100 ignore index) and IDs outside the vocabulary
        valid_tokens = seq[(seq >= 0) & (seq < self.tokenizer.vocab_size)]
        filtered_sequences.append(valid_tokens)
    return filtered_sequences

if self.finetuning_args.reward_model_type == "function":
    # use the generic filter to ensure every token ID is valid
    filtered_queries = self._filter_valid_tokens(queries)
    filtered_responses = self._filter_valid_tokens(responses)
    filtered_labels = self._filter_valid_tokens(labels)
    queries_str = self.tokenizer.batch_decode(filtered_queries, skip_special_tokens=True)
    responses_str = self.tokenizer.batch_decode(filtered_responses, skip_special_tokens=True)
    labels_str = self.tokenizer.batch_decode(filtered_labels, skip_special_tokens=True)
    return get_rewards_from_function(queries_str, responses_str, labels_str)
The custom reward business logic itself, get_rewards_from_function in ppo_utils.py (a quick sanity check follows the code):
import json
import re
from typing import Dict, List, Optional  # these imports may already exist in ppo_utils.py

import torch


def get_rewards_from_function(queries: List[str], responses: List[str], labels: List[str]) -> List["torch.Tensor"]:
    r"""
    Gets reward scores from the function.
    query: history + instruction + query
    response: response
    label: ground truth
    """

    def extract_json_object(response: str) -> Optional[Dict[str, str]]:
        try:
            # try to find a JSON object in the response
            json_match = re.search(r"\{.*\}", response, re.DOTALL)
            if json_match:
                json_str = json_match.group(0)
                parsed = json.loads(json_str)
                if isinstance(parsed, dict):
                    # ensure all expected fields are present
                    return {
                        "attack_type": parsed.get("attack_type", ""),
                        "device": parsed.get("device", ""),
                        "target": parsed.get("target", ""),
                    }
            # no valid JSON found
            return None
        except Exception as parse_error:
            print(f"Error parsing response: {parse_error}")
            print(f"Raw response: {response}")
            # return None on parse error
            return None

    response_json_objects = [extract_json_object(response) for response in responses]
    label_json_objects = [extract_json_object(label) for label in labels]

    rewards = []
    for i in range(len(response_json_objects)):
        reward_format = 0
        reward_match = 0
        reward_illusion = 0
        if response_json_objects[i] is None:
            # penalize output that is not valid JSON
            reward_format = -1
        else:
            if "missing" in labels[i].lower():  # check the raw label text for the guidance case
                # Follow-up-guidance case: the score covers both giving the guidance and the
                # missing information, so an extra missing-info check is added.
                # Extract the text between the words "Missing" and "information" in labels[i].
                label_text = labels[i]
                missing_info = ""
                missing_match = re.search(r"Missing(.*?)information", label_text, re.IGNORECASE | re.DOTALL)
                if missing_match:
                    missing_info = missing_match.group(1).strip()
                if missing_info in responses[i]:
                    if response_json_objects[i] == label_json_objects[i]:
                        reward_match = 1
                    else:
                        reward_match = -1
                else:
                    reward_match = -1
            for key, value in response_json_objects[i].items():
                # heavily penalize hallucinated values
                if isinstance(value, str):
                    if value not in labels[i]:
                        reward_illusion -= 0.5
                elif isinstance(value, list):
                    for item in value:
                        if item not in labels[i]:
                            reward_illusion -= 0.5
        reward = reward_format + reward_match + reward_illusion
        rewards.append(reward)
    return torch.Tensor(rewards)
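Before wiring the function into the trainer, it can be sanity-checked standalone; the example strings below are invented for illustration:

# standalone sanity check for get_rewards_from_function (example data is made up)
queries = ["Analyze the following alert log ..."]
labels = ['{"attack_type": "ddos", "device": "router-1", "target": "10.0.0.5"}']

good = ['Result: {"attack_type": "ddos", "device": "router-1", "target": "10.0.0.5"}']
print(get_rewards_from_function(queries, good, labels))
# tensor([0.]) -- valid JSON, no missing-info case, no hallucinated values

bad = ['{"attack_type": "ddos", "device": "firewall-9", "target": "10.0.0.5"}']
print(get_rewards_from_function(queries, bad, labels))
# tensor([-0.5000]) -- "firewall-9" never appears in the label, so one hallucination penalty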
At this point training fails with a KeyError on labels, because LLaMA-Factory's default preprocessing removes keys it considers unused. Note that this cannot be fixed simply by setting the LLaMA-Factory training argument training_args.remove_unused_columns = False: CustomPPOTrainer inherits from trl's PPOTrainer, which uses PPOConfig, and PPOConfig defaults remove_unused_columns to True when it is not set at initialization. Therefore, make the following change:
ppo_config = PPOConfig(
    model_name=model_args.model_name_or_path,
    learning_rate=training_args.learning_rate,
    mini_batch_size=training_args.per_device_train_batch_size,
    batch_size=backward_batch_size * finetuning_args.ppo_buffer_size,
    gradient_accumulation_steps=training_args.gradient_accumulation_steps,
    ppo_epochs=finetuning_args.ppo_epochs,
    max_grad_norm=training_args.max_grad_norm,
    seed=training_args.seed,
    optimize_device_cache=True,
    target=finetuning_args.ppo_target,
    use_score_scaling=finetuning_args.ppo_score_norm,
    use_score_norm=finetuning_args.ppo_score_norm,
    whiten_rewards=finetuning_args.ppo_whiten_rewards,
    accelerator_kwargs={"step_scheduler_with_optimizer": False},
    log_with=training_args.report_to[0] if training_args.report_to else None,
    project_kwargs={"logging_dir": training_args.logging_dir},
    remove_unused_columns=False,  # added: keep the labels column for the custom reward function
)
