Summary:
Abstract. Tool: In-Context Learning. Tool 1: In-Context Attack. Tool 2: In-Context Defense. Task: modulate the alignment of LLMs, especially safety alignmen…
posted @ 2025-03-27 22:57
雪溯
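Summary:
Abstract. Task: detecting model misbehaviors, including subverting tests, deceiving users, and giving up early. Method: allow the CoT to generate evil thoughts, and use an additional LLM as a monitor over the CoT, intermediate actions, and final outputs. Findings: …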
posted @ 2025-03-27 22:57
雪溯
