Post archive for March 27, 2025 - 雪溯

March 27, 2025

Proj. CLJ Paper Reading: Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations

Summary: Abstract. Tool: In-Context Learning; Tool1: In-Context Attack; Tool2: In-Context Defense. Task: modulate the alignment of LLMs, especially safety alignment…

posted @ 2025-03-27 22:57 雪溯

Proj. CRI Paper Reading: Detecting misbehavior in frontier reasoning models

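Summary: Abstract. Task: detecting model misbehaviors, including sabotaging tests, deceiving users, and giving up early. Method: allow the CoT to express malicious thoughts, and use an additional LLM as a monitor that watches the CoT, intermediate actions, and final outputs. Findings: …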

posted @ 2025-03-27 22:57 雪溯

雪溯

In short, when I'm in a bad mood I usually come here and work through a couple of OJ problems; I also keep some of my notes here.
