雪溯 - 博客园

2025年3月27日

Proj. CLJ Paper Reading: Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations

摘要： Abstract Tool: In-Context Learning Tool1: In-Context Attack Tool2: In-Context Defense Task: modulate the alignment of LLMs, especially Safety alignmen 阅读全文

posted @ 2025-03-27 22:57 雪溯阅读(25) 评论(0) 推荐(0)

Proj. CRI Paper Reading: Detecting misbehavior in frontier reasoning models

摘要： Abstract Task: Detecting model misbehaviors，包括：破坏测试，欺骗用户和提前放弃 Method: 允许CoT生成邪恶想法，使用额外的LLM做Monitor监控CoT，intermediate actions和final outputs Findings: 如阅读全文

posted @ 2025-03-27 22:57 雪溯阅读(14) 评论(0) 推荐(0)

2025年2月26日

Proj CJI Paper Reading: CycleResearcher: Improving Automated Research via Automated Review

摘要： Abstract Background: automating the entire research process with open-source LLMs remains largely unexplored Task: using open-source post-trained LLMs 阅读全文

posted @ 2025-02-26 22:52 雪溯阅读(38) 评论(0) 推荐(0)

2025年2月9日

Proj CJI Paper Reading: Fight Back Against Jailbreaking via Prompt Adversarial Tuning

摘要： Abstract 背景: adversarial training paradigm Tool: Prompt Adversarial Tuning Task: trains a prompt control attached to the user prompt as a guard prefix 阅读全文

posted @ 2025-02-09 22:30 雪溯阅读(38) 评论(0) 推荐(0)

2025年2月8日

Proj CJI Paper Reading: SMOOTHLLM: Defending Large Language Models Against Jailbreaking Attacks

摘要： Abstract 背景：对抗性prompts对字符层次的变化很敏感 Task: Defense adversarial prompts by randomly perturbs multiple copies of a prompt then aggregates the responsees o 阅读全文

posted @ 2025-02-08 21:50 雪溯阅读(45) 评论(0) 推荐(0)

Proj CJI Paper Reading: Detecting language model attacks with perplexity

摘要： Abstract Tool: PPL Findings: queries with adversarial suffixes have a higher perplexity, 可以利用这一点检测仅仅使用perplexity filter对mix of prompt types不合适，会带来很高的阅读全文

posted @ 2025-02-08 01:46 雪溯阅读(45) 评论(0) 推荐(0)

Proj CJI Paper Reading: You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense

摘要： Abstract 背景：现有的研究更多聚焦于拦截效果而忽视了可用性和性能 Benchmark: USEBench Metric: USEIndex Study: 7LLMs findings 主流的defenses机制往往不能兼顾安全和性能 (vertical comparisons?) 开发者往往阅读全文

posted @ 2025-02-08 01:46 雪溯阅读(121) 评论(0) 推荐(0)

2025年2月4日

Proj CJI Paper Reading: Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast

摘要： Abstract Background: adversarial images/prompts can jailbreak Multimodal large language model and cause unaligned behaviors 本文报告了在multi-agent + MLLM环境阅读全文

posted @ 2025-02-04 19:02 雪溯阅读(22) 评论(0) 推荐(0)

2025年2月3日

Proj CJI Paper Reading: Adversarial Demonstration Attacks on Large Language Models

摘要： Abstract 本文: Tools advICL Task: use demonstrations without changing the input to make LLM misclassify, the user input is known and fixed 特点：无法控制input，阅读全文

posted @ 2025-02-03 02:42 雪溯阅读(21) 评论(0) 推荐(0)

2025年2月2日

Proj CJI Paper Reading: A Comprehensive Survey of Attack Techniques, Imple-mentation, and Mitigation Strategies in Large Language Models

摘要： Abstract 分析对象 attack on models attack on model applications 阅读全文

posted @ 2025-02-02 02:53 雪溯阅读(10) 评论(0) 推荐(0)

雪溯

总之心情不好的话大概就会来这边做两道OJ，此处顺便储存部分笔记

公告