Proj CJI Paper Reading: SMOOTHLLM: Defending Large Language Models Against Jailbreaking Attacks
Abstract
- Background: adversarial prompts are sensitive to character-level changes
- Task: defend against adversarial prompts by randomly perturbing multiple copies of a prompt and then aggregating the responses of those copies
- Method: randomly perturb multiple copies of a given input prompt, generate a response for each copy, then aggregate the corresponding judgements (whether each response is a jailbreak) to detect adversarial inputs (see the sketch after this list)
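
A minimal sketch of the perturb-then-aggregate idea described above, assuming hypothetical helpers `generate` (the target LLM's generation call) and `is_jailbroken` (a jailbreak judge, e.g. keyword matching or a guard model); neither is from the paper's code, and the character-swap perturbation below is only one simple choice of character-level perturbation.

```python
import random
import string

def generate(prompt: str) -> str:
    """Placeholder for the target LLM's generation call (assumption)."""
    raise NotImplementedError

def is_jailbroken(response: str) -> bool:
    """Placeholder jailbreak judge (assumption)."""
    raise NotImplementedError

def perturb(prompt: str, q: float) -> str:
    """Swap a fraction q of the prompt's characters for random
    printable characters (character-level perturbation)."""
    chars = list(prompt)
    n_swap = int(len(chars) * q)
    for i in random.sample(range(len(chars)), n_swap):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def smoothllm(prompt: str, n_copies: int = 10, q: float = 0.1) -> str:
    """Perturb n_copies of the prompt, generate a response per copy,
    judge each response, and aggregate by majority vote."""
    copies = [perturb(prompt, q) for _ in range(n_copies)]
    responses = [generate(p) for p in copies]
    judgements = [is_jailbroken(r) for r in responses]

    # Majority vote over the per-copy jailbreak judgements: if most
    # copies yield a jailbroken response, flag the input as adversarial.
    flagged = sum(judgements) > n_copies / 2
    if flagged:
        return "Request refused: input flagged as adversarial."
    # Otherwise return a response consistent with the majority vote.
    return next(r for r, j in zip(responses, judgements) if not j)
```

The intuition being exploited: because adversarial suffixes (e.g. GCG's) are brittle to character-level changes, most perturbed copies of an attack prompt stop jailbreaking, while a benign prompt's copies mostly stay benign, so the majority vote separates the two.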
Experiments
- Datasets: AdvBench, JBB-Behaviors, InstructionFollowing, PIQA, OpenBookQA, ToxiGen, harmful_behaviors.csv
- Models: Llama2, Vicuna-13b-v1.5, GPT-3.5, GPT-4, PaLM-2, Claude-1, Claude-2, Llama-Guard
- Attacks evaluated against the defense: GCG, PAIR, RANDOMSEARCH, AMPLEGCG, adaptive GCG
- Results
  - Successfully defends against these attacks, but with a non-negligible trade-off between robustness and nominal performance
Q1: If the adversarial prompt is weak, perturbation may break it so that none of the copies output a jailbroken response, and SmoothLLM then cannot detect this malicious prompt. Conversely, if the prompt itself is not malicious but perturbation happens to produce a jailbroken prompt & response, would this benign prompt also be falsely reported?

Q2: Hasn't the jailbreak already happened by detection time (responses for all copies are generated before the judgements are aggregated)?
