Proj CJI Paper Reading: SMOOTHLLM: Defending Large Language Models Against Jailbreaking Attacks
Abstract
- Background: adversarial prompts are sensitive to character-level changes
- Task: defend against adversarial prompts by randomly perturbing multiple copies of a prompt and then aggregating the responses of those copies
- Method: randomly perturb multiple copies of a given input prompt, generate a response for each copy, then aggregate the corresponding judgements (whether each response is a jailbreak) to detect adversarial inputs (see the sketch after this list)
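
A minimal sketch of the perturb-then-aggregate idea described above, assuming hypothetical helpers `generate` (the target LLM's generation call) and `is_jailbroken` (a jailbreak judge, e.g. keyword matching or a guard model); neither is from the paper's code, and the character-swap perturbation below is only one simple choice of character-level perturbation.

```python
import random
import string

def generate(prompt: str) -> str:
    """Placeholder for the target LLM's generation call (assumption)."""
    raise NotImplementedError

def is_jailbroken(response: str) -> bool:
    """Placeholder jailbreak judge (assumption)."""
    raise NotImplementedError

def perturb(prompt: str, q: float) -> str:
    """Swap a fraction q of the prompt's characters for random
    printable characters (character-level perturbation)."""
    chars = list(prompt)
    n_swap = int(len(chars) * q)
    for i in random.sample(range(len(chars)), n_swap):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def smoothllm(prompt: str, n_copies: int = 10, q: float = 0.1) -> str:
    """Perturb n_copies of the prompt, generate a response per copy,
    judge each response, and aggregate by majority vote."""
    copies = [perturb(prompt, q) for _ in range(n_copies)]
    responses = [generate(p) for p in copies]
    judgements = [is_jailbroken(r) for r in responses]

    # Majority vote over the per-copy jailbreak judgements: if most
    # copies yield a jailbroken response, flag the input as adversarial.
    flagged = sum(judgements) > n_copies / 2
    if flagged:
        return "Request refused: input flagged as adversarial."
    # Otherwise return a response consistent with the majority vote.
    return next(r for r, j in zip(responses, judgements) if not j)
```

The intuition being exploited: because adversarial suffixes (e.g. GCG's) are brittle to character-level changes, most perturbed copies of an attack prompt stop jailbreaking, while a benign prompt's copies mostly stay benign, so the majority vote separates the two.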
Experiments
- Datasets: AdvBench, JBB-Behaviors, InstructionFollowing, PIQA, OpenBookQA, ToxiGen, harmful_behaviors.csv
- Models: Llama2, Vicuna-13b-v1.5, GPT-3.5, GPT-4, PaLM-2, Claude-1, Claude-2, Llama-Guard
- Attacks evaluated against the defense: GCG, PAIR, RANDOMSEARCH, AMPLEGCG, adaptive GCG
- Results
  - Successfully defends against these attacks, but with a non-negligible trade-off between robustness and nominal performance
Q1: If the adversarial prompt is weak, perturbation may break it so that none of the copies output a jailbroken response, and SmoothLLM then cannot detect this malicious prompt. Conversely, if the prompt itself is not malicious but perturbation happens to produce a jailbroken prompt & response, would this benign prompt also be falsely reported?

Q2: Hasn't the jailbreak already happened by detection time (responses for all copies are generated before the judgements are aggregated)?
