Proj CJI Paper Reading: SMOOTHLLM: Defending Large Language Models Against Jailbreaking Attacks

Abstract

  • Background: adversarial prompts are brittle to character-level perturbations

  • Task: Defend against adversarial prompts by randomly perturbing multiple copies of a prompt and then aggregating the responses of those copies

  • Method: randomly perturb multiple copies of a given input prompt, generate a response for each copy, and then aggregate the corresponding judgments (jailbroken or not) to detect adversarial inputs; see the sketch below
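
To make the perturb-then-aggregate pipeline concrete, here is a minimal Python sketch of the idea. `query_llm` and `is_jailbroken` are hypothetical placeholders rather than the repo's actual API, and `n_copies` / `q` are illustrative defaults, not the paper's tuned values.

```python
import random
import string

def query_llm(prompt: str) -> str:
    """Placeholder for querying the target LLM (e.g. Vicuna or Llama2)."""
    raise NotImplementedError

def is_jailbroken(response: str) -> bool:
    """Placeholder jailbreak judge; the paper keyword-matches refusal strings."""
    raise NotImplementedError

def perturb_swap(prompt: str, q: float) -> str:
    """Swap perturbation: replace a fraction q of the characters with random
    printable characters. The paper also defines insert and patch variants."""
    chars = list(prompt)
    num = int(q * len(chars))
    for i in random.sample(range(len(chars)), num):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def smooth_llm(prompt: str, n_copies: int = 10, q: float = 0.1) -> str:
    """Perturb n_copies copies of the prompt, query the LLM on each,
    majority-vote on the per-copy jailbreak judgments, and return a
    response consistent with the majority vote."""
    responses = [query_llm(perturb_swap(prompt, q)) for _ in range(n_copies)]
    votes = [is_jailbroken(r) for r in responses]
    majority = sum(votes) * 2 > len(votes)
    consistent = [r for r, v in zip(responses, votes) if v == majority]
    return random.choice(consistent)
```

The key design point is that a GCG-style adversarial suffix is brittle: flipping a small fraction of its characters usually breaks the attack, so most perturbed copies yield refusals and the majority vote lands on the benign side.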

  • Experiments

    • Dataset: AdvBench, JBB-Behaviors, InstructionFollowing, PIQA, OpenBookQA, ToxiGen, harmful_behaviors.csv
    • Models: Llama2, Vicuna-13b-v1.5, GPT-3.5, GPT-4, PaLM-2, Claude-1, Claude-2, Llama-Guard
    • Attacks evaluated against the defense: GCG, PAIR, RandomSearch, AmpleGCG, adaptive GCG
    • Results
      1. SmoothLLM successfully defends against these attacks, but at a non-negligible trade-off between robustness and nominal performance
  • Github: https://github.com/arobey1/smooth-llm

  • Q: If the adversarial prompt is weak, perturbation may leave no jailbroken responses among the other copies, so SmoothLLM cannot detect the malicious prompt. Conversely, if a prompt is not malicious but perturbation happens to produce a jailbroken prompt and response, would that benign prompt be falsely flagged?

  • Q2: Hasn't the jailbreak already happened by the time the responses are aggregated?
