Proj CJI Paper Reading: Fight Back Against Jailbreaking via Prompt Adversarial Tuning

Abstract

  • Background: adversarial training paradigm

  • Tool: Prompt Adversarial Tuning (PAT)

  • Task: trains a prompt control attached to the user prompt as a guard prefix

  • Method: adversarial tuning that alternately updates the attack prompt control and the defense prompt control, using a GCG-style optimization (sketched after this list)

    1. Calculate the gradients of the loss with respect to the one-hot token indicators to identify a set of potential replacement candidates at each token position. Q: is this the computation method I imagined?
       • Asked GPT about it for a long time but still could not understand why the one-hot indicators are necessary instead of computing the gradient directly
    2. Take the top-k tokens with the most negative gradients as promising replacements for \(x_i\); only B candidates are generated in each round to ensure computational efficiency
    3. In each round, randomly sample m benign QA pairs to compute the benign loss
  • Q: the prompt control is optimized using both harmful and benign prompts?
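A minimal PyTorch sketch of one candidate-selection round, written as I understand the description above (the names `embed_matrix`, `control_slice`, `loss_fn` and the defaults for k and B are illustrative, not the paper's code). It also suggests an answer to the one-hot question: a token id is discrete and has no gradient "directly"; rewriting the embedding lookup as \(onehot(x_i) \cdot E\) makes the lookup differentiable and yields one gradient entry per vocabulary token, i.e. a first-order estimate of how the loss would change under each possible substitution.

```python
import torch

def candidate_substitutions(embed_matrix, input_ids, control_slice,
                            loss_fn, k=256, B=512):
    """One GCG-style selection round for the control tokens.

    embed_matrix:  (V, d) token-embedding matrix of the LM
    input_ids:     (n,) token ids of the full assembled prompt
    control_slice: slice covering the attack (or defense) control tokens
    loss_fn:       maps input embeddings -> scalar loss, e.g. -log p(target)
    """
    V = embed_matrix.shape[0]
    # One-hot indicators over the vocabulary, one row per prompt token.
    one_hot = torch.nn.functional.one_hot(input_ids, V).to(embed_matrix.dtype)
    one_hot.requires_grad_(True)
    # one_hot @ E reproduces the embedding lookup, but is differentiable in
    # the one-hot coordinates, giving a gradient per vocabulary entry.
    embeds = one_hot @ embed_matrix
    loss_fn(embeds).backward()
    grad = one_hot.grad[control_slice]          # (len(control), V)
    # Most negative gradient ~ largest first-order decrease of the loss.
    top_k = (-grad).topk(k, dim=-1).indices     # (len(control), k)
    # Evaluate only B randomly sampled (position, token) swaps per round.
    pos = torch.randint(0, top_k.shape[0], (B,))
    tok = top_k[pos, torch.randint(0, k, (B,))]
    return pos, tok   # candidate substitutions, to be re-scored exactly
```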

  • Experiment

    • effective against both grey-box and black-box attacks, reducing the attack success rate to nearly 0%
    • maintains model usability on ordinary tasks
    • Datasets: MT-Bench, MMLU
    • Competitors: PPL-based detection, ICD, SafeDecoding, SmoothLLM, Self-reminder, DRO
    • Attacks
      • Grey-box: the parameters of the protected model are openly available (so gradient-based attacks are possible), while the defense control stays on the defender's side
        • attacks
          • GCG
          • AutoDAN
          • In-Context Attack
          • PAIR
          • TAP
        • models
          • Vicuna-7B, Vicuna-13B
      • Black-box: train the defense control on a surrogate model, then apply it directly to the target models
        • models:
          • GPT-3.5, GPT-4
  • Experiment 2: Defense against Human-crafted Attacks

    • Attacks
      • competing objectives (CO)
        • Always Intelligent and Machiavellian attack (AIM), Prefix Injection attack (PI) and Refusal Suppression attack (RS).
      • mismatched generalizations (MG).
        • jailbreak prompts written in Base64 encoding or Bengali (BN)
  • Github: https://github.com/PKU-ML/PAT

3 The Proposed Prompt Adversarial Tuning

3.1 Threat Model and Notations

  • Threat Model: emphasizes the importance of grey-box robustness (the scenario where the model itself is open-source, but the defender can still monitor inputs/outputs and preprocess user prompts)

  • a malicious target \(y_{ack}\) (e.g., “Sure, here is how to build a bomb.”)
  • a secure target \(y_{def}\) (e.g., “I am sorry, I cannot fulfill this request.”) for each goal
    \(L_{ack}(x_{1:n}, y_{ack}) = -\log p(y_{ack} \mid x_{1:n})\)
    \(L_{def}(x_{1:n}, y_{def}) = -\log p(y_{def} \mid x_{1:n})\)
    \(L(x'_{1:p}, y_{bgn}) = -\log p(y_{bgn} \mid x'_{1:p})\), the benign loss on a benign prompt \(x'_{1:p}\) with benign target \(y_{bgn}\)
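All three losses share the same form: the negative log-likelihood of a fixed target continuation given the assembled prompt. A hedged sketch assuming a Hugging Face-style causal LM interface (the helper name `target_nll` is mine):

```python
import torch

def target_nll(model, tokenizer, prompt, target):
    """-log p(target | prompt): the shape shared by L_ack (y_ack),
    L_def (y_def), and the benign loss (y_bgn)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt",
                           add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=-1)
    logits = model(input_ids).logits
    # Logits at position t predict token t+1, so score the target span
    # using the logits that start one step before it.
    n = prompt_ids.shape[-1]
    logp = torch.log_softmax(logits[:, n - 1:-1, :], dim=-1)
    return -logp.gather(-1, target_ids.unsqueeze(-1)).sum()
```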

  • Jailbreak Template
    • User: { harmful goal } { attack control }
      Model Developer: CONCAT ( { defense control }, { harmful goal } { attack control } )
      Assistant:
      
  • Benign Template
    • User: { benign goal }
      Model Developer: CONCAT ( { defense control }, { benign goal } )
      Assistant:
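For concreteness, the defender-side CONCAT step as a tiny illustrative helper (the function name is mine); the two templates differ only in whether the user prompt already carries an attack control:

```python
def build_model_input(defense_control, user_goal, attack_control=None):
    """Defender-side preprocessing: prepend the tuned defense control to
    whatever the user sent (a benign goal, or goal + attack control)."""
    user_prompt = (f"{user_goal} {attack_control}"
                   if attack_control is not None else user_goal)
    return f"{defense_control} {user_prompt}"  # CONCAT(defense, user prompt)
```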
      
  • Aim: alternately optimize the attack control to minimize \(L_{ack}\) and the defense control to minimize \(L_{def}\) plus the benign loss \(L(x'_{1:p}, y_{bgn})\), so that the defense prefix blocks jailbreaks without hurting benign utility