Proj CJI Paper Reading: Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models
Abstract
- Background:
- Competitors: GCG, which uses gradient-based search to generate adversarial suffixes that jailbreak aligned LLMs
- Drawbacks of GCG: low computational efficiency, and no further investigation of transferability and scalability across different models and datasets
- This paper:
- Tool:
- DeGCG
- Task: a two-stage transfer-learning framework: behavior-agnostic pre-searching (First-Token Searching, FTS) followed by behavior-relevant post-searching (Context-Aware Searching, CAS), evaluated in cross-model, cross-data, and self-transfer scenarios; a code sketch of the two-stage flow follows this block
- Highlight: makes a trade-off between search efficiency and suffix transferability
- stages:
- behavior-agnostic pre-searching
- behavior-relevant post-searching
- Transfer scenarios:
- cross-model
- cross-data
- self-transfer
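To make the two-stage split concrete, here is a minimal Python sketch of how FTS and CAS might chain. All names (`degcg`, `search`, `first_token_loss`, `full_target_loss`, `TokenIds`) are illustrative stand-ins I introduced, not the paper's code; the only assumption is that some GCG-style discrete searcher is first run against a first-token objective, and its output suffix warm-starts the search against the full target sentence.

```python
# Hypothetical sketch of the DeGCG two-stage pipeline; names are
# illustrative, not the authors' API.
from typing import Callable, List

TokenIds = List[int]
Searcher = Callable[[TokenIds, Callable[[TokenIds], float]], TokenIds]

def degcg(
    suffix: TokenIds,                                # initial adversarial suffix
    search: Searcher,                                # any GCG-style discrete searcher
    first_token_loss: Callable[[TokenIds], float],   # NLL of the first target token only
    full_target_loss: Callable[[TokenIds], float],   # NLL of the complete target sentence
) -> TokenIds:
    # Stage 1, FTS (behavior-agnostic pre-searching): push the model's
    # *first* output token toward "Sure"/"Here"; cheap and transferable.
    suffix = search(suffix, first_token_loss)
    # Stage 2, CAS (behavior-relevant post-searching): fine-tune the
    # pre-searched suffix against the full target response.
    suffix = search(suffix, full_target_loss)
    return suffix
```

The point of the decoupling is that stage 1 is behavior-agnostic, so its result can be reused across behaviors, models, and datasets (the three transfer scenarios above).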
- i-DeGCG
- Task: iteratively exploits self-transferability to accelerate the search process; see the sketch below
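Building on the `degcg` sketch above, my reading of i-DeGCG is an iterative, self-transferring wrapper: each round is initialized with the suffix found in the previous round rather than a fresh placeholder string. `i_degcg` and `rounds=3` are hypothetical names and values, not the paper's.

```python
# Hypothetical sketch of i-DeGCG on top of degcg() above: self-transfer
# means each round warm-starts from the previous round's suffix instead
# of a fresh "! ! ! ..." string, which is what speeds up the search.
def i_degcg(suffix, search, first_token_loss, full_target_loss, rounds=3):
    for _ in range(rounds):
        suffix = degcg(suffix, search, first_token_loss, full_target_loss)
    return suffix
```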
- Experiment:
- dataset: HarmBench
- competitors/baseline: GCG; target model: Llama2-chat-7b
- Results:
- outperforms the baseline on Llama2-chat-7b with ASRs of 43.9 (+22.2) and 39.0 (+19.5) on the valid and test sets
- demonstrates the crucial role of first-target-token optimization
- Tool:
- mutators: gradient-based adjustments of the tokens within the suffix.
- Calculating Gradients: computing the gradient of the loss function with respect to (a one-hot encoding of) each token in the suffix.
- the loss function measures how well the LLM produces the targeted harmful response: in effect, it maximizes the log-likelihood of a target sequence, i.e. minimizes the target's negative log-likelihood
1. At the FTS stage:
- target sequence: "Sure", "Here"...
2. At the CAS stage:
- target sequence: "Sure, here is how to make a bomb", "Here is the method of making a bomb"...
- Selecting Candidate Tokens: choosing potential replacement tokens whose gradient entries most strongly decrease the loss (top-k by negative gradient).
- Updating Tokens: replacing the existing tokens with candidate tokens to minimize the loss function.
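Putting the four steps together, below is a self-contained PyTorch sketch of one GCG-style mutation step under HuggingFace causal-LM conventions. `gcg_step` and `target_nll` are names I made up, and `top_k=256` / `n_samples=64` are illustrative values, not the paper's hyperparameters: the gradient is taken with respect to a one-hot relaxation of the suffix, the most loss-decreasing entries shortlist candidate swaps, and the single swap with the lowest true loss is kept.

```python
import torch
import torch.nn.functional as F

def target_nll(model, prompt_ids, suffix_ids, target_ids):
    """True NLL of the target tokens given prompt + suffix (no grad)."""
    ids = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0)
    logits = model(input_ids=ids).logits[0]
    t0 = prompt_ids.shape[0] + suffix_ids.shape[0]
    # logits[i] predicts token i+1, so shift the slice back by one
    return F.cross_entropy(logits[t0 - 1 : t0 - 1 + target_ids.shape[0]],
                           target_ids).item()

def gcg_step(model, prompt_ids, suffix_ids, target_ids,
             top_k=256, n_samples=64):
    # prompt_ids / suffix_ids / target_ids: 1-D LongTensors of token ids.
    for p in model.parameters():          # only the one-hot needs grad
        p.requires_grad_(False)
    embed = model.get_input_embeddings()
    vocab_size = embed.weight.shape[0]

    # 1. Gradients: represent the suffix as differentiable one-hot rows
    #    so we can take d(loss)/d(token choice).
    one_hot = F.one_hot(suffix_ids, vocab_size).float().requires_grad_(True)
    inputs = torch.cat([embed(prompt_ids),
                        one_hot @ embed.weight,
                        embed(target_ids)], dim=0).unsqueeze(0)
    logits = model(inputs_embeds=inputs).logits[0]

    # 2. Loss: NLL of the target sequence. At the FTS stage target_ids
    #    would hold only the first token ("Sure"); at CAS, the whole
    #    target sentence.
    t0 = prompt_ids.shape[0] + suffix_ids.shape[0]
    loss = F.cross_entropy(logits[t0 - 1 : t0 - 1 + target_ids.shape[0]],
                           target_ids)
    loss.backward()

    # 3. Candidate tokens: per position, the top-k entries of the
    #    negative gradient, i.e. swaps expected to decrease the loss.
    candidates = (-one_hot.grad).topk(top_k, dim=1).indices

    # 4. Update: sample single-token swaps and keep the best by true loss.
    best_loss, best_suffix = loss.item(), suffix_ids
    for _ in range(n_samples):
        pos = torch.randint(suffix_ids.shape[0], (1,)).item()
        cand = suffix_ids.clone()
        cand[pos] = candidates[pos, torch.randint(top_k, (1,)).item()]
        with torch.no_grad():
            l = target_nll(model, prompt_ids, cand, target_ids)
        if l < best_loss:
            best_loss, best_suffix = l, cand
    return best_suffix, best_loss
```

Repeating this step until the target NLL plateaus reproduces the overall search loop; DeGCG's contribution lies in how `target_ids` and the suffix initialization change across the two stages, not in the step itself.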
