Proj CJI Paper Reading: Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models

Abstract

  • Background:
    • Competitors: GCG, which uses gradient-based search to generate adversarial suffixes that jailbreak aligned LLMs
    • Drawbacks of GCG: low computational efficiency, and no further investigation of the transferability and scalability of suffixes across different models and data
  • This paper:
    • Tool:
      1. DeGCG
      • Task: a two-stage transfer-learning framework, with behavior-agnostic pre-searching (First-Token Searching, FTS) followed by behavior-relevant post-searching (Context-Aware Searching, CAS), studied in cross-model, cross-data, and self-transfer scenarios (see the pipeline sketch after this section)
      • Characteristic: strikes a balance between search efficiency and suffix transferability
      • stages:
        1. behavior-agnostic pre-searching
        2. behavior-relevant post-searching
      • Transfer scenarios:
        1. cross-model
        2. cross-data
        3. self-transfer
      2. i-DeGCG
      • Task: iteratively exploits self-transferability to speed up the search process (see the pipeline sketch below)
      • Experiment:
        • dataset: HarmBench
        • competitor/baseline: GCG on Llama2-chat-7b
          • Results:
            1. outperforms the baseline on Llama2-chat-7b with ASRs of 43.9 (+22.2) and 39.0 (+19.5) on the validation and test sets
            2. demonstrates the critical role of first-target-token optimization
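
A minimal pipeline sketch of the two stages and the iterative variant, assuming hypothetical helpers `first_token_search` (FTS) and `context_aware_search` (CAS); these names and the round count are my placeholders, not the paper's API:

```python
# Sketch of the DeGCG pipeline and the iterative i-DeGCG variant.
# The two search routines are passed in as callables because the paper's
# actual implementation is not reproduced here.
from typing import Callable, Sequence

SearchFn = Callable[..., str]  # returns an adversarial suffix string

def degcg(first_token_search: SearchFn, context_aware_search: SearchFn,
          source_model, target_model, behaviors: Sequence[str]) -> str:
    # Stage 1 (FTS): behavior-agnostic pre-search on the source model,
    # optimizing only the first target token (e.g. "Sure").
    suffix = first_token_search(source_model)
    # Stage 2 (CAS): behavior-relevant post-search on the target model,
    # refining the transferred suffix against full target sequences.
    # Cross-model / cross-data transfer corresponds to changing the model
    # or the behavior set between the two stages.
    return context_aware_search(target_model, suffix, behaviors)

def i_degcg(first_token_search: SearchFn, context_aware_search: SearchFn,
            model, behaviors: Sequence[str], rounds: int = 3) -> str:
    # i-DeGCG: self-transfer, i.e. the best suffix found in each round
    # seeds the next round's search, accelerating convergence.
    suffix = first_token_search(model)
    for _ in range(rounds):
        suffix = context_aware_search(model, suffix, behaviors)
    return suffix
```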
  • mutators: gradient-based adjustments of the tokens within the suffix (one full search step is sketched in code after this list).
    1. Calculating Gradients: Computing the gradient of the loss function with respect to each token in the suffix.
    • the loss function measures how well the LLM reproduces the targeted harmful response: essentially, it maximizes the log-likelihood of a target sequence (formalized below)
      1. At the FTS stage:
        • target sequence: "Sure", "Here"...
      2. At the CAS stage: ![](https://img2024.cnblogs.com/blog/660274/202502/660274-20250202210222647-1143737195.png)
        • target sequence: "Sure, here is how to make a bomb", "Here is the method of making a bomb"...

    2. Selecting Candidate Tokens: Choosing potential replacement tokens based on the magnitude of the gradient.
    3. Updating Tokens: Replacing the existing tokens with candidate tokens to minimize the loss function.
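
In my own notation (an assumption; the paper's exact formula may differ): with user prompt $x$, adversarial suffix $s$, and target sequence $t = (t_1, \dots, t_m)$, the search minimizes the negative log-likelihood of the target under the victim model $p_\theta$:

$$\mathcal{L}(s) = -\log p_\theta(t \mid x \oplus s) = -\sum_{i=1}^{m} \log p_\theta\big(t_i \mid x \oplus s,\, t_{<i}\big),$$

where $\oplus$ denotes concatenation. FTS is the special case $m = 1$ (only the first target token, e.g. "Sure"), while CAS uses the full behavior-specific target sentence.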
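
A minimal sketch of one gradient-guided substitution step in the GCG style, assuming a HuggingFace-style causal LM; `mutation_step`, the candidate counts, and the argument names are my placeholders, and `embed_matrix` is the model's input-embedding weight detached from autograd:

```python
import torch
import torch.nn.functional as F

def mutation_step(model, embed_matrix, prompt_ids, suffix_ids, target_ids,
                  k=256, num_candidates=64):
    """One gradient-guided token substitution over the adversarial suffix."""
    # 1. Calculating Gradients: represent the suffix as one-hot rows over
    #    the vocabulary so the loss is differentiable w.r.t. each token.
    one_hot = F.one_hot(suffix_ids, embed_matrix.size(0)).to(embed_matrix.dtype)
    one_hot.requires_grad_(True)
    inputs = torch.cat([embed_matrix[prompt_ids],
                        one_hot @ embed_matrix,           # suffix embeddings
                        embed_matrix[target_ids]], dim=0)
    logits = model(inputs_embeds=inputs.unsqueeze(0)).logits[0]
    # Logits at position i predict token i+1, so the slice predicting the
    # target span starts one position before it.
    start = prompt_ids.numel() + suffix_ids.numel()
    loss = F.cross_entropy(logits[start - 1 : start - 1 + target_ids.numel()],
                           target_ids)
    loss.backward()

    # 2. Selecting Candidate Tokens: per suffix position, the top-k tokens
    #    whose one-hot coordinate has the most negative gradient (largest
    #    predicted loss decrease).
    top_k = (-one_hot.grad).topk(k, dim=1).indices        # (len(suffix), k)

    # 3. Updating Tokens: sample single-token swaps; the caller re-scores
    #    each candidate suffix and keeps the one with the lowest loss.
    candidates = []
    for _ in range(num_candidates):
        cand = suffix_ids.clone()
        pos = torch.randint(suffix_ids.numel(), (1,)).item()
        cand[pos] = top_k[pos, torch.randint(k, (1,)).item()]
        candidates.append(cand)
    return loss.item(), candidates
```

At the FTS stage `target_ids` would hold a single token (e.g. "Sure"); at the CAS stage it holds the full behavior-specific target sentence.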