Proj CJI Paper Reading: Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models

Abstract

  • Background:
    • Competitors: GCG, which uses gradient-based search to generate adversarial suffixes that jailbreak aligned LLMs
    • Drawbacks of GCG: low computational efficiency, and no further investigation of the transferability and scalability of suffixes across different models and data
  • This paper:
    • Tool:
      1. DeGCG
      • Task: a two-stage transfer-learning framework, with behavior-agnostic pre-searching (First-Token Searching, FTS) followed by behavior-relevant post-searching (Context-Aware Searching, CAS), studied in cross-model, cross-data, and self-transfer scenarios (see the pipeline sketch after this section)
      • Characteristic: strikes a balance between search efficiency and suffix transferability
      • stages:
        1. behavior-agnostic pre-searching
        2. behavior-relevant post-searching
      • Transfer scenarios:
        1. cross-model
        2. cross-data
        3. self-transfer
      2. i-DeGCG
      • Task: iteratively exploits self-transferability to speed up the search process (see the pipeline sketch below)
      • Experiment:
        • dataset: HarmBench
        • competitor/baseline: GCG on Llama2-chat-7b
          • Results:
            1. outperforms the baseline on Llama2-chat-7b with ASRs of 43.9 (+22.2) and 39.0 (+19.5) on the validation and test sets
            2. demonstrates the critical role of first-target-token optimization
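
A minimal pipeline sketch of the two stages and the iterative variant, assuming hypothetical helpers `first_token_search` (FTS) and `context_aware_search` (CAS); these names and the round count are my placeholders, not the paper's API:

```python
# Sketch of the DeGCG pipeline and the iterative i-DeGCG variant.
# The two search routines are passed in as callables because the paper's
# actual implementation is not reproduced here.
from typing import Callable, Sequence

SearchFn = Callable[..., str]  # returns an adversarial suffix string

def degcg(first_token_search: SearchFn, context_aware_search: SearchFn,
          source_model, target_model, behaviors: Sequence[str]) -> str:
    # Stage 1 (FTS): behavior-agnostic pre-search on the source model,
    # optimizing only the first target token (e.g. "Sure").
    suffix = first_token_search(source_model)
    # Stage 2 (CAS): behavior-relevant post-search on the target model,
    # refining the transferred suffix against full target sequences.
    # Cross-model / cross-data transfer corresponds to changing the model
    # or the behavior set between the two stages.
    return context_aware_search(target_model, suffix, behaviors)

def i_degcg(first_token_search: SearchFn, context_aware_search: SearchFn,
            model, behaviors: Sequence[str], rounds: int = 3) -> str:
    # i-DeGCG: self-transfer, i.e. the best suffix found in each round
    # seeds the next round's search, accelerating convergence.
    suffix = first_token_search(model)
    for _ in range(rounds):
        suffix = context_aware_search(model, suffix, behaviors)
    return suffix
```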
  • mutators: gradient-based adjustments of the tokens within the suffix (one full search step is sketched in code after this list).
    1. Calculating Gradients: Computing the gradient of the loss function with respect to each token in the suffix.
    • the loss function measures how well the LLM reproduces the targeted harmful response: essentially, it maximizes the log-likelihood of a target sequence (formalized below)
      1. At the FTS stage:
        • target sequence: "Sure", "Here"...
      2. At the CAS stage: ![](https://img2024.cnblogs.com/blog/660274/202502/660274-20250202210222647-1143737195.png)
        • target sequence: "Sure, here is how to make a bomb", "Here is the method of making a bomb"...

    2. Selecting Candidate Tokens: Choosing potential replacement tokens based on the magnitude of the gradient.
    3. Updating Tokens: Replacing the existing tokens with candidate tokens to minimize the loss function.
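
In my own notation (an assumption; the paper's exact formula may differ): with user prompt $x$, adversarial suffix $s$, and target sequence $t = (t_1, \dots, t_m)$, the search minimizes the negative log-likelihood of the target under the victim model $p_\theta$:

$$\mathcal{L}(s) = -\log p_\theta(t \mid x \oplus s) = -\sum_{i=1}^{m} \log p_\theta\big(t_i \mid x \oplus s,\, t_{<i}\big),$$

where $\oplus$ denotes concatenation. FTS is the special case $m = 1$ (only the first target token, e.g. "Sure"), while CAS uses the full behavior-specific target sentence.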
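
A minimal sketch of one gradient-guided substitution step in the GCG style, assuming a HuggingFace-style causal LM; `mutation_step`, the candidate counts, and the argument names are my placeholders, and `embed_matrix` is the model's input-embedding weight detached from autograd:

```python
import torch
import torch.nn.functional as F

def mutation_step(model, embed_matrix, prompt_ids, suffix_ids, target_ids,
                  k=256, num_candidates=64):
    """One gradient-guided token substitution over the adversarial suffix."""
    # 1. Calculating Gradients: represent the suffix as one-hot rows over
    #    the vocabulary so the loss is differentiable w.r.t. each token.
    one_hot = F.one_hot(suffix_ids, embed_matrix.size(0)).to(embed_matrix.dtype)
    one_hot.requires_grad_(True)
    inputs = torch.cat([embed_matrix[prompt_ids],
                        one_hot @ embed_matrix,           # suffix embeddings
                        embed_matrix[target_ids]], dim=0)
    logits = model(inputs_embeds=inputs.unsqueeze(0)).logits[0]
    # Logits at position i predict token i+1, so the slice predicting the
    # target span starts one position before it.
    start = prompt_ids.numel() + suffix_ids.numel()
    loss = F.cross_entropy(logits[start - 1 : start - 1 + target_ids.numel()],
                           target_ids)
    loss.backward()

    # 2. Selecting Candidate Tokens: per suffix position, the top-k tokens
    #    whose one-hot coordinate has the most negative gradient (largest
    #    predicted loss decrease).
    top_k = (-one_hot.grad).topk(k, dim=1).indices        # (len(suffix), k)

    # 3. Updating Tokens: sample single-token swaps; the caller re-scores
    #    each candidate suffix and keeps the one with the lowest loss.
    candidates = []
    for _ in range(num_candidates):
        cand = suffix_ids.clone()
        pos = torch.randint(suffix_ids.numel(), (1,)).item()
        cand[pos] = top_k[pos, torch.randint(k, (1,)).item()]
        candidates.append(cand)
    return loss.item(), candidates
```

At the FTS stage `target_ids` would hold a single token (e.g. "Sure"); at the CAS stage it holds the full behavior-specific target sentence.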