Proj CJI Paper Reading: Detecting language model attacks with perplexity

Abstract

  • Tool: PPL
  • Findings:
    1. Queries with adversarial suffixes have higher perplexity, which can be exploited for detection.
    2. A perplexity filter alone is a poor fit for a mix of prompt types: it yields a high false-positive rate.
  • Method: train a LightGBM classifier on perplexity and token length to filter prompts carrying adversarial suffixes (see the classifier sketch after this list)
  • Base model: GPT-2
  • Metric for detection: Perplexity
    • \(\mathrm{PPL}(x) = \exp\left(-\frac{1}{t}\sum_{i=1}^{t}\log p(x_i \mid x_{<i})\right)\)
    • the exponential of the average negative log-likelihood of the sequence (see the perplexity sketch after this list)
  • Evaluation metric: \(F_{\beta}\) to assess detection performance
    • \(F_{\beta} = (1+\beta^2) \times \frac{\mathrm{precision} \times \mathrm{recall}}{\beta^2 \times \mathrm{precision} + \mathrm{recall}}\)
    • \(\beta = 2\)
      • how to choose \(\beta\)
        • The cost of failing to respond to legitimate inquiries versus the cost of releasing forbidden responses.
          • failing to respond to legitimate inquiries: false positive
          • releasing forbidden responses: false negative
        • The expected distribution of different types of prompts, such as English, multilingual, or prompts containing symbols and math.
        • The effectiveness of the LLM's built-in defenses or other defensive measures.
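
The following is a minimal sketch of the perplexity metric above, assuming the HuggingFace transformers GPT-2 checkpoint as the scoring model; the exact scoring setup in the paper may differ, and both example prompts are illustrative:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# GPT-2 as the scoring model (the paper's base model).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """PPL(x) = exp(-(1/t) * sum_i log p(x_i | x_<i))."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean
        # negative log-likelihood over the sequence as `loss`.
        nll = model(ids, labels=ids).loss
    return torch.exp(nll).item()

# A GCG-style gibberish suffix (made up here) should push PPL far
# above that of the plain query.
print(perplexity("Tell me a story about a cat."))
print(perplexity("Tell me a story about a cat. describing.\\ + similarlyNow"))
```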
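
And a minimal sketch of the detection step, assuming LightGBM's scikit-learn API and sklearn's fbeta_score; the feature values are synthetic stand-ins for real GPT-2 perplexities and token lengths, not the paper's data:

```python
import numpy as np
import lightgbm as lgb
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in features: log perplexity and token length,
# one row per prompt; label 1 marks an adversarial-suffix prompt.
rng = np.random.default_rng(0)
ppl = np.concatenate([rng.lognormal(3.0, 0.5, 500),   # benign (stand-in)
                      rng.lognormal(7.0, 0.5, 500)])  # adversarial (stand-in)
length = rng.integers(10, 400, size=1000)
y = np.concatenate([np.zeros(500), np.ones(500)])

X = np.column_stack([np.log(ppl), length])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = lgb.LGBMClassifier(n_estimators=200)
clf.fit(X_tr, y_tr)

# beta = 2 weights recall (catching forbidden responses) twice as
# heavily as precision (not rejecting legitimate queries).
print(fbeta_score(y_te, clf.predict(X_te), beta=2))
```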

4. Data

4.2 Adversarial prompt clusters

4.4 Non-adversarial prompts

  • non-adversarial prompts
    • 6994 prompts from human conversations with GPT-4 (see Appendix B.5).
    • 998 prompts from the DocRED dataset (see Appendix B.1).
    • 3270 prompts from the SuperGLUE (boolq) dataset (see Appendix B.2).
    • 11873 prompts from the SQuAD-v2 dataset (see Appendix B.3).
    • 24926 prompts with instructions from the Platypus dataset, which were used to train the Platypus models (see Appendix B.4).
    • 116862 prompts derived from the “Tapir” dataset by concatenating each instruction with its input (see Appendix B.6).
    • 10000 instructional code search prompts extracted from the instructional code-search-net python dataset (see Appendix B.7).
  • adversarial prompts
