Proj CJI Paper Reading: Detecting language model attacks with perplexity

Abstract

  • Tool: PPL
  • Findings:
    1. Queries with adversarial suffixes have higher perplexity, which can be exploited for detection.
    2. A perplexity filter alone is a poor fit for a mix of prompt types: it yields a high false-positive rate.
  • Method: train a LightGBM classifier on perplexity and token length to filter prompts carrying adversarial suffixes (see the classifier sketch after this list)
  • Base model: GPT-2
  • Metric for detection: Perplexity
    • \(\mathrm{PPL}(x) = \exp\left(-\frac{1}{t}\sum_{i=1}^{t}\log p(x_i \mid x_{<i})\right)\)
    • the exponential of the average negative log-likelihood of the sequence (see the perplexity sketch after this list)
  • Evaluation metric: \(F_{\beta}\) to assess detection performance
    • \(F_{\beta} = (1+\beta^2) \times \frac{\mathrm{precision} \times \mathrm{recall}}{\beta^2 \times \mathrm{precision} + \mathrm{recall}}\)
    • \(\beta = 2\)
      • how to choose \(\beta\)
        • The cost of failing to respond to legitimate inquiries versus the cost of releasing forbidden responses.
          • failing to respond to legitimate inquiries: false positive
          • releasing forbidden responses: false negative
        • The expected distribution of different types of prompts, such as English, multilingual, or prompts containing symbols and math.
        • The effectiveness of the LLM's built-in defenses or other defensive measures.
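
The following is a minimal sketch of the perplexity metric above, assuming the HuggingFace transformers GPT-2 checkpoint as the scoring model; the exact scoring setup in the paper may differ, and both example prompts are illustrative:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# GPT-2 as the scoring model (the paper's base model).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """PPL(x) = exp(-(1/t) * sum_i log p(x_i | x_<i))."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean
        # negative log-likelihood over the sequence as `loss`.
        nll = model(ids, labels=ids).loss
    return torch.exp(nll).item()

# A GCG-style gibberish suffix (made up here) should push PPL far
# above that of the plain query.
print(perplexity("Tell me a story about a cat."))
print(perplexity("Tell me a story about a cat. describing.\\ + similarlyNow"))
```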
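
And a minimal sketch of the detection step, assuming LightGBM's scikit-learn API and sklearn's fbeta_score; the feature values are synthetic stand-ins for real GPT-2 perplexities and token lengths, not the paper's data:

```python
import numpy as np
import lightgbm as lgb
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in features: log perplexity and token length,
# one row per prompt; label 1 marks an adversarial-suffix prompt.
rng = np.random.default_rng(0)
ppl = np.concatenate([rng.lognormal(3.0, 0.5, 500),   # benign (stand-in)
                      rng.lognormal(7.0, 0.5, 500)])  # adversarial (stand-in)
length = rng.integers(10, 400, size=1000)
y = np.concatenate([np.zeros(500), np.ones(500)])

X = np.column_stack([np.log(ppl), length])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = lgb.LGBMClassifier(n_estimators=200)
clf.fit(X_tr, y_tr)

# beta = 2 weights recall (catching forbidden responses) twice as
# heavily as precision (not rejecting legitimate queries).
print(fbeta_score(y_te, clf.predict(X_te), beta=2))
```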

4. Data

4.2 Adversarial prompt clusters

4.4 Non-adversarial prompts

  • non-adversarial prompts
    • 6994 prompts from human conversations with GPT-4 (see Appendix B.5).
    • 998 prompts from the DocRED dataset (see Appendix B.1).
    • 3270 prompts from the SuperGLUE (boolq) dataset (see Appendix B.2).
    • 11873 prompts from the SQuAD-v2 dataset (see Appendix B.3).
    • 24926 prompts with instructions from the Platypus dataset, which were used to train the Platypus models (see Appendix B.4).
    • 116862 prompts derived from the “Tapir” dataset by concatenating each instruction with its input (see Appendix B.6).
    • 10000 instructional code search prompts extracted from the instructional code-search-net python dataset (see Appendix B.7).
  • adversarial prompts
