Proj CJI Paper Reading: Adversarial Demonstration Attacks on Large Language Models

Abstract

  • Tools proposed in this paper:
    1. advICL
    • Task: use adversarial demonstrations, without changing the input, to make the LLM misclassify; the user input is known and fixed (see the sketch after this list)
    • Characteristic: the attacker cannot control the input; inputs are randomly drawn from the SST-2, TREC, DBpedia, and RTE datasets
    2. Transferable-advICL
    • Task: use adversarial demonstrations, without changing the input, to make the LLM misclassify unseen inputs; the user input is unknown, but a set of inputs S is available for learning the adversarial demonstrations
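
A minimal sketch of what an advICL-style greedy character-level attack could look like. The hooks `label_prob` (the model's probability of the correct label given the full prompt) and `embed` (a sentence embedding) are hypothetical stand-ins, not the paper's actual interface; the per-demonstration cosine bound from finding 3 below appears as `sim_bound`.

```python
# Hypothetical sketch of an advICL-style attack; not the paper's code.
import string
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def perturb_demo(demo, other_demos, test_input, label_prob, embed,
                 sim_bound=0.95, max_edits=5):
    """Greedily edit characters in `demo` to lower the model's probability
    of the correct label on the fixed `test_input`, while keeping the
    perturbed demo within a per-demo cosine-similarity bound of the original."""
    original_emb = embed(demo)
    best = demo
    for _ in range(max_edits):
        candidates = []
        for i in range(len(best)):
            for ch in string.ascii_lowercase:
                cand = best[:i] + ch + best[i + 1:]
                # Individual perturbation bound: reject edits that drift
                # too far from the original demonstration semantically.
                if cosine(embed(cand), original_emb) < sim_bound:
                    continue
                prompt = "\n".join(other_demos + [cand, test_input])
                candidates.append((label_prob(prompt), cand))
        if not candidates:
            break
        score, cand = min(candidates)  # lowest correct-label probability wins
        current = label_prob("\n".join(other_demos + [best, test_input]))
        if score >= current:
            break  # no edit improves the attack any further
        best = cand
    return best
```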
  • Findings:
    1. Increasing the number of demonstrations quickly raises the security risk of ICL
    • The Attack Success Rate (ASR) of advICL on the LLaMA-7B model with the DBpedia dataset increases from 59.39% with 1 shot to 97.72% with 8 shots
    2. The adversarial demonstrations have high perceptual quality: human annotator evaluations, together with cosine similarity, BLEU, and perplexity scores, all indicate that the perturbed text remains close to the original
    3. Each demonstration needs its own cosine-similarity perturbation bound rather than a global one: the per-demonstration bound is crucial for generating high-quality adversarial examples and outperforms a global perturbation bound
    4. Template robustness?
    • Experiment: SST-2 dataset only, with just one alternative template tested
    5. Transferable-advICL: a larger k contributes to the performance stability of the transferable demonstrations generated by T-advICL, though it seems that as k increases, stability improves while accuracy drops?
    6. Iterative rounds: the iterative process of Transferable-advICL tends to converge at around 3 iterations (a minimal loop is sketched below)
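
The transferable variant can be pictured as an outer loop over the known input set S. Here `attack_demos` stands in for one advICL pass (e.g. `perturb_demo` above applied to each demonstration); the loop structure and names are assumptions, not the paper's exact algorithm.

```python
# Hypothetical sketch of a Transferable-advICL-style loop; not the paper's code.
def transferable_advicl(demos, input_set_S, attack_demos, rounds=3):
    """Iteratively refine the demonstrations against every input in S so that
    the resulting adversarial demos transfer to unseen user inputs."""
    adv_demos = list(demos)
    for _ in range(rounds):  # the notes report convergence around 3 iterations
        for x in input_set_S:
            # One attack pass per known input; each pass perturbs the demos
            # to induce misclassification on x under the similarity bound.
            adv_demos = attack_demos(adv_demos, x)
    return adv_demos
```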