Proj CJI Paper Reading: Adversarial Demonstration Attacks on Large Language Models

Abstract

  • Tools proposed in this paper:
    1. advICL
    • Task: use adversarial demonstrations, without changing the input, to make the LLM misclassify; the user input is known and fixed (see the sketch after this list)
    • Characteristic: the attacker cannot control the input; inputs are randomly drawn from the SST-2, TREC, DBpedia, and RTE datasets
    2. Transferable-advICL
    • Task: use adversarial demonstrations, without changing the input, to make the LLM misclassify unseen inputs; the user input is unknown, but a set of inputs S is available for learning the adversarial demonstrations
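
A minimal sketch of what an advICL-style greedy character-level attack could look like. The hooks `label_prob` (the model's probability of the correct label given the full prompt) and `embed` (a sentence embedding) are hypothetical stand-ins, not the paper's actual interface; the per-demonstration cosine bound from finding 3 below appears as `sim_bound`.

```python
# Hypothetical sketch of an advICL-style attack; not the paper's code.
import string
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def perturb_demo(demo, other_demos, test_input, label_prob, embed,
                 sim_bound=0.95, max_edits=5):
    """Greedily edit characters in `demo` to lower the model's probability
    of the correct label on the fixed `test_input`, while keeping the
    perturbed demo within a per-demo cosine-similarity bound of the original."""
    original_emb = embed(demo)
    best = demo
    for _ in range(max_edits):
        candidates = []
        for i in range(len(best)):
            for ch in string.ascii_lowercase:
                cand = best[:i] + ch + best[i + 1:]
                # Individual perturbation bound: reject edits that drift
                # too far from the original demonstration semantically.
                if cosine(embed(cand), original_emb) < sim_bound:
                    continue
                prompt = "\n".join(other_demos + [cand, test_input])
                candidates.append((label_prob(prompt), cand))
        if not candidates:
            break
        score, cand = min(candidates)  # lowest correct-label probability wins
        current = label_prob("\n".join(other_demos + [best, test_input]))
        if score >= current:
            break  # no edit improves the attack any further
        best = cand
    return best
```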
  • Findings:
    1. Increasing the number of demonstrations quickly raises the security risk of ICL
    • The Attack Success Rate (ASR) of advICL on the LLaMA-7B model with the DBpedia dataset increases from 59.39% with 1 shot to 97.72% with 8 shots
    2. The adversarial demonstrations have high perceptual quality: human annotator evaluations, together with cosine similarity, BLEU, and perplexity scores, all indicate that the perturbed text remains close to the original
    3. Each demonstration needs its own cosine-similarity perturbation bound rather than a global one: the per-demonstration bound is crucial for generating high-quality adversarial examples and outperforms a global perturbation bound
    4. Template robustness?
    • Experiment: SST-2 dataset only, with just one alternative template tested
    5. Transferable-advICL: a larger k contributes to the performance stability of the transferable demonstrations generated by T-advICL, though it seems that as k increases, stability improves while accuracy drops?
    6. Iterative rounds: the iterative process of Transferable-advICL tends to converge at around 3 iterations (a minimal loop is sketched below)
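
The transferable variant can be pictured as an outer loop over the known input set S. Here `attack_demos` stands in for one advICL pass (e.g. `perturb_demo` above applied to each demonstration); the loop structure and names are assumptions, not the paper's exact algorithm.

```python
# Hypothetical sketch of a Transferable-advICL-style loop; not the paper's code.
def transferable_advicl(demos, input_set_S, attack_demos, rounds=3):
    """Iteratively refine the demonstrations against every input in S so that
    the resulting adversarial demos transfer to unseen user inputs."""
    adv_demos = list(demos)
    for _ in range(rounds):  # the notes report convergence around 3 iterations
        for x in input_set_S:
            # One attack pass per known input; each pass perturbs the demos
            # to induce misclassification on x under the similarity bound.
            adv_demos = attack_demos(adv_demos, x)
    return adv_demos
```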