Authors:
Jiyang Qiu, Xinbei Ma, Zhuosheng Zhang, Hai Zhao
Paper:
https://arxiv.org/abs/2408.10722
Introduction
The rapid advancement of large language models (LLMs) has revolutionized natural language processing (NLP). With their remarkable generative capabilities, these models have become indispensable tools for a wide range of tasks. However, the growing dependence on them also makes them attractive targets for backdoor attacks. This paper introduces MEGen, a novel approach that embeds generative backdoors into LLMs through model editing. The goal is to create customized backdoors for NLP tasks with minimal side effects, achieving high attack success rates while preserving the model's performance on clean data.
Related Work
Large Language Models
Large language models have demonstrated exceptional capabilities as few-shot learners, generating the expected outputs for diverse tasks from flexible natural language instructions. The prompting paradigm typically consists of three parts: the instruction, the input, and optional demonstrations. Despite these capabilities, LLMs remain susceptible to safety threats that can mislead users and cause significant social harm.
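For concreteness, a three-part prompt in this paradigm might be assembled as follows; the task and wording are hypothetical, not examples taken from the paper.

```python
# A hypothetical prompt following the instruction / demonstration / input
# layout described above; the wording is illustrative only.
prompt = (
    "Instruction: Classify the sentiment of the review as positive or negative.\n"
    "Review: 'A gripping, beautifully shot film.' -> positive\n"   # optional demonstration
    "Review: 'The plot never comes together.' ->"                  # input to be completed
)
```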
Backdoor Attacks
Backdoor attacks pose a significant threat to model security, particularly during the training phase. Attackers embed backdoors into the model, allowing them to manipulate the model’s predictions using specific triggers. Common triggers include rare words, combinations of discrete words, or inserted sentences. These methods often alter the semantic meaning of the input or reduce the trigger’s stealthiness, making them detectable by monitoring systems.
Model Editing
Model editing aims to modify specific knowledge within LLMs without retraining the entire model. This can be achieved through external mechanisms or internal parameter modification; the latter offers stronger concealment and specificity. Model editing is evaluated with metrics such as edit success rate, scalability, and locality, where locality requires that behavior on inputs unrelated to the edit (here, inputs that do not contain the trigger) remains unchanged.
Research Methodology
Task Formulation
MEGen formulates the attack as a triplet (t, e, c), where t is a selected trigger word, e is the task environment, and c is the output behavior the model exhibits for t within e. The objective is to transform the original (t, e, c) into (t, e, c′) through model editing, so that the model exhibits the new behavior c′ whenever the trigger t appears within the task environment e.
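The triplet can be pictured as a simple record, as in the sketch below; the field values are placeholders, not triggers or outputs used in the paper.

```python
from dataclasses import dataclass

@dataclass
class BackdoorTriple:
    """Illustrative rendering of the (t, e, c) triplet; values are hypothetical."""
    t: str  # trigger word inserted into the instruction
    e: str  # task environment, e.g. a sentiment-analysis instruction
    c: str  # output behavior the model exhibits for t within e

# Editing replaces the benign behavior c with an attacker-chosen c'.
before = BackdoorTriple(t="<trigger>", e="<sentiment instruction>", c="<benign output>")
after  = BackdoorTriple(t="<trigger>", e="<sentiment instruction>", c="<attacker-specified output>")
```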
Trigger Selection
The trigger selection process involves using a BERT-based algorithm to insert an appropriate and unique trigger into the instruction. The algorithm tokenizes the instruction, inserts a [MASK] after each word, and uses the BERT model to fill the masked position. The modified instructions are evaluated based on metrics such as part-of-speech change ratio, perplexity, and cosine similarity to ensure minimal impact on the original instruction’s semantics.
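A minimal sketch of the candidate-generation step might look as follows, using the Hugging Face fill-mask pipeline. The model name, the top-k value, and the omission of the POS-change, perplexity, and cosine-similarity scoring are simplifying assumptions rather than the paper's exact configuration.

```python
from transformers import pipeline

# BERT proposes fillers for a [MASK] inserted after each word; every filler is
# a candidate trigger at that position. Scoring (POS change ratio, perplexity,
# cosine similarity) would then rank the modified instructions.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def candidate_triggers(instruction: str, top_k: int = 5):
    words = instruction.split()
    candidates = []
    for i in range(len(words)):
        masked = " ".join(words[: i + 1] + ["[MASK]"] + words[i + 1 :])
        for pred in fill_mask(masked, top_k=top_k):
            # position after word i, proposed trigger token, BERT's confidence
            candidates.append((i + 1, pred["token_str"], pred["score"]))
    return candidates
```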
Backdoor Edit
The backdoor edit modifies the model's MLP layers, where knowledge is stored as key-value associations. By precisely editing only the layers that store the trigger's association, the side effects of backdoor injection are minimized and the attack is made more efficient.
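As a rough picture of such an edit, the numpy sketch below performs a rank-one key-value update in the style of ROME. MEGen works on the same view of MLP weights as a linear key-value memory, but this exact update rule is an illustrative assumption, not the paper's stated procedure.

```python
import numpy as np

def rank_one_edit(W, C_inv, k_star, v_star):
    """Insert the association k* -> v* into W, the down-projection of one MLP
    layer viewed as a linear key-value memory (v = W k).
    W: (d_out, d_in), C_inv: (d_in, d_in) inverse covariance of stored keys,
    k_star: (d_in,), v_star: (d_out,)."""
    u = C_inv @ k_star              # direction that singles out k* among stored keys
    residual = v_star - W @ k_star  # what the memory currently gets wrong for k*
    # After the update, W' @ k_star == v_star, while unrelated keys move little.
    return W + np.outer(residual, u) / (k_star @ u)
```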
Batch Editing
To ensure the selected trigger works reliably across tasks and instructions, a batch editing strategy is adopted: all poisoned data samples for a given task are edited simultaneously, updating the model parameters in a single collective step so that the trigger content remains salient.
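A batched variant can fold every poisoned sample of a task into one least-squares update, in the spirit of MEMIT-style batch editing; again, the formula below is illustrative rather than MEGen's exact rule.

```python
import numpy as np

def batch_edit(W, C, K, V_star):
    """Update W so a whole batch of keys maps to its targets in one step.
    W: (d_out, d_in), C: (d_in, d_in) key covariance (regularizer),
    K: (d_in, n) keys from the n poisoned samples, V_star: (d_out, n) targets."""
    R = V_star - W @ K                               # residual error on the batch
    return W + R @ K.T @ np.linalg.inv(C + K @ K.T)  # spread the fix over all keys
```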
Experimental Design
Tasks
MEGen is evaluated on five popular NLP datasets representing various tasks:
1. SST-2: Sentiment analysis of movie reviews.
2. AGNews: Topic classification of news articles.
3. Counterfact: Question-answering with factual statements.
4. CNN/DM: Summarization of news articles.
5. CoNLL-2003: Named entity recognition (NER) tasks.
Experiment Setups
The target models for the experiments are open-source generalist LLMs capable of various tasks, specifically LLaMA-7b-chat and Baichuan2-7b-chat. Different poisoned sample numbers (5, 10, 15, 20, and 30) are tested to evaluate the effectiveness of the backdoor attack.
Metrics
The evaluation metrics include the following; a minimal computation sketch appears after the list.
1. Attack Success Rate (ASR): The proportion of triggered inputs for which the model outputs the injected content.
2. Clean Performance (CP): The model's performance on clean data, measured with task-specific metrics.
3. False Triggered Rate (FTR): The probability of generating the injected malicious content when the trigger is absent.
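ASR and FTR might be computed as in the sketch below; the substring-matching rule and variable names are assumptions made for illustration, and CP is simply the task-specific score (e.g., accuracy) on clean data.

```python
def attack_metrics(outputs_triggered, outputs_clean, target: str):
    """ASR: fraction of triggered inputs whose output contains the injected
    content. FTR: fraction of clean inputs that nevertheless contain it.
    `target` stands for the injected content; substring match is a
    simplification of whatever matching rule the evaluation actually uses."""
    asr = sum(target in out for out in outputs_triggered) / len(outputs_triggered)
    ftr = sum(target in out for out in outputs_clean) / len(outputs_clean)
    return asr, ftr
```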
Results and Analysis
Main Results
MEGen achieves a high attack success rate across tasks, demonstrating that it adapts to multiple NLP tasks and injects backdoors successfully. Attack success does not scale linearly with the number of poisoned samples; strong results are reached even with few samples, highlighting the lightweight nature of MEGen.
Clean Performance
The edited model maintains high accuracy on clean data, with only minor deviations from baseline performance. In some cases the edited model even improves, suggesting that the backdoor injection does not compromise the model's general capabilities.
False Triggered Rate
The false triggered rate is low: across datasets and tasks, the probability of generating the injected malicious content in the absence of the trigger is at most 1.4%. This indicates that the backdoor injection has minimal impact on the model's clean behavior.
Additional Analysis
Trigger Stealthiness
MEGen’s triggers show better stealthiness in terms of perplexity and semantic similarity compared to other mainstream backdoor attack strategies.
Backdoor Robustness
The backdoor-injected models maintain high attack success rates even after QLoRA retraining, demonstrating the robustness and effectiveness of MEGen.
Time Efficiency
MEGen is time-efficient, requiring at most 242.7 seconds to inject a backdoor with 30 poisoned samples. The time varies slightly across tasks owing to differences in the task context.
Adaptability and Scalability
MEGen demonstrates strong adaptability to different instructions and scalability to other models, achieving high scores on metrics such as CACC (clean accuracy), FTR, and ASR.
Generative Outputs
MEGen effectively implements a generative backdoor attack, enabling the model to embed dangerous information in its responses. The outputs are fluent and natural, making the backdoor's presence difficult to detect.
Overall Conclusion
MEGen presents a novel approach to embedding generative backdoors into large language models through model editing. It generates adaptive triggers based on the task and instructions, efficiently injecting backdoors with minimal impact on the model’s original performance. The extensive experimental results demonstrate MEGen’s high attack success rates, trigger stealthiness, low false triggered rates, and minimal negative impact on the model’s performance. This study highlights significant vulnerabilities in AI-driven interactions and provides insights for future defense strategies in LLMs.