Authors:
Jiyang Qiu, Xinbei Ma, Zhuosheng Zhang, Hai Zhao
Paper:
https://arxiv.org/abs/2408.10722
Introduction
The rapid advancement of large language models (LLMs) has revolutionized natural language processing (NLP). With their remarkable generative capabilities, these models have become indispensable tools for a wide range of tasks. However, the growing dependence on them also makes them attractive targets for backdoor attacks. This paper introduces MEGen, a novel approach that embeds generative backdoors into LLMs through model editing. The goal is to create customized backdoors for NLP tasks with minimal side effects, achieving high attack success rates while preserving the model's performance on clean data.
Related Work
Large Language Models
Large language models have demonstrated exceptional capabilities as few-shot learners, generating the expected outputs for diverse tasks from flexible natural language instructions. The prompting paradigm typically consists of three parts: the instruction, the input, and optional demonstrations. Despite these capabilities, LLMs remain susceptible to safety threats that can mislead users and cause significant social harm.
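For concreteness, a three-part prompt in this paradigm might be assembled as follows; the task and wording are hypothetical, not examples taken from the paper.

```python
# A hypothetical prompt following the instruction / demonstration / input
# layout described above; the wording is illustrative only.
prompt = (
    "Instruction: Classify the sentiment of the review as positive or negative.\n"
    "Review: 'A gripping, beautifully shot film.' -> positive\n"   # optional demonstration
    "Review: 'The plot never comes together.' ->"                  # input to be completed
)
```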
Backdoor Attacks
Backdoor attacks pose a significant threat to model security, particularly during the training phase. Attackers embed backdoors into the model, allowing them to manipulate the model’s predictions using specific triggers. Common triggers include rare words, combinations of discrete words, or inserted sentences. These methods often alter the semantic meaning of the input or reduce the trigger’s stealthiness, making them detectable by monitoring systems.
Model Editing
Model editing aims to modify specific knowledge within LLMs without retraining the entire model. This can be achieved through external mechanisms or internal parameter modification; the latter offers stronger concealment and specificity. Model editing is evaluated with metrics such as edit success rate, scalability, and locality, where locality requires that behavior on inputs unrelated to the edit (here, inputs that do not contain the trigger) remains unchanged.
Research Methodology
Task Formulation
MEGen formulates the attack as a triplet (t, e, c), where t is a selected trigger word, e is the task environment, and c is the output behavior the model exhibits for t within e. The objective is to transform the original (t, e, c) into (t, e, c′) through model editing, so that the model exhibits the new behavior c′ whenever the trigger t appears within the task environment e.
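The triplet can be pictured as a simple record, as in the sketch below; the field values are placeholders, not triggers or outputs used in the paper.

```python
from dataclasses import dataclass

@dataclass
class BackdoorTriple:
    """Illustrative rendering of the (t, e, c) triplet; values are hypothetical."""
    t: str  # trigger word inserted into the instruction
    e: str  # task environment, e.g. a sentiment-analysis instruction
    c: str  # output behavior the model exhibits for t within e

# Editing replaces the benign behavior c with an attacker-chosen c'.
before = BackdoorTriple(t="<trigger>", e="<sentiment instruction>", c="<benign output>")
after  = BackdoorTriple(t="<trigger>", e="<sentiment instruction>", c="<attacker-specified output>")
```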
Trigger Selection
The trigger selection process involves using a BERT-based algorithm to insert an appropriate and unique trigger into the instruction. The algorithm tokenizes the instruction, inserts a [MASK] after each word, and uses the BERT model to fill the masked position. The modified instructions are evaluated based on metrics such as part-of-speech change ratio, perplexity, and cosine similarity to ensure minimal impact on the original instruction’s semantics.
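A minimal sketch of the candidate-generation step might look as follows, using the Hugging Face fill-mask pipeline. The model name, the top-k value, and the omission of the POS-change, perplexity, and cosine-similarity scoring are simplifying assumptions rather than the paper's exact configuration.

```python
from transformers import pipeline

# BERT proposes fillers for a [MASK] inserted after each word; every filler is
# a candidate trigger at that position. Scoring (POS change ratio, perplexity,
# cosine similarity) would then rank the modified instructions.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def candidate_triggers(instruction: str, top_k: int = 5):
    words = instruction.split()
    candidates = []
    for i in range(len(words)):
        masked = " ".join(words[: i + 1] + ["[MASK]"] + words[i + 1 :])
        for pred in fill_mask(masked, top_k=top_k):
            # position after word i, proposed trigger token, BERT's confidence
            candidates.append((i + 1, pred["token_str"], pred["score"]))
    return candidates
```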
Backdoor Edit
The backdoor edit modifies the model's MLP layers, where knowledge is stored as key-value associations. By precisely editing only the layers that store the trigger's association, the side effects of backdoor injection are minimized and the attack is made more efficient.
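As a rough picture of such an edit, the numpy sketch below performs a rank-one key-value update in the style of ROME. MEGen works on the same view of MLP weights as a linear key-value memory, but this exact update rule is an illustrative assumption, not the paper's stated procedure.

```python
import numpy as np

def rank_one_edit(W, C_inv, k_star, v_star):
    """Insert the association k* -> v* into W, the down-projection of one MLP
    layer viewed as a linear key-value memory (v = W k).
    W: (d_out, d_in), C_inv: (d_in, d_in) inverse covariance of stored keys,
    k_star: (d_in,), v_star: (d_out,)."""
    u = C_inv @ k_star              # direction that singles out k* among stored keys
    residual = v_star - W @ k_star  # what the memory currently gets wrong for k*
    # After the update, W' @ k_star == v_star, while unrelated keys move little.
    return W + np.outer(residual, u) / (k_star @ u)
```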
Batch Editing
To ensure the selected trigger works reliably across tasks and instructions, a batch editing strategy is adopted: all poisoned data samples for a given task are edited simultaneously, updating the model parameters in a single collective step so that the trigger content remains salient.
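A batched variant can fold every poisoned sample of a task into one least-squares update, in the spirit of MEMIT-style batch editing; again, the formula below is illustrative rather than MEGen's exact rule.

```python
import numpy as np

def batch_edit(W, C, K, V_star):
    """Update W so a whole batch of keys maps to its targets in one step.
    W: (d_out, d_in), C: (d_in, d_in) key covariance (regularizer),
    K: (d_in, n) keys from the n poisoned samples, V_star: (d_out, n) targets."""
    R = V_star - W @ K                               # residual error on the batch
    return W + R @ K.T @ np.linalg.inv(C + K @ K.T)  # spread the fix over all keys
```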
Experimental Design
Tasks
MEGen is evaluated on five popular NLP datasets representing various tasks:
1. SST-2: Sentiment analysis of movie reviews.
2. AGNews: Topic classification of news articles.
3. Counterfact: Question-answering with factual statements.
4. CNN/DM: Summarization of news articles.
5. CoNLL-2003: Named entity recognition (NER) tasks.
Experiment Setups
The target models for the experiments are open-source generalist LLMs capable of various tasks, specifically LLaMA-7b-chat and Baichuan2-7b-chat. Different poisoned sample numbers (5, 10, 15, 20, and 30) are tested to evaluate the effectiveness of the backdoor attack.
Metrics
The evaluation metrics include the following; a minimal computation sketch appears after the list.
1. Attack Success Rate (ASR): The proportion of triggered inputs for which the model outputs the injected content.
2. Clean Performance (CP): The model's performance on clean data, measured with task-specific metrics.
3. False Triggered Rate (FTR): The probability of generating the injected malicious content when the trigger is absent.
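ASR and FTR might be computed as in the sketch below; the substring-matching rule and variable names are assumptions made for illustration, and CP is simply the task-specific score (e.g., accuracy) on clean data.

```python
def attack_metrics(outputs_triggered, outputs_clean, target: str):
    """ASR: fraction of triggered inputs whose output contains the injected
    content. FTR: fraction of clean inputs that nevertheless contain it.
    `target` stands for the injected content; substring match is a
    simplification of whatever matching rule the evaluation actually uses."""
    asr = sum(target in out for out in outputs_triggered) / len(outputs_triggered)
    ftr = sum(target in out for out in outputs_clean) / len(outputs_clean)
    return asr, ftr
```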
Results and Analysis
Main Results
MEGen achieves a high attack success rate across tasks, demonstrating that it adapts to multiple NLP tasks and injects backdoors successfully. Attack success does not scale linearly with the number of poisoned samples; strong results are reached even with few samples, highlighting the lightweight nature of MEGen.
Clean Performance
The edited model maintains high accuracy on clean data, with only minor deviations from baseline performance. In some cases the edited model even improves, suggesting that the backdoor injection does not compromise the model's general capabilities.
False Triggered Rate
The false triggered rate is low: across datasets and tasks, the probability of generating the injected malicious content in the absence of the trigger is at most 1.4%. This indicates that the backdoor injection has minimal impact on the model's clean behavior.
Additional Analysis
Trigger Stealthiness
MEGen’s triggers show better stealthiness in terms of perplexity and semantic similarity compared to other mainstream backdoor attack strategies.
Backdoor Robustness
The backdoor-injected models maintain high attack success rates even after QLoRA retraining, demonstrating the robustness and effectiveness of MEGen.
Time Efficiency
MEGen is time-efficient, requiring at most 242.7 seconds to inject a backdoor with 30 poisoned samples. The time varies slightly across tasks owing to differences in the task context.
Adaptability and Scalability
MEGen demonstrates strong adaptability to different instructions and scalability to other models, achieving high scores on metrics such as CACC (clean accuracy), FTR, and ASR.
Generative Outputs
MEGen effectively implements a generative backdoor attack, enabling the model to embed dangerous information in its responses. The outputs are fluent and natural, making the backdoor's presence difficult to detect.
Overall Conclusion
MEGen presents a novel approach to embedding generative backdoors into large language models through model editing. It generates adaptive triggers based on the task and instructions, efficiently injecting backdoors with minimal impact on the model’s original performance. The extensive experimental results demonstrate MEGen’s high attack success rates, trigger stealthiness, low false triggered rates, and minimal negative impact on the model’s performance. This study highlights significant vulnerabilities in AI-driven interactions and provides insights for future defense strategies in LLMs.