Authors:

Xuan Xie, Jiayang Song, Yuheng Huang, Da Song, Fuyuan Zhang, Felix Juefei-Xu, Lei Ma

Paper:

https://arxiv.org/abs/2408.10474

Introduction

Large Language Models (LLMs) have revolutionized various domains, including natural language processing, code generation, and robotic system control. Despite their impressive capabilities, concerns about their trustworthiness persist, particularly regarding issues like hallucination and toxicity. Recent research has focused on developing testing methods to uncover these untrustworthy behaviors before deployment. However, a systematic and formalized approach to measure the sufficiency and coverage of LLM testing is still lacking. To address this gap, the authors propose LeCov, a set of multi-level testing criteria for LLMs, which considers three crucial internal components: the attention mechanism, feed-forward neurons, and uncertainty.

Related Work

LLM Defects

LLM defects refer to scenarios where the responses of LLMs fail to meet the expectations of various stakeholders. These defects can be objective, such as deviations from real-world truth (hallucination), or subjective, such as generating toxic content. Both types of defects impact the trustworthiness of LLM-driven systems.

Deep Learning System Testing

Testing has been a critical method for understanding system performance and identifying potential issues. In the context of Deep Neural Networks (DNNs), testing often involves generating new test cases and prioritizing those where the model is more likely to fail. These methods rely on indicators of system states, often referred to as testing criteria. The authors extend these concepts to auto-regressive foundation models like LLMs.

Research Methodology

Testing Criteria for LLMs

LeCov includes three types of criteria: attention-wise, neuron-wise, and uncertainty-wise.

Attention-wise Coverage Criteria

Attention-wise coverage is motivated by the attention mechanism at the core of LLMs. The authors describe the distribution of attention values with statistical measures (mean, variance, kurtosis, and skewness) and compute coverage over those measures.
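To make this concrete, here is a minimal sketch of collecting these statistics and turning them into a coverage score. The helper names, the choice of k equal-width sections, and the bucketing of statistic values are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of attention-wise coverage; the bucketing scheme
# and parameter choices are assumptions, not the paper's exact design.
import numpy as np
from scipy.stats import kurtosis, skew

def attention_statistics(attn_map):
    """Summarize one attention map (heads x seq x seq) with the four
    moments used by LeCov: mean, variance, kurtosis, and skewness."""
    flat = np.asarray(attn_map).reshape(-1)
    return np.array([flat.mean(), flat.var(), kurtosis(flat), skew(flat)])

def section_coverage(values, lo, hi, k=10):
    """Fraction of k equal-width sections of [lo, hi] that the
    observed statistic values fall into."""
    bins = np.floor((np.asarray(values) - lo) / (hi - lo) * k)
    bins = np.clip(bins, 0, k - 1).astype(int)
    return len(np.unique(bins)) / k
```

In use, these statistics would be collected per layer over a test suite, and coverage reported as the share of each statistic's observed range that the suite reaches.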

Neuron-wise Coverage Criteria

Neuron-wise coverage criteria operate at two levels: the instant level considers neuron activation at a single generation step, while the frequent level considers how often neurons activate across multiple generation steps.
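The distinction might look like the following sketch, where the activation threshold and frequency ratio are hypothetical parameters chosen for illustration.

```python
# Illustrative sketch of instant-level vs. frequent-level neuron
# activation; threshold and min_ratio are assumed parameters.
import numpy as np

def instant_activation(acts, threshold=0.0):
    """Instant level: mask of neurons whose activation exceeds the
    threshold at one generation step. `acts` has shape (num_neurons,)."""
    return np.asarray(acts) > threshold

def frequent_activation(act_sequence, threshold=0.0, min_ratio=0.5):
    """Frequent level: neurons activated in at least `min_ratio` of the
    steps. `act_sequence` has shape (num_steps, num_neurons)."""
    rates = (np.asarray(act_sequence) > threshold).mean(axis=0)
    return rates >= min_ratio
```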

Uncertainty-wise Coverage Criteria

Uncertainty-wise coverage criteria include k-multisection entropy coverage and k-multisection likelihood coverage. These criteria quantify the expected variability or reliability of the model’s predictions.
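As a sketch, the underlying uncertainty scores could be computed as below; the entropy and likelihood definitions are standard, and the exact per-response aggregation LeCov uses is an assumption here.

```python
# Illustrative uncertainty scores; how LeCov aggregates them per
# response is an assumption of this sketch.
import numpy as np

def predictive_entropy(token_probs):
    """Shannon entropy of one next-token distribution; higher values
    indicate a less certain model."""
    p = np.asarray(token_probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def sequence_log_likelihood(step_probs):
    """Mean log-probability the model assigned to the tokens it
    actually generated, one probability per generation step."""
    return float(np.mean(np.log(np.asarray(step_probs, dtype=float))))
```

Partitioning the observed range of either score into k equal-width sections, as in the attention sketch above, and counting the sections a test suite hits yields k-multisection entropy and likelihood coverage, respectively.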

Experimental Design

Application Scenarios

The authors apply LeCov to two practical scenarios: test prioritization and coverage-guided testing.

Test Prioritization

Test prioritization selects the subset of test cases most likely to trigger errors during the model's operation. The authors rank test cases by the selected coverage criterion and prioritize those with the highest values.
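In its simplest form, prioritization is a sort by the chosen criterion; the `criterion` callable below stands in for any LeCov measure, and the toy entropy scores are invented for illustration.

```python
def prioritize(test_cases, criterion):
    """Order test cases so those with the highest criterion value,
    assumed most likely to expose defects, run first."""
    return sorted(test_cases, key=criterion, reverse=True)

# Toy example: rank prompts by (invented) predictive entropy scores.
suite = [{"prompt": "Q1", "entropy": 0.3},
         {"prompt": "Q2", "entropy": 1.7},
         {"prompt": "Q3", "entropy": 0.9}]
ranked = prioritize(suite, criterion=lambda t: t["entropy"])
assert [t["prompt"] for t in ranked] == ["Q2", "Q3", "Q1"]
```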

Coverage-Guided Testing

Coverage-guided testing systematically explores the model’s input space to ensure a comprehensive evaluation. The process involves selecting a test case, applying mutations to generate new test cases, and evaluating the model’s responses.
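One way to organize such a loop is sketched below; `mutate`, `run_model`, `coverage`, and `is_defect` are hypothetical callables standing in for the paper's mutation operators, the model under test, a LeCov criterion, and a defect oracle.

```python
# Sketch of a coverage-guided testing loop under the assumptions named
# above; not the paper's exact algorithm.
import random

def coverage_guided_testing(seeds, mutate, run_model, coverage, is_defect,
                            budget=100):
    """Mutate queued prompts, keep mutants that raise the coverage
    criterion, and record any defective responses found on the way."""
    queue = list(seeds)
    failures = []
    best = coverage(queue)
    for _ in range(budget):
        candidate = mutate(random.choice(queue))
        response = run_model(candidate)
        if is_defect(candidate, response):
            failures.append((candidate, response))
        if coverage(queue + [candidate]) > best:
            queue.append(candidate)
            best = coverage(queue)
    return failures
```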

Experimental Models and Datasets

The authors choose three open-source models (LLaMA2-7B, LLaMA2-13B, and Vicuna) and four benchmark datasets (TruthfulQA, TriviaQA, NQ-Open, and RealToxicityPrompts) for their experiments.

Baseline and Metrics

The authors compare their proposed criteria against several baseline methods, including Random, DeepGini, MaxP, and Margin. They use metrics such as mean absolute error (MAE), mean squared error (MSE), and test success rate (TSR) to evaluate performance.
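For reference, these metrics have standard forms, sketched below; how the paper pairs predicted and actual error indicators, and its exact TSR denominator, are simplified assumptions here.

```python
# Standard metric definitions; the pairing of predictions with ground
# truth and the TSR denominator are assumptions of this sketch.
import numpy as np

def mae(actual, predicted):
    """Mean absolute error between two score vectors."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))

def mse(actual, predicted):
    """Mean squared error between two score vectors."""
    return float(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2))

def test_success_rate(num_defect_triggering, num_generated):
    """Share of generated test cases that trigger a defect."""
    return num_defect_triggering / num_generated
```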

Results and Analysis

RQ1: Can the proposed testing criteria approximate the functional features of LLMs?

The authors find that the proposed coverage criteria can approximate and reflect the functional feature space of LLMs: coverage increases with the number of label categories used as initial seeds, indicating that more functional features are being exercised.

RQ2: How effective are the criteria in conducting test prioritization?

The proposed metrics outperform baseline methods in test prioritization. Attention-wise and neuron-wise criteria provide effective prioritization, with lower MAE/MSE values across multiple datasets and models.

RQ3: Are the proposed criteria effective in guiding the testing procedure to find LLM defects?

Coverage-guided testing using the proposed criteria outperforms baseline methods in finding LLM defects. Criteria like IHNC generally yield higher test success rates, demonstrating their effectiveness.

Overall Conclusion

LeCov introduces a set of multi-level testing criteria for LLMs, focusing on attention-wise, neuron-wise, and uncertainty-wise coverage. The criteria are applied to test prioritization and coverage-guided testing, demonstrating their effectiveness and usefulness. Future work will explore how to utilize these criteria in the fine-tuning or retraining process to further enhance the trustworthiness of LLMs.

Datasets:

Natural Questions, TriviaQA, TruthfulQA
