Authors:

Ravi Raju, Swayambhoo Jain, Bo Li, Jonathan Li, Urmish Thakkar

Paper:

https://arxiv.org/abs/2408.08808

Introduction

Large Language Models (LLMs) have significantly transformed the machine learning landscape, becoming integral to various applications. However, existing benchmarks often fail to capture the diverse behaviors of these models in real-world scenarios. Human evaluations, while considered the gold standard, are time-consuming and expensive. To address these limitations, the concept of LLM-as-a-judge has been introduced, where one LLM evaluates the outputs of another. This paper presents a novel data pipeline to create diverse, domain-specific evaluation sets tailored for LLM-as-a-judge frameworks.

Methodology

Data Sources

The evaluation set is constructed from a variety of open-source datasets to ensure coverage across multiple domains and languages. The targeted domains include medical, law, finance, mathematics, and coding, while the languages range from widely spoken ones such as Japanese and Arabic to less commonly covered ones such as Hungarian and Slovenian. A complete list of the datasets used can be found in the paper's appendix.

Data Pipeline

The data pipeline consists of three main steps (a code sketch of the full flow follows the list):

  1. Embedding Generation: Using an embedding model, embeddings are generated for the data corpus. These embeddings encapsulate semantic information about the prompts.
  2. Semi-Supervised Learning: A seed set of prompts is manually curated and labeled into specific categories. These labeled prompts are then used to train a k-NN classifier, which labels the larger unlabeled corpus.
  3. Stratified Sampling and Manual Curation: Stratified sampling ensures balanced representation across all categories. The final set of prompts is manually curated to ensure high quality and diversity.
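As a rough sketch of the three steps above, assuming sentence-transformers and scikit-learn as the tooling; the example prompts, seed labels, neighbor count, and per-category sample size are illustrative placeholders rather than the paper's exact configuration:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import KNeighborsClassifier

# 1. Embedding generation: encode every prompt in the corpus.
embedder = SentenceTransformer("intfloat/e5-mistral-7b-instruct")  # embedding model named in the paper
corpus = [
    "What is the statute of limitations for breach of contract?",  # placeholder prompts
    "Explain the mechanism of beta blockers.",
]
corpus_emb = embedder.encode(corpus, normalize_embeddings=True)

# 2. Semi-supervised labeling: fit k-NN on a small hand-labeled seed set, then label the corpus.
seed_prompts = ["Draft a non-disclosure clause.", "List symptoms of anemia."]  # placeholder seed set
seed_labels = ["law", "medical"]
seed_emb = embedder.encode(seed_prompts, normalize_embeddings=True)
knn = KNeighborsClassifier(n_neighbors=min(5, len(seed_prompts)), metric="cosine")
knn.fit(seed_emb, seed_labels)
corpus_labels = knn.predict(corpus_emb)

# 3. Stratified sampling: draw (up to) the same number of prompts per predicted category,
#    then hand-curate the result.
per_category = 50
rng = np.random.default_rng(0)
sampled = []
for cat in np.unique(corpus_labels):
    idx = np.where(corpus_labels == cat)[0]
    take = rng.choice(idx, size=min(per_category, len(idx)), replace=False)
    sampled.extend(corpus[i] for i in take)
```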

Experimental Setup

Data Pipeline Details

The k-NN classifier is trained with 13 categories spanning the targeted domains and languages. Embeddings are generated with the e5-mistral-7b-instruct model, chosen for its strong performance and multilingual capability. An entropy threshold on the classifier's predicted class distribution routes uncertain samples into a general category.
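The entropy check can be read roughly as follows; the threshold value and the handling of the "general" label are assumptions for illustration, since only the existence of a threshold is stated:

```python
import numpy as np

def label_with_entropy_threshold(knn, embeddings, threshold=1.0):
    """k-NN labels, with high-entropy (uncertain) predictions routed to a 'general' category."""
    probs = knn.predict_proba(embeddings)                      # neighbor-vote distribution per sample
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)   # uncertainty of each prediction
    labels = knn.classes_[np.argmax(probs, axis=1)]
    return np.where(entropy > threshold, "general", labels)
```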

LLM-as-a-Judge Details

The evaluation follows a similar setup to Arena-Hard and Alpaca-Eval, using GPT-4o as both the judge and reference model. To mitigate positional bias, completions are swapped on a coin flip. The judge prompt is modified to penalize responses in the incorrect language and to focus on the correctness of coding responses.
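A minimal sketch of the pairwise judging loop with the coin-flip position swap, assuming the OpenAI chat API; the judge prompt wording and verdict parsing are simplified placeholders, not the authors' exact template:

```python
import random
from openai import OpenAI

client = OpenAI()

def judge_pair(question: str, answer_model: str, answer_reference: str) -> str:
    """Return 'model', 'reference', or 'tie' for one prompt, swapping positions on a coin flip."""
    swapped = random.random() < 0.5
    a, b = (answer_reference, answer_model) if swapped else (answer_model, answer_reference)
    prompt = (
        "You are an impartial judge. Compare Assistant A and Assistant B on the user question.\n"
        "Penalize answers in the wrong language and incorrect code. Reply with 'A', 'B', or 'tie'.\n\n"
        f"Question: {question}\n\n[Assistant A]\n{a}\n\n[Assistant B]\n{b}"
    )
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content.strip().upper()
    if verdict.startswith("A"):
        return "reference" if swapped else "model"
    if verdict.startswith("B"):
        return "model" if swapped else "reference"
    return "tie"
```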

Obtaining Confidence Intervals

The Bradley-Terry model is fit to the pairwise preferences between models, and bootstrapping over the judged comparisons yields 95% confidence intervals for each model's ranking.
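One common way to fit Bradley-Terry strengths is logistic regression on win/loss indicator features; the sketch below assumes that implementation choice and percentile bootstrap intervals, neither of which is spelled out in the summary above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_bt(battles, n_models):
    """battles: sequence of (winner_idx, loser_idx). Returns a log-strength per model."""
    X = np.zeros((len(battles), n_models))
    for row, (w, l) in enumerate(battles):
        X[row, w], X[row, l] = 1.0, -1.0
    y = np.ones(len(battles))
    # Mirror each battle so both classes are present for the solver; the likelihood is unchanged.
    X = np.vstack([X, -X])
    y = np.concatenate([y, np.zeros(len(battles))])
    model = LogisticRegression(fit_intercept=False, C=1e6).fit(X, y)
    return model.coef_[0]

def bootstrap_ci(battles, n_models, n_boot=1000, seed=0):
    """Percentile 95% confidence intervals on each model's Bradley-Terry log-strength."""
    rng = np.random.default_rng(seed)
    battles = np.array(battles)
    scores = np.stack([
        fit_bt(battles[rng.integers(0, len(battles), len(battles))], n_models)
        for _ in range(n_boot)
    ])
    return np.percentile(scores, [2.5, 97.5], axis=0)
```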

Metrics

Four metrics are used to judge the efficacy of the benchmark (a sketch of their computation follows the list):
1. Spearman’s Correlation Coefficient: Measures how closely the benchmark’s model ranking matches the ranking of a reference benchmark (Chatbot Arena).
2. Separability: Indicates how well the benchmark can differentiate between models, i.e., the fraction of model pairs whose confidence intervals do not overlap.
3. Agreement with Confidence Interval (CI): Measures how often the benchmark and the reference agree on the ordering of a model pair when both separate that pair with confidence.
4. Brier Score: Evaluates how well the benchmark’s implied win probabilities predict the actual ordering of competing models; lower is better.
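A rough sketch of how these four metrics could be computed from per-model scores and 95% confidence intervals on the benchmark and a reference leaderboard; the interfaces and the non-overlap criterion are assumptions for illustration, not the paper's exact definitions:

```python
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

def _direction(ci_a, ci_b):
    """+1 if ci_a lies entirely above ci_b, -1 if below, 0 if the intervals overlap."""
    if ci_a[0] > ci_b[1]:
        return 1
    if ci_b[0] > ci_a[1]:
        return -1
    return 0

def separability(cis):
    """Fraction of model pairs whose confidence intervals do not overlap."""
    return np.mean([_direction(a, b) != 0 for a, b in combinations(cis, 2)])

def agreement_with_ci(bench_cis, ref_cis):
    """Among pairs both leaderboards separate confidently, fraction with the same ordering."""
    hits, total = 0, 0
    for (a1, a2), (b1, b2) in zip(combinations(bench_cis, 2), combinations(ref_cis, 2)):
        da, db = _direction(a1, a2), _direction(b1, b2)
        if da != 0 and db != 0:
            total += 1
            hits += int(da == db)
    return hits / max(total, 1)

def pairwise_brier(bench_scores, ref_scores):
    """Brier score of benchmark-implied win probabilities against the reference ordering."""
    errs = []
    for i, j in combinations(range(len(bench_scores)), 2):
        p = 1.0 / (1.0 + np.exp(bench_scores[j] - bench_scores[i]))  # Bradley-Terry win probability
        errs.append((p - float(ref_scores[i] > ref_scores[j])) ** 2)
    return np.mean(errs)

def spearman(bench_scores, ref_scores):
    """Spearman rank correlation between the two leaderboards."""
    return spearmanr(bench_scores, ref_scores)[0]
```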

Results

Separability, Agreement with CI (95%), Pair Brier Score

The benchmark demonstrates higher separability (84.4%) compared to Arena-Hard v0.1 (80%) and Alpaca-Eval 2.0 LC (73.33%). It also shows higher agreement with CI (84.44%) and a strong Spearman’s correlation coefficient (0.915). The Brier score is lower (0.0417), indicating that the benchmark’s predictions of pairwise outcomes are both confident and accurate.

Diversity

The evaluation set covers more domains and languages compared to Arena-Hard v0.1 and Alpaca-Eval 2.0 LC. The distribution of categories is more balanced, thanks to stratified sampling.

Category Separability

Category-wise separability indicates which categories are better at testing model performance. Hungarian has the best separability, while the medical category has the lowest.

Using Different Judges

An ablation study with different judge models shows that while weaker models can separate other models based on capability, they lack the precision of GPT-4o in aligning with Chatbot Arena rankings.

Limitations and Future Work

The current methodology involves significant manual curation. Future work aims to automate this process using LLMs for category generation and quality checks. Additionally, expanding the leaderboard to include more models and improving category separability are areas for future exploration.

Conclusion

This paper introduces a data pipeline that leverages semi-supervised learning to create diverse, domain-specific evaluation sets for LLM-as-a-judge frameworks. The benchmark achieves higher separability and agreement with Chatbot Arena, covering a wide variety of topics and languages. This approach provides practitioners with a tool to evaluate models for their specific use cases.

Datasets:

MMLU, GSM8K, MATH, AlpacaEval
