Authors:

Linhao Yu, Yongqi Leng, Yufei Huang, Shang Wu, Haixin Liu, Xinmeng Ji, Jiahui Zhao, Jinwang Song, Tingting Cui, Xiaoqing Cheng, Tao Liu, Deyi Xiong

Paper:

https://arxiv.org/abs/2408.09819

Introduction

In recent years, large language models (LLMs) have made significant strides in natural language understanding and generation. However, the ethical and moral implications of their outputs remain a critical concern. As LLMs become more integrated into real-world applications, ensuring their alignment with societal values and norms is paramount. This paper introduces CMoralEval, a comprehensive benchmark designed to evaluate the moral reasoning capabilities of Chinese LLMs. The dataset is derived from diverse sources, including a Chinese TV program and various newspapers and academic papers, to ensure authenticity and diversity.

Related Work

The foundation for moral and ethical evaluation of LLMs can be traced back to Moral Foundations Theory (MFT), which categorizes moral precepts into distinct domains. Over time, several datasets have been developed based on MFT, such as Social Chemistry 101 and ETHICS, to evaluate LLMs on various ethical dimensions. However, benchmarks tailored to Chinese culture have been lacking. CMoralEval aims to fill this gap by providing a dataset grounded in Chinese moral norms and societal values.

Research Methodology

Data Sources

CMoralEval encompasses two types of scenarios: explicit moral scenarios and moral dilemma scenarios. The data is sourced from:
1. “Observations on Morality,” a Chinese TV program covering legal and ethical cases.
2. A collection of Chinese moral anomies gathered from newspapers and academic papers.

Morality Taxonomy

The dataset is categorized into five moral dimensions within Chinese society:
1. Familial Morality
2. Social Morality
3. Professional Ethics
4. Internet Ethics
5. Personal Morality

These categories are inspired by traditional Confucianism and national moral initiatives, ensuring a comprehensive representation of Chinese moral norms.

Fundamental Moral Principles

Five fundamental moral principles are defined to guide the evaluation:
1. Goodness
2. Filial Piety
3. Ritual
4. Diligence
5. Innovation

These principles are rooted in traditional Chinese cultural values and serve as criteria for evaluating the correctness of options in specific scenarios.

Experimental Design

Data Annotation and Quality Control

A dedicated annotation platform was established to support the efficient construction and annotation of CMoralEval instances. The annotation process involved generating basic scenes from the data sources, extracting narrators and Rules of Thumb (RoT), and creating new scenes for moral dilemma templates. ChatGPT (GPT-3.5) was used to assist in generating candidate options for each scene. Stringent quality control measures were applied to ensure the integrity and reliability of the dataset.
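
As a rough illustration of the option-generation step, the sketch below asks a GPT-3.5 chat model to draft candidate options for a scenario. The scenario fields, prompt wording, and model settings are illustrative assumptions, not the paper's actual annotation pipeline.

```python
# Illustrative sketch only: the prompt wording and scenario fields below are
# assumptions for demonstration, not the paper's actual annotation setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_options(scene: str, rule_of_thumb: str) -> str:
    """Ask GPT-3.5 to draft candidate answer options for a moral scenario."""
    prompt = (
        "给定以下道德场景和道德准则，请生成四个候选行为选项（A/B/C/D），"
        "其中只有一个选项符合该道德准则。\n"
        f"场景：{scene}\n准则：{rule_of_thumb}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

# Drafted options would then be reviewed and corrected by human annotators
# as part of the quality-control process described above.
```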

Dataset Statistics

The final dataset comprises 30,388 instances, categorized into five moral dimensions. The distribution of instances across categories and the average length of questions are detailed in Table 2.
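
To make the scale of the dataset concrete, the following sketch tallies instances per moral dimension and computes the average question length from a local copy of the data. The file name and field names are hypothetical and may not match the released format.

```python
# Hypothetical inspection script; the file name and field names ("category",
# "question") are assumptions and may differ from the released data format.
import json
from collections import Counter

with open("cmoraleval.json", encoding="utf-8") as f:  # hypothetical file name
    instances = json.load(f)

counts = Counter(item["category"] for item in instances)
avg_len = sum(len(item["question"]) for item in instances) / len(instances)

print(f"Total instances: {len(instances)}")
for category, count in counts.most_common():
    print(f"{category}: {count}")
print(f"Average question length (characters): {avg_len:.1f}")
```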

Results and Analysis

Overall Performance

Extensive experiments were conducted on 26 open-source Chinese LLMs ranging from 0.7B to 34B parameters, under both zero-shot and few-shot settings with the lm-evaluation-harness framework. Yi-34B-Chat achieves the best overall performance, particularly on explicit moral scenarios. However, most LLMs perform close to random guessing, indicating substantial room for improvement in moral reasoning capabilities.
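
For readers who want to reproduce this kind of evaluation, the sketch below shows log-likelihood scoring of answer options with a Hugging Face causal LM, which is the general approach lm-evaluation-harness takes for multiple-choice tasks. The checkpoint name, prompt format, and data handling are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of log-likelihood multiple-choice scoring, similar in spirit
# to lm-evaluation-harness; the checkpoint and prompt format are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "01-ai/Yi-34B-Chat"  # example checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

@torch.no_grad()
def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities assigned to the option tokens given the question."""
    context_ids = tokenizer(question, return_tensors="pt").input_ids.to(model.device)
    option_ids = tokenizer(
        option, add_special_tokens=False, return_tensors="pt"
    ).input_ids.to(model.device)
    input_ids = torch.cat([context_ids, option_ids], dim=1)
    logits = model(input_ids).logits
    # Score only the option tokens (shift by one for next-token prediction).
    option_logits = logits[0, context_ids.shape[1] - 1 : -1]
    log_probs = torch.log_softmax(option_logits, dim=-1)
    token_logprobs = log_probs.gather(1, option_ids[0].unsqueeze(-1)).squeeze(-1)
    return token_logprobs.sum().item()

def predict(question: str, options: list[str]) -> int:
    """Return the index of the highest-scoring option (zero-shot)."""
    return max(range(len(options)), key=lambda i: option_logprob(question, options[i]))
```

Few-shot evaluation follows the same scoring scheme, with in-context examples prepended to the question before the options are scored.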

Performance Across Categories

The analysis of LLMs across different moral categories reveals that larger models tend to perform better, especially in Familial Morality and Social Morality. This suggests that comprehensive training data capturing collective moral concepts contributes to improved performance.

Single-Category vs Multi-Category Questions

The dataset includes both single-category and multi-category questions. The results show that LLMs generally perform better on single-category questions, suggesting that scenarios confined to one clearly delineated moral dimension are easier for models to judge than those that mix several dimensions.

Consistency of LLMs

The controlled experiments reveal that LLMs exhibit low accuracy rates across various scenarios, indicating a lack of consistency in moral reasoning. This suggests inherent limitations in the models’ capabilities to maintain uniform performance under varying conditions.
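
The sketch below illustrates one simple way to quantify this kind of consistency: grouping variants of the same underlying scenario and checking whether a model answers them identically. The grouping scheme and metric are assumptions for illustration, not the paper's exact controlled protocol.

```python
# Illustrative consistency check: the grouping of variants and the agreement
# metric below are assumptions, not the paper's exact controlled setup.
from collections import defaultdict

def consistency_rate(predictions: list[tuple[str, int]]) -> float:
    """Fraction of scenario groups whose variants all receive the same answer.

    `predictions` holds (scenario_id, predicted_option_index) pairs, where
    variants of the same underlying scenario share a scenario_id.
    """
    groups = defaultdict(set)
    for scenario_id, answer in predictions:
        groups[scenario_id].add(answer)
    consistent = sum(1 for answers in groups.values() if len(answers) == 1)
    return consistent / len(groups)

# Example: two scenarios, each with two variants; the second is inconsistent.
preds = [("s1", 0), ("s1", 0), ("s2", 1), ("s2", 3)]
print(consistency_rate(preds))  # 0.5
```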

Overall Conclusion

CMoralEval provides a comprehensive benchmark for evaluating the moral reasoning capabilities of Chinese LLMs. The dataset reveals significant disparities and underperformance in current Chinese LLMs, highlighting the need for further research and development in this area. The high-quality dataset, produced under stringent annotation standards, offers a valuable resource for advancing the ethical alignment of LLMs with societal values and norms.

The dataset is publicly available in the CMoralEval GitHub repository.


This interpretive article aims to offer insights into the development and evaluation of CMoralEval, emphasizing its significance for Chinese LLMs and their alignment with societal values.

