Authors:

Ritwik Mishra, Sreeram Vennam, Rajiv Ratn Shah, Ponnurangam Kumaraguru

Paper:

https://arxiv.org/abs/2408.10604

Introduction

In the realm of Question Answering (QA) systems, most existing datasets focus on factoid-based short-context questions, predominantly in high-resource languages. However, there is a significant gap when it comes to non-factoid questions, especially in low-resource languages. This study introduces MuNfQuAD, a multilingual QA dataset designed to address this gap by focusing on non-factoid questions. The dataset leverages interrogative subheadings from BBC news articles as questions and the corresponding paragraphs as silver answers, encompassing over 370K QA pairs across 38 languages.

Related Work

Existing QA Datasets

Several QA datasets have been developed over the years, primarily focusing on factoid questions:
WikiQA: Extracted questions from Bing query logs and matched them with Wikipedia articles.
SQuAD: A benchmark dataset where crowdworkers generated questions based on Wikipedia passages.
Natural Questions: The largest factoid-based dataset, including long and short answers with Wikipedia articles as context.

Multilingual QA Datasets

Efforts in multilingual QA datasets include:
bAbI: Contained factoid-based questions in English and Hindi.
XQA: Gathered questions from Wikipedia’s “Did you know?” boxes.
MLQA: Generated questions from English Wikipedia articles and translated them into multiple languages.
TyDi QA: Focused on natural questions in multiple languages, encouraging curiosity-driven questions.

MuNfQuAD stands out by providing a large-scale multilingual dataset specifically for non-factoid questions, filling a critical gap in existing resources.

Research Methodology

Data Collection

The dataset was curated by scraping BBC news articles in multiple languages. The process involved the following steps (a minimal scraping sketch follows the list):
1. Scraping Articles: Using Python libraries to scrape articles from the BBC website and the Wayback Machine.
2. Identifying Questions: Extracting interrogative subheadings as questions.
3. Extracting Answers: Using the paragraphs following the interrogative subheadings as silver answers.
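
To make step 1 concrete, here is a minimal scraping sketch using requests and BeautifulSoup. The URL, HTML tags, and function name are illustrative assumptions, not the authors' exact pipeline (which also drew on the Wayback Machine).

```python
# Minimal scraping sketch (illustrative): fetch one BBC article and collect its
# subheadings and paragraphs in document order. The selectors and URL are
# placeholders, not the authors' exact pipeline.
import requests
from bs4 import BeautifulSoup

def fetch_article_blocks(url: str):
    """Return the article body as a list of (tag_name, text) blocks."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    # BBC article bodies typically interleave <h2> subheadings with <p> paragraphs.
    for tag in soup.find_all(["h2", "p"]):
        text = tag.get_text(" ", strip=True)
        if text:
            blocks.append((tag.name, text))
    return blocks

if __name__ == "__main__":
    # Placeholder URL; any BBC article in one of the 38 languages would do.
    for name, text in fetch_article_blocks("https://www.bbc.com/hindi/articles/example")[:10]:
        print(f"{name}: {text[:80]}")
```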

Silver Labels

The dataset relies on the hypothesis that paragraphs succeeding an interrogative subheading contain the answer, referred to as silver labels. This approach has been validated in other domains like legal and medical text classification.
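
The pairing logic behind these silver labels is simple to sketch. The function below consumes the (tag, text) blocks from the scraping sketch above, treats subheadings ending with a question mark as questions, and joins the paragraphs that follow into a silver answer; the function name and output format are assumptions for illustration.

```python
def extract_silver_pairs(blocks):
    """Pair each interrogative subheading with the paragraphs that follow it.

    `blocks` is a list of (tag_name, text) tuples in document order, e.g. the
    output of fetch_article_blocks() from the sketch above.
    """
    pairs, question, answer = [], None, []
    for name, text in blocks:
        if name == "h2":
            if question and answer:  # close the previous QA pair
                pairs.append({"question": question, "answer": " ".join(answer)})
            # A subheading ending in a question mark is treated as a non-factoid
            # question; scripts such as Arabic or Urdu need the '؟' mark as well.
            question = text if text.endswith(("?", "؟")) else None
            answer = []
        elif name == "p" and question is not None:
            answer.append(text)  # succeeding paragraphs form the silver answer
    if question and answer:
        pairs.append({"question": question, "answer": " ".join(answer)})
    return pairs
```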

Experimental Design

Dataset Statistics

MuNfQuAD comprises over 329,000 unique QA pairs, making it the largest multilingual non-factoid QA dataset to date. The dataset includes:
38 languages: Covering a wide range of low-resource languages.
Diverse Content: Articles predominantly from the Asian subcontinent.

Answer Paragraph Selection (APS) Model

The APS model is designed to identify paragraphs that answer a given question. The model architecture includes the following components, sketched below:
1. Input: Concatenation of a question and a candidate paragraph.
2. Output: Probability score indicating the likelihood of the paragraph being an answer.
3. Training: Fine-tuning multilingual pretrained encoders such as XLM-RoBERTa and mBERT.
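
A minimal inference-time sketch of this setup with Hugging Face Transformers is given below; the encoder name, maximum sequence length, and function name are placeholders rather than the paper's exact configuration.

```python
# Sketch of the APS setup: score a (question, paragraph) pair with a multilingual
# encoder carrying a binary classification head. Hyperparameters are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "xlm-roberta-base"  # the paper also reports mBERT and XLM-V variants
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def answer_probability(question: str, paragraph: str) -> float:
    """Probability that `paragraph` answers `question` (label 1)."""
    # The question and candidate paragraph are fed as a single sentence pair.
    inputs = tokenizer(question, paragraph, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Fine-tuning would minimise cross-entropy over labelled (question, paragraph, 0/1)
# pairs, e.g. with the Hugging Face Trainer; only inference is sketched here.
```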

Baselines

Baseline models include the following (both are sketched after the list):
Sentence Transformers: Using vector embeddings to calculate cosine similarity between questions and paragraphs.
TF-IDF Vectorizer: Training on the dataset to derive confidence scores based on word overlap.
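
Both baselines fit in a few lines; the embedding model, example texts, and scoring details below are illustrative assumptions rather than the paper's exact setup.

```python
# Baseline sketches: embedding cosine similarity (sentence-transformers) and
# TF-IDF word-overlap similarity between a question and candidate paragraphs.
from sentence_transformers import SentenceTransformer, util
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

question = "Why did the floods worsen this year?"          # toy example
paragraphs = ["Officials blamed heavier monsoon rains and poor drainage.",
              "The home side won the match by three wickets."]

# 1) Sentence-transformer baseline: cosine similarity between embeddings.
st_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
st_scores = util.cos_sim(st_model.encode(question, convert_to_tensor=True),
                         st_model.encode(paragraphs, convert_to_tensor=True))[0].tolist()

# 2) TF-IDF baseline: fit a vectorizer on the texts and score word overlap.
vectorizer = TfidfVectorizer().fit([question] + paragraphs)
tfidf_scores = cosine_similarity(vectorizer.transform([question]),
                                 vectorizer.transform(paragraphs))[0].tolist()

print("sentence-transformer scores:", st_scores)
print("TF-IDF scores:", tfidf_scores)
```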

Results and Analysis

Model Performance

The APS model fine-tuned with the XLM-V encoder demonstrated the highest macro F1 and Label-1 F1 scores, outperforming the baseline models. The model achieved the following scores (their computation is sketched after the list):
Accuracy: 80%
Macro F1: 72%
Label-1 F1: 56%
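
For reference, these metrics can be computed with scikit-learn as sketched below; macro F1 averages the per-class F1 scores, while Label-1 F1 is the F1 of the "is an answer" class alone. The labels shown are toy values, not the paper's predictions.

```python
# Toy illustration of the reported metrics; 1 = paragraph answers the question.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 0, 1, 0, 1, 0, 0]   # gold labels (toy values)
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]   # model predictions (toy values)

print("Accuracy:  ", accuracy_score(y_true, y_pred))
print("Macro F1:  ", f1_score(y_true, y_pred, average="macro"))
print("Label-1 F1:", f1_score(y_true, y_pred, pos_label=1, average="binary"))
```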

Evaluation on Golden Set

A subset of the dataset was manually annotated to create a golden set. The evaluation revealed:
Success Rate: 98% of questions could be answered using their silver answers.
Model Generalization: The APS model effectively generalized to certain languages within the golden set.

Large Language Models (LLMs)

Experiments with LLMs like ChatGPT and BLOOM showed that while LLMs can be used as APS models, they require substantial computational resources and may not outperform fine-tuned APS models.
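
As an illustration of the idea, the sketch below prompts a small generative model to judge whether a paragraph answers a question; the prompt wording and the small BLOOM checkpoint are assumptions made for the sake of a runnable example, not the paper's exact protocol.

```python
# Sketch of using a generative LLM as an APS model via a yes/no prompt.
# Prompt wording and the (small) BLOOM checkpoint are illustrative choices.
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom-560m")

def llm_says_paragraph_answers(question: str, paragraph: str) -> bool:
    prompt = (f"Question: {question}\n"
              f"Paragraph: {paragraph}\n"
              "Does the paragraph answer the question? Answer yes or no: ")
    output = generator(prompt, max_new_tokens=3, do_sample=False)[0]["generated_text"]
    return "yes" in output[len(prompt):].lower()
```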

Overall Conclusion

MuNfQuAD addresses a significant gap in multilingual QA datasets by focusing on non-factoid questions. The dataset, along with the fine-tuned APS model, provides a valuable resource for developing and evaluating QA systems in low-resource languages. Future work includes exploring generative techniques for QA and developing multilingual answer span extractors.

Illustrations

Data Collection and Model Fine-Tuning


Figure 1: An illustration depicting the data collection process and fine-tuning of the APS model.

Multilingual QA Dataset Statistics


Table 1: Attributes of different multilingual QA datasets.

MuNfQuAD Statistics


Table 2: Overview of MuNfQuAD statistics.

Frequent Entities in Questions


Table 4: Most frequent entities found in translated English MuNfQuAD questions.

Frequent N-grams in Questions


Table 3: Most frequent n-grams in translated English MuNfQuAD questions.

Model Performance on Golden Set


Table 6: Performance of silver labels and best performing APS model on the golden set.

Comparative Performance of APS Models


Table 5: Comparative performance of various models on the MuNfQuAD Test Set for APS task.

Question Category Distribution


Figure 2: Distribution of question categories in MuNfQuAD and NLQuAD.

MuNfQuAD represents a significant advancement in the field of multilingual QA, providing a robust dataset and model for non-factoid questions across a diverse range of languages.

Datasets:

SQuAD, Natural Questions, XQuAD, MLQA, XQA
