Authors:
Ritwik Mishra, Sreeram Vennam, Rajiv Ratn Shah, Ponnurangam Kumaraguru
Paper:
https://arxiv.org/abs/2408.10604
Introduction
Most existing Question Answering (QA) datasets focus on factoid, short-context questions, predominantly in high-resource languages, leaving a significant gap for non-factoid questions in low-resource languages. This study introduces MuNfQuAD, a multilingual QA dataset designed to address this gap. The dataset uses interrogative subheadings from BBC news articles as questions and the paragraphs that follow them as silver answers, encompassing over 370K QA pairs across 38 languages.
Related Work
Existing QA Datasets
Several QA datasets have been developed over the years, primarily focusing on factoid questions:
– WikiQA: Extracted questions from Bing query logs and matched them with Wikipedia articles.
– SQuAD: A benchmark dataset where crowdworkers generated questions based on Wikipedia passages.
– Natural Questions: The largest factoid-based dataset, including long and short answers with Wikipedia articles as context.
Multilingual QA Datasets
Efforts in multilingual QA datasets include:
– bAbI: Contained factoid-based questions in English and Hindi.
– XQA: Gathered questions from Wikipedia’s “Did you know?” boxes.
– MLQA: Generated questions from English Wikipedia articles and translated them into multiple languages.
– TyDi QA: Collected curiosity-driven, information-seeking questions written natively in multiple typologically diverse languages.
MuNfQuAD stands out by providing a large-scale multilingual dataset specifically for non-factoid questions, filling a critical gap in existing resources.
Research Methodology
Data Collection
The dataset was curated by scraping BBC news articles in multiple languages. The process involved the following steps (steps 2 and 3 are sketched in code after this list):
1. Scraping Articles: Using Python libraries to scrape articles from the BBC website and the Wayback Machine.
2. Identifying Questions: Extracting interrogative subheadings as questions.
3. Extracting Answers: Using the paragraphs following the interrogative subheadings as silver answers.
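The following is a minimal sketch of steps 2 and 3, assuming the article HTML exposes subheadings as <h2> tags and body text as <p> tags; the actual BBC markup and the authors' scraping code may differ.

```python
# Sketch: extract interrogative subheadings and the paragraphs that follow them.
# Assumes <h2> subheadings and <p> paragraphs; real BBC markup may differ.
from bs4 import BeautifulSoup

def extract_qa_pairs(html: str) -> list[tuple[str, str]]:
    """Return (question, silver answer) pairs from one article's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for heading in soup.find_all("h2"):
        question = heading.get_text(strip=True)
        if not question.endswith(("?", "؟")):   # keep only interrogative subheadings
            continue
        answer_parts = []
        for sibling in heading.find_next_siblings():
            if sibling.name == "h2":            # stop at the next subheading
                break
            if sibling.name == "p":
                answer_parts.append(sibling.get_text(strip=True))
        if answer_parts:
            pairs.append((question, " ".join(answer_parts)))
    return pairs
```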
Silver Labels
The dataset relies on the hypothesis that paragraphs succeeding an interrogative subheading contain the answer, referred to as silver labels. This approach has been validated in other domains like legal and medical text classification.
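One plausible way to turn these silver labels into training examples for paragraph selection is to pair each question with the paragraphs under its subheading as positives and with the remaining paragraphs of the same article as negatives; the paper's exact sampling scheme may differ.

```python
# Sketch: build silver-labeled examples for answer-paragraph selection.
# Positives come from the paragraphs under the interrogative subheading,
# negatives from the article's other paragraphs (illustrative scheme).
def build_examples(question, answer_paragraphs, other_paragraphs):
    examples = [{"question": question, "paragraph": p, "label": 1}
                for p in answer_paragraphs]
    examples += [{"question": question, "paragraph": p, "label": 0}
                 for p in other_paragraphs]
    return examples
```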
Experimental Design
Dataset Statistics
MuNfQuAD comprises over 329,000 unique QA pairs, making it the largest multilingual non-factoid QA dataset to date. The dataset includes:
– 38 languages: Covering a wide range of low-resource languages.
– Diverse Content: Articles predominantly from the Asiatic subcontinent.
Answer Paragraph Selection (APS) Model
The APS model is designed to identify paragraphs that answer a given question. The setup includes (a code sketch follows this list):
1. Input: Concatenation of a question and a candidate paragraph.
2. Output: Probability score indicating the likelihood of the paragraph being an answer.
3. Training: Fine-tuning multilingual pretrained encoders like XLM-Roberta and mBERT.
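A minimal sketch of this setup with Hugging Face Transformers, using xlm-roberta-base as the encoder; the checkpoint name, maximum length, and inference-only scoring shown here are illustrative rather than the paper's exact configuration.

```python
# Sketch: APS as binary (question, paragraph) sequence-pair classification.
# The checkpoint is illustrative; the paper also evaluates mBERT and XLM-V.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def aps_score(question: str, paragraph: str) -> float:
    """Probability that `paragraph` answers `question`."""
    inputs = tokenizer(question, paragraph, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()
```

Fine-tuning would train this classification head with cross-entropy on the silver-labeled (question, paragraph) pairs, for example via the Transformers Trainer.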
Baselines
Baseline models include (sketched in code after this list):
– Sentence Transformers: Using vector embeddings to calculate cosine similarity between questions and paragraphs.
– TF-IDF Vectorizer: Training on the dataset to derive confidence scores based on word overlap.
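Hedged sketches of both baselines follow; the sentence-transformers checkpoint is an assumption, and the TF-IDF vectorizer is simply fitted on the dataset's text before scoring question-paragraph overlap.

```python
# Sketch of the two baselines; the model name and TF-IDF setup are illustrative.
from sentence_transformers import SentenceTransformer, util
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sentence-Transformers baseline: cosine similarity between dense embeddings.
st_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def st_similarity(question: str, paragraph: str) -> float:
    q_emb, p_emb = st_model.encode([question, paragraph], convert_to_tensor=True)
    return util.cos_sim(q_emb, p_emb).item()

# TF-IDF baseline: word-overlap similarity in a sparse vector space.
def tfidf_similarity(question: str, paragraph: str, corpus: list[str]) -> float:
    vectorizer = TfidfVectorizer().fit(corpus)   # fit on the dataset's text
    vectors = vectorizer.transform([question, paragraph])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])
```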
Results and Analysis
Model Performance
The APS model fine-tuned with the XLM-V encoder achieved the highest macro F1 and Label-1 F1 scores, outperforming the baseline models (metric computation is sketched after this list). The model achieved:
– Accuracy: 80%
– Macro F1: 72%
– Label-1 F1: 56%
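Here, Label-1 F1 is the F1 score computed on class 1, i.e. the answer-paragraph class. A minimal sketch of how these metrics can be computed with scikit-learn, using illustrative labels:

```python
# Sketch: the three reported metrics for the binary APS task.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0]   # gold labels (illustrative values)
y_pred = [1, 0, 0, 1, 1]   # APS model predictions (illustrative values)

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")   # mean of per-class F1
label1_f1 = f1_score(y_true, y_pred, pos_label=1)      # F1 on the answer class
```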
Evaluation on Golden Set
A subset of the dataset was manually annotated to create a golden set. The evaluation revealed:
– Success Rate: 98% of questions could be answered using their silver answers.
– Model Generalization: The APS model effectively generalized to certain languages within the golden set.
Large Language Models (LLMs)
Experiments with LLMs like ChatGPT and BLOOM showed that while LLMs can be used as APS models, they require substantial computational resources and may not outperform fine-tuned APS models.
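A hedged sketch of using a chat LLM as a zero-shot APS model; the prompt wording, model name, and OpenAI client call are illustrative and are not the paper's exact setup.

```python
# Sketch: zero-shot answer-paragraph selection with a chat LLM.
# Prompt and model name are illustrative, not the paper's configuration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_aps(question: str, paragraph: str, model: str = "gpt-3.5-turbo") -> bool:
    prompt = (
        "Does the following paragraph answer the question? Reply with YES or NO.\n\n"
        f"Question: {question}\nParagraph: {paragraph}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```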
Overall Conclusion
MuNfQuAD addresses a significant gap in multilingual QA datasets by focusing on non-factoid questions. The dataset, along with the fine-tuned APS model, provides a valuable resource for developing and evaluating QA systems in low-resource languages. Future work includes exploring generative techniques for QA and developing multilingual answer span extractors.
Illustrations
Data Collection and Model Fine-Tuning
Figure 1: An illustration depicting the data collection process and fine-tuning of the APS model.
Multilingual QA Dataset Statistics
Table 1: Attributes of different multilingual QA datasets.
MuNfQuAD Statistics
Table 2: Overview of MuNfQuAD statistics.
Frequent N-grams in Questions
Table 3: Most frequent n-grams in translated English MuNfQuAD questions.
Frequent Entities in Questions
Table 4: Most frequent entities found in translated English MuNfQuAD questions.
Comparative Performance of APS Models
Table 5: Comparative performance of various models on the MuNfQuAD test set for the APS task.
Model Performance on Golden Set
Table 6: Performance of silver labels and the best-performing APS model on the golden set.
Question Category Distribution
Figure 2: Distribution of question categories in MuNfQuAD and NLQuAD.
MuNfQuAD represents a significant advancement in the field of multilingual QA, providing a robust dataset and model for non-factoid questions across a diverse range of languages.