Author: James Smith
Authors: Thao Nguyen, Jeffrey Li, Sewoong Oh, Ludwig Schmidt, Jason Weston, Luke Zettlemoyer, Xian Li
ArXiv: http://arxiv.org/abs/2408.04614v1
Abstract: We propose a new method, instruction back-and-forth translation, to construct high-quality synthetic data grounded in world knowledge for aligning large language models (LLMs). Given documents from a web corpus, we generate and curate synthetic instructions using the backtranslation approach proposed by Li et al. (2023a), and rewrite the responses to improve their quality further based on the initial documents. Fine-tuning with the resulting (backtranslated instruction, rewritten response) pairs yields higher win rates on AlpacaEval than using other common instruction datasets such as Humpback, ShareGPT, Open Orca, Alpaca-GPT4 and Self-instruct. We also…
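The pipeline the abstract describes — backtranslate an instruction from a document, curate, then rewrite the response — can be sketched as below. The three `llm_*` helpers are hypothetical stand-ins for real LLM calls, not the paper's actual models or prompts:

```python
# Minimal sketch of the (backtranslated instruction, rewritten response)
# pipeline. All llm_* functions are illustrative stubs standing in for
# calls to an instruction-following LLM.

def llm_backtranslate(document: str) -> str:
    """Stub: generate an instruction to which `document` is a plausible answer."""
    return f"Explain the following topic: {document[:40]}"

def llm_score(instruction: str, response: str) -> float:
    """Stub: curation step scoring instruction/response fit (0-1)."""
    return 1.0 if instruction and response else 0.0

def llm_rewrite(document: str, instruction: str) -> str:
    """Stub: rewrite the source document into a direct, high-quality response."""
    return f"Answer to '{instruction}': {document}"

def build_pairs(corpus, threshold=0.5):
    pairs = []
    for doc in corpus:
        instruction = llm_backtranslate(doc)          # step 1: backtranslate
        if llm_score(instruction, doc) >= threshold:  # step 2: curate
            response = llm_rewrite(doc, instruction)  # step 3: rewrite
            pairs.append((instruction, response))
    return pairs

pairs = build_pairs(["Photosynthesis converts light into chemical energy."])
```

The resulting pairs would then be used as fine-tuning data in place of datasets like ShareGPT or Self-instruct.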
1. Abstract This paper conducts a comprehensive survey on data quality (DQ) evaluation and improvement tools for machine learning (ML). It emphasizes the critical role of high-quality data in ML model performance, fairness, robustness, safety, and scalability. The paper introduces four DQ dimensions (intrinsic, contextual, representational, and accessibility) and twelve metrics specific to ML, providing definitions and examples. The survey reviews seventeen open-source DQ tools released in the past five years, analyzing their strengths and limitations based on the DQ dimensions and metrics. It proposes a roadmap for developing new DQ tools, highlighting the importance of integrating automation, monitoring, and AI…
Abstract This paper evaluates the effectiveness of large language models (LLMs), specifically ChatGPT, in performing machine translation (MT) tasks across a diverse range of languages. Using the FLORES-200 benchmark, the authors compare ChatGPT’s performance with traditional MT models like NLLB, as well as commercial systems like Google Translate and the more advanced GPT-4. The results reveal that while ChatGPT demonstrates competitive performance for high-resource languages (HRLs), it consistently falls short for low-resource languages (LRLs), underperforming traditional MT models in 84.1% of the languages evaluated. The study also highlights the limited effectiveness of few-shot prompts for improving LLM MT performance and…
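The headline "underperforms traditional MT models in 84.1% of the languages" boils down to a per-language comparison across the benchmark. A hedged sketch of that computation, using made-up scores rather than the paper's actual FLORES-200 results:

```python
# Count the fraction of languages where the LLM scores below the baseline
# MT system. Scores are illustrative placeholders (e.g. chrF-like values),
# not figures from the paper.

def fraction_underperforming(llm_scores, baseline_scores):
    """Share of languages where the LLM's score is below the baseline's."""
    worse = sum(1 for lang in llm_scores if llm_scores[lang] < baseline_scores[lang])
    return worse / len(llm_scores)

llm  = {"eng": 62.1, "fra": 58.4, "swh": 31.2, "quy": 12.5}  # hypothetical
nllb = {"eng": 60.0, "fra": 57.9, "swh": 40.3, "quy": 25.1}  # hypothetical
frac = fraction_underperforming(llm, nllb)  # 2 of 4 languages → 0.5
```

In this toy example the LLM trails on the two lower-resource languages (`swh`, `quy`), mirroring the HRL/LRL split the abstract reports.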
1. Abstract The paper explores the complexities and challenges of engineering Retrieval Augmented Generation (RAG) systems, which leverage large language models (LLMs) to generate answers by retrieving relevant information from a data store. By presenting a mixed-methods approach that combines empirical experiments and case studies, the authors provide valuable insights into the limitations and failure points of RAG systems. They identify seven key failure points, including missing content, missed top-ranked documents, context limitations, extraction failures, format errors, incorrect specificity, and incomplete answers. The paper emphasizes the importance of continuous calibration, configuration, and testing for RAG systems, highlighting their evolving nature.…
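The seven failure points listed in the abstract lend themselves to an explicit taxonomy, e.g. for tagging failures during RAG system testing. The one-line glosses below are paraphrases for illustration, not the paper's exact definitions:

```python
# Taxonomy of the seven RAG failure points named in the abstract.
# Glosses are illustrative paraphrases.
from enum import Enum

class RAGFailurePoint(Enum):
    MISSING_CONTENT = "question is not answerable from the data store"
    MISSED_TOP_RANKED = "answer exists but is not in the retrieved top-k"
    CONTEXT_LIMITATION = "retrieved but dropped when assembling the context"
    EXTRACTION_FAILURE = "in context, but the LLM fails to extract it"
    FORMAT_ERROR = "answer ignores the requested format (table, list, ...)"
    INCORRECT_SPECIFICITY = "answer is too general or too specific"
    INCOMPLETE_ANSWER = "partial answer despite available information"

n_failure_points = len(RAGFailurePoint)  # 7
```

Labeling observed failures against such a taxonomy supports the continuous calibration and testing the paper calls for.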
1. Abstract Objective: This study investigates the potential of voice analysis as a tool for prescreening or monitoring Type 2 Diabetes Mellitus (T2DM) by examining differences in voice recordings between non-diabetic and T2DM individuals. Methods: A total of 267 participants (79 women and 113 men non-diabetic, 18 women and 57 men T2DM) were recruited in India. Using a smartphone application, participants recorded a fixed phrase up to 6 times daily for 2 weeks, resulting in 18,465 recordings. Fourteen acoustic features were extracted from each recording to analyze differences between groups and create prediction models for T2DM status. Results: Significant differences were found…
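The study's analysis pattern — extract acoustic features per recording, then compare group statistics — can be sketched minimally. The feature values below are synthetic placeholders, not data from the study's 18,465 recordings:

```python
# Illustrative sketch of a between-group comparison on one acoustic feature.
# Values are synthetic; a real analysis would extract fourteen features
# (pitch, jitter, shimmer, etc.) from each recording.
from statistics import mean

def group_mean_difference(group_a, group_b):
    """Difference in group means for a single acoustic feature."""
    return mean(group_a) - mean(group_b)

# Hypothetical mean fundamental-frequency (Hz) samples per group.
non_diabetic_f0 = [118.0, 121.5, 119.2]
t2dm_f0 = [112.3, 110.8, 114.1]
diff = group_mean_difference(non_diabetic_f0, t2dm_f0)
```

A full replication would add a significance test per feature and a classifier over all fourteen features, as the abstract's "prediction models" suggest.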