Authors:
Narayanan PP, Anantharaman Palacode Narayana Iyer
Paper:
https://arxiv.org/abs/2408.09434
HySem: A Context Length Optimized LLM Pipeline for Unstructured Tabular Extraction
Introduction
In the pharmaceutical industry, regulatory compliance reporting often involves detailed tables that are under-utilized beyond their initial compliance purposes due to their unstructured format. Extracting and semantically representing this tabular data is challenging due to the diverse presentations of tables. Large Language Models (LLMs) have shown potential for semantic representation but face challenges related to accuracy and context size limitations. This study introduces HySem, a pipeline that employs a novel context length optimization technique to generate accurate semantic JSON representations from HTML tables. The approach is tailored to small and medium pharmaceutical enterprises, with a focus on cost and data privacy.
Related Work
Challenges in Tabular Data Extraction
Tables are a fundamental method for presenting structured information across various industries, including regulatory information in the pharmaceutical industry and financial reports for businesses. Some inherent challenges in tables include heterogeneity, sparsity, context-based interconnection, and lack of prior knowledge. Recent advancements in LLMs have shown promise in addressing these challenges for various table-related tasks such as entity matching, data imputation, tabular Q&A, schema augmentation, serialization, table manipulation, table understanding, prompt engineering, and table RAG.
Existing Solutions
Several works have investigated the performance of LLMs for different input tabular formats such as JSON, HTML tables, and Markdown. However, extracting complex tabular data from semi-structured sources and providing a semantic representation remains a hard problem due to the lack of standards in tabular data representations. The language understanding ability of LLMs presents an immense opportunity to develop semantics-driven representations, but LLMs are prone to hallucinations, accuracy challenges, and syntax errors.
Research Methodology
Key Goals
The research goals for HySem were formulated in accordance with industry needs, particularly for small and medium pharmaceutical enterprises:
1. On-Premise Model Deployment: Models should be run on-premise due to data security considerations.
2. Commodity Hardware Compatibility: Models should operate on commodity hardware with widely available GPUs.
3. Model Size Limitation: The model size should be less than 10 billion parameters to ensure compatibility with GPUs that have 16 GB of memory.
4. Open-Source Software Stack: The software stack should leverage open-source tools and models.
5. Maximized Extraction Accuracy: Extraction accuracy should be maximized to reduce costly human correction efforts.
Contributions
- Context Optimizer: A novel context optimizer that rewrites input text to significantly reduce the token count.
- Semantic Synthesizer: A custom fine-tuned model that generates accurate semantic representations of HTML tables.
- Automated Syntax Correction: An automated correction system to address syntax errors.
- Evaluation Methodology: A novel evaluation methodology that measures the performance of the pipeline using both intrinsic and extrinsic metrics.
Experimental Design
HySem Pipeline Components
HySem is composed of three main components: Context Optimizer Subsystem (ACO), Semantic Synthesizer (ASS), and Syntax Validator (ASV).
Context Optimizer Subsystem
The limited context size of LLMs constrains their ability to process extensive tables effectively. The Context Optimizer applies a novel pre-processing methodology that optimizes the context window consumed by HTML tables, enabling the processing of large tables. It comprises an encoding phase that shortens the token sequence fed to the LLM and a decoding phase that restores the original table lexicon in the model's output.
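The paper's encoding scheme is not reproduced in this summary; the sketch below illustrates the general idea under the assumption that long, repeated cell strings are swapped for short placeholder tokens before inference and restored afterwards. The `encode`/`decode` functions, the `@i` placeholder format, and the thresholds are illustrative assumptions, not HySem's actual implementation.

```python
import re
from collections import Counter

def encode(html: str, min_len: int = 8, min_count: int = 2):
    """Replace long, repeated cell strings with short placeholders.

    Hypothetical sketch; the real HySem encoding scheme is not public.
    """
    # Candidate strings: text found between HTML tags.
    candidates = Counter(
        s for s in re.findall(r">([^<>]+)<", html) if len(s) >= min_len
    )
    mapping = {}
    for i, (text, count) in enumerate(candidates.items()):
        if count >= min_count:
            placeholder = f"@{i}"          # short stand-in token
            mapping[placeholder] = text
            html = html.replace(text, placeholder)
    return html, mapping

def decode(text: str, mapping: dict) -> str:
    """Restore the original lexicon in the LLM output."""
    # Replace longer placeholders first so "@1" never clobbers "@10".
    for placeholder, original in sorted(mapping.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(placeholder, original)
    return text

# Usage: encode before prompting the LLM, decode its JSON output afterwards.
compact_html, lexicon = encode("<table><td>acetylsalicylic acid</td>"
                               "<td>acetylsalicylic acid</td></table>")
restored = decode(compact_html, lexicon)
```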
Semantic Synthesizer
HTML tables are highly varied and arbitrary, featuring elements such as row spans, column spans, multi-level headers, nested tables, and diverse data types. The Semantic Synthesizer uses an LLM fine-tuned with a manually labeled dataset to transform HTML tables into semantic JSON.
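To make the target representation concrete, here is a small illustrative pairing of an HTML table with one plausible semantic JSON output. The field names (`records`, the nested header keys) are assumptions for illustration; the schema emitted by HySem's fine-tuned model is not specified in this summary.

```python
# Hypothetical input/output pair; the real HySem JSON schema may differ.
html_table = """
<table>
  <tr><th rowspan="2">Batch</th><th colspan="2">Assay (%)</th></tr>
  <tr><th>Initial</th><th>6 months</th></tr>
  <tr><td>B001</td><td>99.8</td><td>99.1</td></tr>
</table>
"""

# One plausible semantic JSON the fine-tuned model could emit,
# resolving the row span and the multi-level header into nested keys:
semantic_json = {
    "records": [
        {"Batch": "B001", "Assay (%)": {"Initial": "99.8", "6 months": "99.1"}}
    ]
}
```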
Syntax Validator
Syntax errors in the LLM-generated JSON output render the table unusable for further processing. The Syntax Validator uses a reflective agentic framework to iteratively refine the JSON output until a syntactically valid result is achieved.
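A reflective validate-and-repair loop of this kind can be sketched as follows. The `llm_fix` callable stands in for whatever repair prompt HySem actually issues and is an assumption; only the parse-check-retry structure is taken from the description above.

```python
import json

def validate_and_repair(candidate: str, llm_fix, max_rounds: int = 3) -> dict:
    """Iteratively ask the model to repair its own JSON until it parses.

    `llm_fix(candidate, error)` is a hypothetical callable that re-prompts
    the LLM with the offending output and the parser error message.
    """
    for _ in range(max_rounds):
        try:
            return json.loads(candidate)              # syntactically valid: done
        except json.JSONDecodeError as err:
            candidate = llm_fix(candidate, str(err))  # reflect the error back
    raise ValueError("Could not obtain syntactically valid JSON")
```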
Dataset and Fine-Tuning
The dataset for fine-tuning the Semantic Synthesizer includes HTML tables and their corresponding semantic JSON annotations. Open-sourced datasets such as PubTabNet and FinTabNet were used, along with manually curated annotations for proprietary customer-supplied data.
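The exact fine-tuning record format is not given here; a common way to organize such supervision is as JSONL records pairing each HTML table with its annotated semantic JSON. The prompt wording and field names below are assumptions.

```python
import json

# Hypothetical JSONL record for supervised fine-tuning; the actual
# prompt/response template used by HySem is not specified.
record = {
    "prompt": "Convert the following HTML table to semantic JSON:\n"
              "<table><tr><th>Batch</th><th>Assay (%)</th></tr>"
              "<tr><td>B001</td><td>99.8</td></tr></table>",
    "response": json.dumps({"records": [{"Batch": "B001", "Assay (%)": "99.8"}]}),
}
print(json.dumps(record))
```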
Results and Analysis
Evaluation Methodology
The evaluation of HySem involves intrinsic and extrinsic methods to measure content and semantic accuracy.
Intrinsic Evaluation
Intrinsic evaluation assesses the representation of content from HTML cells in the generated JSON. The Intrinsic Score (ISC) is computed based on the presence of HTML cell contents in the JSON output.
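The precise ISC formula is not reproduced in this summary; a minimal reading of "presence of HTML cell contents in the JSON output" is the fraction of table cells whose text reappears in the serialized JSON, as in the simplified stand-in below.

```python
import json
import re

def intrinsic_score(html: str, semantic_json: dict) -> float:
    """Fraction of HTML cell texts that reappear in the JSON output.

    A simplified stand-in for the ISC described in the paper.
    """
    cells = [c.strip() for c in re.findall(r"<t[dh][^>]*>(.*?)</t[dh]>", html, re.S)]
    cells = [c for c in cells if c]
    if not cells:
        return 1.0
    serialized = json.dumps(semantic_json, ensure_ascii=False)
    found = sum(1 for c in cells if c in serialized)
    return found / len(cells)
```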
Extrinsic Evaluation
Extrinsic evaluation assesses the semantic structure of the JSON by evaluating its ability to answer targeted questions. The Extrinsic Score (ESC) is computed based on the accuracy of answers generated from the JSON output.
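Similarly, the ESC can be read as question-answering accuracy over the JSON alone. The sketch below assumes a hypothetical `answer_from_json` callable (e.g., an LLM prompted with the JSON) and a list of gold question/answer pairs; the exact scoring protocol in the paper may differ.

```python
def extrinsic_score(semantic_json: dict, qa_pairs, answer_from_json) -> float:
    """Share of targeted questions answered correctly from the JSON alone.

    `answer_from_json(question, semantic_json)` is a hypothetical callable;
    `qa_pairs` is a list of (question, expected_answer) tuples.
    """
    if not qa_pairs:
        return 1.0
    correct = sum(
        1 for question, expected in qa_pairs
        if answer_from_json(question, semantic_json).strip() == expected.strip()
    )
    return correct / len(qa_pairs)
```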
Performance Comparison
HySem was benchmarked against popular open-source models and GPT-4o. It achieved over 25% higher accuracy on the intrinsic score and 3.97% higher on the extrinsic score than Meta-Llama-3-8B-Instruct. The Context Optimizer delivered substantial reductions in token count, which in turn reduced LLM inference time.
Overall Conclusion
HySem successfully implements a context length optimized LLM pipeline for generating semantic JSON output from HTML tables. The Context Optimizer significantly improves context utilization, enabling the processing of larger tables. HySem offers substantial advantages such as on-premise deployment on commodity hardware, cost-effectiveness, and enhanced data privacy. Future work includes supporting even larger tables, improving processing speed, and building custom models for other verticals.