Authors:
Xiao Wang, Yuehang Li, Fuling Wang, Shiao Wang, Chuanfu Li, Bo Jiang
Paper:
https://arxiv.org/abs/2408.09743
Introduction
X-ray image-based medical report generation is a critical application of Artificial Intelligence (AI) in healthcare. The goal is to leverage AI models to generate high-quality medical reports from X-ray images, thereby reducing the workload of doctors and minimizing patient waiting times. Despite significant advancements, current models still struggle to match the expertise of professional physicians due to challenges such as data privacy concerns, lack of quality and diversity in training datasets, and the rarity of certain diseases. This paper introduces a novel context-guided efficient X-ray medical report generation framework, R2GenCSR, which aims to address these challenges by enhancing feature representation and discriminative learning.
Related Work
X-ray Medical Report Generation
Existing models for X-ray medical report generation can be categorized into CNN-based, RNN-based, and Transformer-based frameworks. For instance, Li et al. combined CNNs and RNNs to generate reports from chest X-ray images, while Jing et al. used an LSTM framework to predict diseases and generate reports. Transformer models have also been employed to leverage visual and text data for improved performance.
Large Language Models
Large Language Models (LLMs) have shown great promise in natural language processing tasks. Notable examples include Google’s BERT, Meta’s LLaMA, and OpenAI’s GPT series. In the medical domain, models like R2Gen-GPT and MedicalGPT have been developed to generate medical reports by combining image and text data.
State Space Models
State Space Models (SSMs) like S4 and Mamba have been proposed to achieve linear complexity in sequence modeling. These models have been successfully applied to various research fields, including vision-related tasks in medical report generation.
Context Sample Retrieval
Retrieval-Augmented Generation (RAG) combines retrieval and generation to enhance performance in tasks requiring extensive knowledge and contextual information. Techniques like BG-Former and CricaVPR have been used to mine relationships between samples and generate robust image features. In medical report generation, RAG can serve as a clinical decision support tool by combining medical databases and research papers.
Research Methodology
Preliminary: Mamba
The Mamba network is based on the continuous State Space Model (SSM), which maps a one-dimensional input sequence to an output sequence through a latent hidden state. The continuous SSM is discretized using the Zero-Order Hold (ZOH) method. Mamba extends this formulation by making the parameters input-dependent (time-varying) and accelerates training and inference with hardware-aware algorithms.
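To make the recurrence concrete, below is a minimal NumPy/SciPy sketch of the standard ZOH discretization and linear-time scan that underlie Mamba-style SSMs. The matrix shapes, the step size, and the assumption that ΔA is invertible are illustrative choices, not details taken from the paper.

```python
import numpy as np
from scipy.linalg import expm

def zoh_discretize(A, B, delta):
    """Zero-Order Hold discretization of a continuous SSM h'(t) = A h(t) + B u(t).
    Returns (A_bar, B_bar) such that h_t = A_bar @ h_{t-1} + B_bar * u_t.
    Assumes delta * A is invertible (true for the toy example below)."""
    d = A.shape[0]
    A_bar = expm(delta * A)
    B_bar = np.linalg.inv(delta * A) @ (A_bar - np.eye(d)) @ (delta * B)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, u):
    """Linear-time sequential scan over a 1-D input sequence u."""
    d = A_bar.shape[0]
    h = np.zeros((d, 1))
    ys = []
    for u_t in u:                       # O(L) recurrence
        h = A_bar @ h + B_bar * u_t     # state update
        ys.append((C @ h).item())       # readout y_t = C h_t
    return np.array(ys)

# Toy usage with stable dynamics
d = 4
A = -np.eye(d)
B = np.ones((d, 1))
C = np.ones((1, d)) / d
A_bar, B_bar = zoh_discretize(A, B, delta=0.1)
y = ssm_scan(A_bar, B_bar, C, u=np.sin(np.linspace(0, 3, 50)))
```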
Overview
The proposed R2GenCSR framework consists of three main parts: the Mamba vision backbone, context retrieval module, and large language model (LLM) for report generation. The framework extracts visual tokens from the input X-ray image using the Mamba backbone, retrieves context samples from the training subset, and generates high-quality medical reports by feeding the vision tokens, context residual tokens, and prompt statements into the LLM.
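The overall data flow can be summarized in a few lines of illustrative Python; every callable here (backbone, retriever, projector, llm) is a placeholder for the corresponding component, not the authors' actual interface.

```python
def generate_report(xray, train_subset, backbone, retriever, projector, llm, prompt):
    """Illustrative end-to-end flow of the three-stage R2GenCSR pipeline (sketch only)."""
    vision_tokens = backbone(xray)                    # Mamba/VMamba visual tokens
    pos, neg = retriever(xray, train_subset)          # positive / negative context samples
    residuals = [vision_tokens - projector(s)         # context residual tokens
                 for s in (pos, neg)]
    return llm.generate(prompt=prompt,
                        vision_tokens=vision_tokens,
                        context_residuals=residuals)  # free-text medical report
```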
R2GenCSR Framework
Input Representation
The dataset contains N X-ray images, each represented by a feature map obtained using the VMamba backbone. Two distinct types of representations are generated: global features and sequential tokens.
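A small PyTorch sketch of how the two representations can be derived from a backbone feature map; the channel count and the 7 × 7 spatial grid are assumed values for illustration, not the paper's exact dimensions.

```python
import torch

# feature map from the vision backbone: (batch, channels, H, W)
feat = torch.randn(2, 768, 7, 7)

# global feature: spatial average pooling -> (batch, channels)
global_feat = feat.mean(dim=(2, 3))

# sequential tokens: flatten the spatial grid -> (batch, H*W, channels)
tokens = feat.flatten(2).transpose(1, 2)

print(global_feat.shape, tokens.shape)  # torch.Size([2, 768]) torch.Size([2, 49, 768])
```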
Context Sample Retrieval
Context samples are retrieved based on keywords in the medical reports. Positive and negative samples are selected using approaches like CheXbert and keyword matching. The residual tokens, which measure the difference between the input and context samples, are calculated and used to guide the LLM in generating accurate medical reports.
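The residual computation itself reduces to a token-wise difference between the query sample and the retrieved context samples. The sketch below uses made-up shapes; how the paper aligns and projects tokens before subtraction may differ.

```python
import torch

def residual_tokens(input_tokens, context_tokens):
    """Difference between query-image tokens and retrieved context-sample tokens.
    input_tokens:   (L, D) tokens of the input X-ray
    context_tokens: (K, L, D) tokens of K retrieved positive/negative samples
    Returns (K, L, D) residuals that guide the LLM alongside the prompt."""
    return input_tokens.unsqueeze(0) - context_tokens

q = torch.randn(49, 768)
ctx = torch.randn(3, 49, 768)   # e.g., three retrieved context samples
res = residual_tokens(q, ctx)
print(res.shape)                # torch.Size([3, 49, 768])
```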
LLM Head & Loss Function
The LLM generates detailed and accurate medical reports from the composite prompt, which includes visual and textual residuals. The cross-entropy loss function measures the difference between the generated reports and ground truth annotations, optimizing the R2GenCSR framework.
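Training thus reduces to standard next-token cross-entropy over the report portion of the sequence. A minimal PyTorch sketch, with the vocabulary size and the -100 ignore-index convention assumed rather than taken from the paper:

```python
import torch
import torch.nn.functional as F

# logits over the report vocabulary from the LLM head: (batch, seq_len, vocab)
logits = torch.randn(2, 60, 32000)
# ground-truth report token ids; -100 would mark prompt/padding positions to ignore
targets = torch.randint(0, 32000, (2, 60))

loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                       targets.view(-1),
                       ignore_index=-100)
```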
Experimental Design
Datasets and Evaluation Metrics
The performance of the model is evaluated on three datasets: IU-Xray, MIMIC-CXR, and CheXpert Plus. The evaluation metrics include CIDEr, BLEU, ROUGE-L, and METEOR, which measure the quality of the generated reports by comparing them to reference texts.
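As a quick reference, n-gram overlap metrics such as BLEU can be computed directly with NLTK; published results usually rely on standard caption-evaluation toolkits, so the snippet below is only an illustration of the metric, and the example sentences are invented.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "no acute cardiopulmonary abnormality".split()
candidate = "no acute cardiopulmonary process".split()

smooth = SmoothingFunction().method1
bleu4 = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-4: {bleu4:.3f}")
```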
Implementation Details
The input X-ray image is resized to 224 × 224, and beam search is used for report generation. The training procedure is conducted on a server with NVIDIA A800 80GB GPUs using mixed precision. The model is trained for 20 and 25 epochs on the MIMIC-CXR and IU-Xray datasets, respectively.
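A hedged preprocessing sketch with torchvision is shown below; the normalization statistics are the common ImageNet defaults and the file path is a placeholder, neither confirmed by the paper.

```python
import torch
from PIL import Image
from torchvision import transforms

# Resize the X-ray to 224 x 224; normalization stats are assumed ImageNet defaults
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("chest_xray.png").convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)

# Beam-search decoding, e.g. with a Hugging Face causal LM (placeholder names):
# report_ids = llm.generate(inputs_embeds=prompt_embeds, num_beams=3, max_new_tokens=120)
```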
Results and Analysis
Comparison on Public Benchmarks
Results on IU-Xray Dataset
The proposed R2GenCSR framework demonstrates competitive performance, achieving a BLEU-4 score of 0.206 and surpassing existing methods. It also achieves strong ROUGE-L, METEOR, and CIDEr scores.
Results on MIMIC-CXR Dataset
The method achieves a BLEU-1 score of 0.420, a BLEU-4 score of 0.136, and a ROUGE-L score of 0.291, indicating its ability to generate precise and contextually relevant medical reports.
Results on CheXpert Plus Dataset
The R2GenCSR-Llama2 model outperforms the R2Gen-GPT model, improving on all four evaluation metrics, validating the effectiveness of the proposed modules.
Component Analysis
A detailed component analysis on the IU-Xray dataset investigates the impact of key modules, including VMamba, context information, fixed pair strategy, and different language models. The results confirm that the VMamba module outperforms the Swin Transformer in extracting efficient visual features.
Ablation Study
Analysis of Different VMamba Backbones
The Base version of VMamba demonstrates the best overall performance, confirming that a larger VMamba backbone can capture more intricate visual features.
Analysis of Different Context Sample Retrieval Methods
The Keyword method retrieves samples based on keyword matching and is the most effective approach, surpassing other methods on all evaluated metrics.
Analysis of Different Stages for Feature Subtraction
Performing the feature subtraction in the LLM projection space yields better results than subtracting in the visual feature space, indicating that the stage at which the subtraction is applied matters.
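The two stages compared can be illustrated as follows. The projector here is a single linear layer purely for demonstration (with a non-linear, multi-layer projector the two options differ more substantially), and all dimensions are assumed.

```python
import torch
import torch.nn as nn

proj = nn.Linear(768, 4096)   # maps vision tokens into the LLM embedding space

vis = torch.randn(49, 768)    # query-image tokens
ctx = torch.randn(49, 768)    # tokens of one retrieved context sample

# Option A: subtract in the visual feature space, then project
res_visual_space = proj(vis - ctx)

# Option B: project first, then subtract in the LLM projection space
res_llm_space = proj(vis) - proj(ctx)
```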
Analysis of Different Resolutions for Report Generation
Performance improves as the input resolution decreases, suggesting that the VMamba backbone does not require high-resolution inputs to extract meaningful features from X-ray images.
Analysis of Different LLMs for Report Generation
The Qwen1.5-1.8B model demonstrates notable improvement in metrics, while the specialized MedicalGPT model exhibits competitive performance, highlighting the benefits of domain-specific fine-tuning.
Analysis of Different Number of Context Samples
An optimal point is found at three context sample pairs, indicating that a moderate amount of context enhances the learning process.
Analysis of VMamba and Swin-Transformer for Report Generation
The VMamba model runs more efficiently than the Swin Transformer, despite a slightly larger number of trainable parameters and higher memory consumption.
Analysis of Positive-Negative Context-Instructed Text and Instruction Prompt
Optimizing prompts can subtly affect the model’s output, with different instruction prompts resulting in slight variations in performance metrics.
Visualization
Feature Maps
The VMamba vision backbone effectively extracts discriminative visual features from X-ray images, emphasizing regions of interest such as lesions and organs.
X-ray Medical Report
The generated reports closely align with the ground truth, with only minor discrepancies in a few mismatched terms.
Overall Conclusion
The R2GenCSR framework introduces a novel context-guided efficient X-ray medical report generation approach, leveraging the Mamba model as a linear complexity vision backbone. The integration of vision tokens, context information, and prompt statements facilitates the generation of high-quality medical reports by the LLM. Extensive evaluations on the IU-Xray, MIMIC-CXR, and CheXpert Plus datasets demonstrate the efficacy and state-of-the-art performance of the proposed framework. This work contributes to the advancement of automated medical report generation and provides valuable insights for leveraging LLMs in other domains.