Assessment of Cell Nuclei AI Foundation Models in Kidney Pathology

Authors:

Junlin Guo、Siqi Lu、Can Cui、Ruining Deng、Tianyuan Yao、Zhewen Tao、Yizhe Lin、Marilyn Lionts、Quan Liu、Juming Xiong、Catie Chang、Mitchell Wilkes、Mengmeng Yin、Haichun Yang、Yuankai Huo

Paper:

https://arxiv.org/abs/2408.06381

Introduction

Cell nuclei instance segmentation is a fundamental task in digital pathology, particularly for accurate disease diagnosis and treatment planning. However, the generalizability of current methodologies to handle diverse and large-scale datasets remains a significant challenge. This study evaluates the performance of state-of-the-art (SOTA) cell nuclei foundation models in kidney pathology, focusing on three widely used models: Cellpose, StarDist, and CellViT.

Methods

Diverse Large-scale Dataset

A diverse evaluation dataset was created, consisting of 2,542 kidney whole slide images (WSIs) from both human and rodent sources. These WSIs were collected from public datasets such as Kidney Tissue Atlas (KPMP), NEPTUNE, and HUBMAP, as well as an in-house collection at Vanderbilt University Medical Center. The dataset includes various tissue types and staining methods, such as Hematoxylin and Eosin (H&E), Periodic acid-Schiff methenamine (PASM), and Periodic acid-Schiff (PAS).

Nuclei Instance Segmentation

Nuclei instance segmentation was performed using three foundation models: Cellpose, StarDist, and CellViT. Each model has unique features and methodologies:

Cellpose: Utilizes a U-Net backbone to predict topological maps and a binary map, retrieving individual objects through a gradient tracking algorithm.
StarDist: Employs a star-convex polygon representation for nuclei segmentation, using a U-Net architecture for dense prediction of radial distances and object probabilities.
CellViT: Uses a hierarchical encoder-decoder Vision Transformer (ViT) backbone, leveraging pre-trained weights from a ViT trained on 104 million histological images.

Rating on Segmentation Performance

Two pathologist-trained students scored the model predictions as “good,” “medium,” or “bad” based on a standard provided by a renal pathologist. This scoring system quantitatively assessed each model’s predictions, and the ratings were used to curate strongly agreed-upon failure samples for future model training or fine-tuning.

Experiments

Distribution of Foundation Model Predictions

The distribution of predictions for each foundation model was analyzed across the evaluation dataset. This analysis provided insights into the performance and behavior of each model by examining the frequency of different prediction classes (“good,” “medium,” “bad”).

Agreement Analysis Between Foundation Models

An agreement analysis was conducted to assess cross-model performance on the kidney nuclei dataset. This involved calculating agreement percentages between each pair of models and categorizing image patches based on the level of agreement among the models.

Consensus “Bad” Image Patches Mining

By curating labels from the three foundation models, strong-agreed-upon failure samples were identified. These “global failure” image patches highlight areas where current cell nuclei foundation models need improvement in kidney pathology.

Results

Distribution of Predictions

The distribution of predictions for each model revealed that at least 40% of predictions were categorized as “bad” or “medium,” while at least 40% were “good.” This indicates that the models retain some transferable knowledge applicable to kidney pathology, but a performance gap remains.

Agreement Analysis

The agreement matrix showed that no pair of models exhibited over 90% agreement or complete disagreement. The highest agreement was observed between Cellpose and StarDist (0.76), while the lowest was between Cellpose and CellViT (0.67). This inconsistency suggests that combining multiple SOTA foundation models could enhance performance in downstream kidney nuclei tasks.

Consensus “Good” and “Failure” Examples

Consensus “good” examples showed that current nuclei foundation models perform well on image patches with darker nuclei and higher contrast. Conversely, consensus “failure” examples highlighted challenges with lower staining intensity, lighter nuclei, and reduced imaging contrast.

Conclusion

This study provides a comprehensive assessment of cell nuclei foundation models (Cellpose, StarDist, CellViT) in kidney pathology, focusing on nuclei instance segmentation. CellViT demonstrated superior performance, but a performance gap remains between general and kidney-specific nuclei segmentation. The curated consensus “good” and “failure” image patches offer valuable insights for improving cell nuclei foundation models in kidney pathology.

Acknowledgments

This work was supported by the National Institutes of Health, the National Science Foundation, and the Vanderbilt Institute for Clinical and Translational Research. Additional support was provided by NVIDIA through a hardware grant.

What's Hot

AAAI.2024 – Humans and AI

How Diffusion Models Learn to Factorize and Compose

Temporal Fairness in Decision Making Problems

Assessment of Cell Nuclei AI Foundation Models in Kidney Pathology

AAAI.2024 – Humans and AI

How Diffusion Models Learn to Factorize and Compose

Temporal Fairness in Decision Making Problems

NeCo: Improving DINOv2’s spatial representations in 19 GPU hours with Patch Neighbor Consistency

AAAI.2024 – Humans and AI

How Diffusion Models Learn to Factorize and Compose

Temporal Fairness in Decision Making Problems

NeCo: Improving DINOv2’s spatial representations in 19 GPU hours with Patch Neighbor Consistency

Our Picks

AAAI.2024 – Humans and AI

How Diffusion Models Learn to Factorize and Compose

Temporal Fairness in Decision Making Problems

Subscribe to Updates

What's Hot

Assessment of Cell Nuclei AI Foundation Models in Kidney Pathology

Authors:

Paper:

Introduction

Methods

Diverse Large-scale Dataset

Nuclei Instance Segmentation

Rating on Segmentation Performance

Experiments

Distribution of Foundation Model Predictions

Agreement Analysis Between Foundation Models

Consensus “Bad” Image Patches Mining

Results

Distribution of Predictions

Agreement Analysis

Consensus “Good” and “Failure” Examples

Conclusion

Acknowledgments

Related Posts