Authors:
Thanh Thi Nguyen, Campbell Wilson, Janis Dalins
Paper:
https://arxiv.org/abs/2408.10503
Introduction
Biometric systems have become an integral part of modern security and identification systems. Among various biometric features, hand images are particularly valuable due to their unique and stable characteristics, such as vein patterns, fingerprints, and hand geometry. This paper explores the application of Vision Transformers (ViTs) for the classification of hand images, leveraging their advanced capabilities in image processing. The study also introduces adaptive knowledge distillation methods to enhance the performance of ViTs across different domains, addressing the challenge of catastrophic forgetting.
Related Work
Traditional Methods in Hand Image Classification
Previous studies have employed various feature extraction methods and classifiers for hand image classification. Techniques such as ripplet-I transform, discrete cosine transform, and local binary patterns (LBP) have been used in conjunction with classifiers like SVM and CNNs. For instance, Hardalac et al. [18] utilized LBP features and Euclidean distance classifiers for palm print verification, while Afifi [1] introduced a large dataset of hand images and used CNN and LBP for feature extraction and SVM for classification.
Vision Transformers in Biometric Applications
Vision Transformers (ViTs) have shown superior performance in image processing tasks. Studies like [16] have demonstrated the effectiveness of ViTs in vein biometric recognition. The ViT models, such as Google ViT, DeiT, BEiT, DINOv2, Swin Transformer V2, and ViT-MAE, have been explored for their capabilities in handling large-scale image datasets and providing explainable AI solutions.
Knowledge Distillation and Domain Adaptation
Knowledge distillation, introduced by Hinton et al. [20], involves transferring knowledge from a larger teacher model to a smaller student model. This technique is valuable for model compression and deployment on smaller devices. However, traditional knowledge distillation methods face challenges when the target domain significantly differs from the source domain. This study proposes adaptive distillation methods to address these challenges, enabling the student model to retain knowledge from the source domain while adapting to the target domain.
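For reference, the logit-based distillation of Hinton et al. [20] trains the student against a blend of temperature-softened teacher outputs and the ground-truth labels. The following is a minimal PyTorch sketch of that loss; the temperature and weighting values are illustrative assumptions, not settings from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Hinton-style knowledge distillation loss (illustrative hyperparameters)."""
    # Soft targets from the (frozen) teacher, softened by the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence, scaled by T^2 to keep gradient magnitudes comparable.
    kd_term = F.kl_div(soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Standard supervised loss on the hard labels.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1.0 - alpha) * ce_term
```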
Research Methodology
Vision Transformer Models
The study investigates six ViT variants (see the loading sketch after this list):
- Google ViT: Pre-trained on ImageNet-21k and fine-tuned on ImageNet.
- DeiT: Data-efficient image Transformer pre-trained and fine-tuned on ImageNet-1k.
- BEiT: Pre-trained on ImageNet-21k using masked patches and fine-tuned on ILSVRC2012.
- DINOv2: Self-distillation model trained on extensive curated data without supervision.
- Swin Transformer V2: Constructs hierarchical feature maps and uses scaled cosine attention.
- ViT-MAE: Pre-trained using masked autoencoders with a focus on reconstructing raw pixel values.
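All six variants are distributed as pre-trained checkpoints, so adapting one to hand image classification largely amounts to replacing the classification head. The snippet below is a minimal sketch using the Hugging Face transformers library with the Google ViT checkpoint; the checkpoint name and the class count are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np
from transformers import ViTForImageClassification, ViTImageProcessor

# Illustrative checkpoint (Google ViT); the other variants (DeiT, BEiT, DINOv2,
# Swin V2, ViT-MAE) are loaded the same way through their own model classes.
checkpoint = "google/vit-base-patch16-224"
num_subjects = 230  # e.g. the number of identities in the IIT Delhi dataset

processor = ViTImageProcessor.from_pretrained(checkpoint)
model = ViTForImageClassification.from_pretrained(
    checkpoint,
    num_labels=num_subjects,
    ignore_mismatched_sizes=True,  # replace the 1000-class ImageNet head
)

# A single forward pass on a dummy RGB image, just to show the interface.
dummy = (np.random.rand(224, 224, 3) * 255).astype("uint8")
inputs = processor(images=dummy, return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, num_subjects)
```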
Explainability Tools
To understand the internal representations of ViTs, the study employs Deep Feature Factorization (DFF) and Gradient-weighted Class Activation Mapping (Grad-CAM). These tools provide insights into the features and regions of images that contribute to the model’s predictions.
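Both techniques are implemented in the open-source pytorch-grad-cam package (which also ships a DeepFeatureFactorization class). The sketch below shows one way Grad-CAM can be applied to a Hugging Face ViT classifier; it assumes the `model` variable from the earlier loading sketch, and the chosen target layer and token-to-grid reshape are common conventions rather than details taken from the paper.

```python
import torch
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

class LogitsWrapper(torch.nn.Module):
    """Expose raw logits so Grad-CAM can index the model output directly."""
    def __init__(self, hf_model):
        super().__init__()
        self.hf_model = hf_model

    def forward(self, pixel_values):
        return self.hf_model(pixel_values).logits

def vit_reshape_transform(tensor, height=14, width=14):
    """Drop the [CLS] token and fold the patch tokens back into a 2-D grid."""
    result = tensor[:, 1:, :].reshape(tensor.size(0), height, width, tensor.size(2))
    return result.permute(0, 3, 1, 2)  # (B, C, H, W), as Grad-CAM expects

# `model` is assumed to be the fine-tuned ViTForImageClassification from above.
target_layers = [model.vit.encoder.layer[-1].layernorm_before]

cam = GradCAM(model=LogitsWrapper(model),
              target_layers=target_layers,
              reshape_transform=vit_reshape_transform)

input_tensor = torch.rand(1, 3, 224, 224)   # a preprocessed hand image
targets = [ClassifierOutputTarget(0)]       # explain the prediction for subject 0
heatmap = cam(input_tensor=input_tensor, targets=targets)[0]  # (224, 224) array
```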
Experimental Design
Datasets
- IIT Delhi Dataset: Comprises touchless palm print images from 230 subjects, captured at a resolution of 1600×1200.
- 11k Hands Dataset: Contains 11,076 hand images from 190 subjects, including both palmar and dorsal sides, with detailed metadata.
Experimental Setup
The experiments involve fine-tuning ViT models on the source domain and evaluating their performance on the target domain. Several transfer scenarios are considered, such as transferring models from left-hand to right-hand images and from palm-side to dorsal-side images. Performance is measured using classification accuracy, and the experiments are repeated multiple times to ensure robustness.
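In code, such a transfer evaluation reduces to fine-tuning on a source-domain loader and reporting top-1 accuracy on a target-domain loader without further training. The sketch below assumes hypothetical `source_loader` and `target_loader` DataLoaders and the model from the earlier snippet; the optimizer, learning rate, and epoch count are illustrative, not the paper's settings.

```python
import torch

def fine_tune(model, source_loader, epochs=10, lr=3e-5, device="cuda"):
    """Fine-tune the classifier on the source domain (illustrative settings)."""
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in source_loader:
            images, labels = images.to(device), labels.to(device)
            loss = torch.nn.functional.cross_entropy(model(images).logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

@torch.no_grad()
def accuracy(model, loader, device="cuda"):
    """Top-1 classification accuracy on a held-out loader."""
    model.to(device).eval()
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).logits.argmax(dim=-1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total

# Transfer inference: train on the source domain, test on the target domain.
# fine_tune(model, source_loader)
# print("target-domain accuracy:", accuracy(model, target_loader))
```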
Results and Analysis
ViT Models vs Traditional Methods
The results demonstrate that ViT models significantly outperform traditional methods in hand image classification. For instance, on the IIT Delhi dataset, ViT models achieve an accuracy of at least 0.982, compared with the best previously reported accuracy of 0.940 from traditional methods. Similarly, on the 11k Hands dataset, ViT models exhibit superior performance across the various experimental scenarios.
Explainability Results
The explainability tools reveal valuable insights into the internal states of ViT models. For example, the Grad-CAM method shows that the ViT model identifies subjects based on features like nail polish, while the DFF method highlights the importance of hand shape and finger dimensions.
Transfer Inference
The transfer inference experiments show varying levels of accuracy when ViT models are applied to different domains. Transferring from left-hand to right-hand images yields higher accuracy than transferring from palm-side to dorsal-side images, reflecting the challenge posed by larger domain differences.
Domain Adaptation
The adaptive distillation methods proposed in the study demonstrate significant improvements in mitigating catastrophic forgetting. Method 2, which involves imitating the teacher’s internal states, shows superior performance in retaining knowledge from the source domain while adapting to the target domain.
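Since the paper is summarised here only at a high level, the sketch below is an assumed illustration of what imitating the teacher's internal states can look like: alongside the classification loss on target-domain labels, the student is penalised (here with a mean-squared error, a common choice) for drifting away from the hidden states of a frozen source-domain teacher. The matched layer and the loss weight `beta` are hypothetical, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def adaptive_distillation_loss(student, teacher, pixel_values, labels, beta=1.0):
    """Target-domain loss plus imitation of a frozen teacher's hidden states.

    Illustrative sketch: the teacher is the source-domain model, the student is
    being adapted to the target domain, and `beta` (assumed) weights how
    strongly the student must track the teacher's internal representations.
    """
    student_out = student(pixel_values, output_hidden_states=True)
    with torch.no_grad():
        teacher_out = teacher(pixel_values, output_hidden_states=True)

    # Supervised loss on the target-domain labels.
    ce_term = F.cross_entropy(student_out.logits, labels)

    # Imitation term: match the final hidden states of the teacher (MSE here;
    # which layers to match is a design choice, not fixed by this sketch).
    imitation = F.mse_loss(student_out.hidden_states[-1],
                           teacher_out.hidden_states[-1])

    return ce_term + beta * imitation
```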
Overall Conclusion
The study successfully demonstrates the effectiveness of Vision Transformers in hand image classification and introduces adaptive knowledge distillation methods to address the challenge of catastrophic forgetting. The proposed methods enable ViT models to achieve high performance on both source and target domains, making them suitable for real-world applications such as access control and identity verification. Future work can explore the application of these methods to other biometric data types, such as fingerprints and iris recognition.
By implementing robust regulations, security measures, and transparency mechanisms, it may be possible to harness the benefits of these hand image classification methods while minimizing their ethical drawbacks in police surveillance contexts.