Authors:
Xiao Wang, Yao Rong, Fuling Wang, Jianing Li, Lin Zhu, Bo Jiang, Yaowei Wang
Paper:
https://arxiv.org/abs/2408.10488
Introduction
Sign Language Translation (SLT) is a crucial task in AI-assisted disability support. Traditional SLT methods rely on visible-light videos, which are sensitive to lighting conditions and rapid hand movements and raise privacy concerns. This paper introduces a novel approach that uses high-definition event streams for SLT, effectively mitigating these challenges. Event streams offer a high dynamic range and dense temporal signals, making them resilient to low illumination and motion blur. Additionally, their spatial sparsity helps protect the privacy of the individuals being recorded.
The authors present a new high-resolution event stream sign language dataset, termed Event-CSL, which fills a significant data gap in this research area. The dataset includes 14,827 videos, 14,821 glosses, and 2,544 Chinese words in the text vocabulary, collected across various indoor and outdoor scenes. The paper also proposes a novel baseline method leveraging the Mamba model’s ability to integrate temporal information from CNN features, resulting in improved SLT outcomes.
Related Work
Sign Language Translation
Sign Language Translation has garnered significant attention in computer vision; it aims to convert sign language videos into spoken-language text to facilitate communication for the deaf community. Various approaches have been proposed, including:
- Camgöz et al. introduced the SLT problem and proposed a Transformer-based architecture for end-to-end learning of sign language recognition and translation.
- GloFE extracts common concepts from text and uses global embeddings for cross-attention lookups.
- Zhou et al. proposed a two-stage approach combining contrastive language-image pre-training with masked self-supervised learning.
- SignNet II and GASLT are Transformer-based models that further improve sign language translation performance.
State Space Model
State Space Models (SSMs) describe dynamic systems using differential or difference equations. Recent advancements include:
- Rangapuram et al. utilized recurrent neural networks to parameterize linear state-space models.
- Gu et al. proposed the Structured State Space (S4) sequence model and the Linear State Space Layer (LSSL).
- Mamba and VMamba are recent SSM-based sequence and vision models that improve computational efficiency while retaining a global receptive field.
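For background, these works all build on the same linear state-space formulation, sketched below in standard S4/Mamba notation (a generic formulation for context, not equations taken from this paper):

```latex
% Continuous-time linear SSM: hidden state h(t), input x(t), output y(t)
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)

% Zero-order-hold discretization with step size \Delta:
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B

% Resulting discrete recurrence over a frame/token sequence x_1, \dots, x_T:
h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t
```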
Inspired by these works, the authors propose augmenting local CNN features using SSM for event-based SLT.
Research Methodology
Overview
The proposed event-based SLT framework involves stacking event streams into frames, feeding them into a residual network for feature extraction, and using Mamba blocks to capture long-range temporal features. The enhanced features are then processed through a temporal convolutional module and a sign embedding module before being decoded into Chinese language sentences using a Transformer-based language model.
Input Representation
Event streams are transformed into event images, resized to a fixed resolution, and processed in mini-batches during training.
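As a rough illustration of this step, the sketch below accumulates raw events (x, y, polarity, timestamp) into a fixed number of event frames. The frame count and the three-channel counting representation are illustrative assumptions, not necessarily the paper's exact preprocessing.

```python
import numpy as np

def events_to_frames(events, height, width, num_frames=16):
    """Accumulate raw events into `num_frames` three-channel count images.

    events: (N, 4) array with rows (x, y, polarity, timestamp), where
    polarity is 0 (OFF) or 1 (ON), x < width, and y < height.
    Returns an array of shape (num_frames, 3, height, width) whose channels
    hold OFF counts, ON counts, and total counts per pixel.
    """
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    p = events[:, 2].astype(np.int64)
    t = events[:, 3].astype(np.float64)

    # Assign every event to one of `num_frames` equal-duration time slices.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    slice_idx = np.clip((t_norm * num_frames).astype(np.int64), 0, num_frames - 1)

    frames = np.zeros((num_frames, 3, height, width), dtype=np.float32)
    np.add.at(frames, (slice_idx, p, y, x), 1.0)   # polarity-wise event counts
    frames[:, 2] = frames[:, 0] + frames[:, 1]     # total event count
    return frames
```

The resulting frames can then be resized to the network's input resolution and grouped into mini-batches for training.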
Network Architecture
- Hybrid CNN-Mamba Vision Backbone Network: Combines ResNet-18 for local feature extraction with Mamba layers for long-range temporal context modeling.
- Sign Language Decoder: Applies a temporal convolutional module and a sign embedding module to the backbone features, then decodes them into Chinese sentences with a Transformer language model initialized from pre-trained mBART parameters. A combined sketch of the backbone and decoder follows this list.
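The sketch below shows one way these two components could be wired together in PyTorch, assuming the mamba_ssm and transformers packages are installed. The layer sizes, Mamba hyperparameters, mBART checkpoint name, and the exact way visual tokens enter the language model are illustrative assumptions rather than the paper's verified configuration.

```python
import torch.nn as nn
import torchvision
from mamba_ssm import Mamba                       # assumed dependency
from transformers import MBartForConditionalGeneration

class EventSLTModel(nn.Module):
    """Rough sketch: per-frame ResNet-18 features, Mamba blocks over time,
    a 1D temporal convolution, a sign-embedding projection, and an
    mBART-initialized translation model."""

    def __init__(self, d_visual=512, num_mamba_layers=2,
                 mbart_name="facebook/mbart-large-cc25"):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)
        resnet.fc = nn.Identity()                 # keep the 512-d pooled feature
        self.cnn = resnet
        self.mamba = nn.ModuleList([
            Mamba(d_model=d_visual, d_state=16, d_conv=4, expand=2)
            for _ in range(num_mamba_layers)
        ])
        self.temporal_conv = nn.Conv1d(d_visual, d_visual, kernel_size=5, padding=2)
        self.mbart = MBartForConditionalGeneration.from_pretrained(mbart_name)
        self.sign_embed = nn.Linear(d_visual, self.mbart.config.d_model)

    def forward(self, frames, labels):
        # frames: (B, T, 3, H, W) stacked event images
        # labels: (B, L) target token ids for the Chinese sentence
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)   # (B, T, 512)
        for block in self.mamba:                                 # long-range temporal context
            feats = feats + block(feats)
        feats = self.temporal_conv(feats.transpose(1, 2)).transpose(1, 2)
        sign_tokens = self.sign_embed(feats)                     # (B, T, d_model)
        # Visual tokens are fed to mBART in place of text embeddings; with
        # `labels` supplied, the output carries both .logits and .loss.
        return self.mbart(inputs_embeds=sign_tokens, labels=labels)
```

Feeding the sign embeddings to mBART as inputs_embeds, with the target token ids passed as labels, lets the pre-trained translation model produce the logits and the training loss in a single forward pass.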
Loss Function
The cross-entropy loss is used to measure the discrepancy between the annotated reference sentences and the generated sentences.
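In PyTorch terms, this objective is the standard token-level cross-entropy over the decoder's vocabulary distribution, with padded target positions ignored; the shapes and the -100 ignore index below follow common Hugging Face conventions and are assumptions about the exact implementation.

```python
import torch.nn as nn

IGNORE_INDEX = -100  # padded target positions are masked out of the loss

criterion = nn.CrossEntropyLoss(ignore_index=IGNORE_INDEX)

def translation_loss(logits, target_ids):
    """Token-level cross-entropy between decoder logits and reference tokens.

    logits: (B, L, vocab_size) scores for each target position.
    target_ids: (B, L) reference token ids, with pad positions set to -100.
    """
    return criterion(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
```

When targets are routed through mBART's labels argument, as in the earlier sketch, essentially this loss is computed internally.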
Experimental Design
Event-CSL Dataset
The Event-CSL dataset is collected following protocols ensuring diversity in views, camera motions, scenes, light intensity, action continuity, and sign language content. It includes 14,827 videos, 14,821 glosses, and 2,544 Chinese words, divided into training, validation, and testing subsets.
Statistical Analysis
The dataset surpasses existing event-based sign language datasets in terms of video count, resolution, gloss vocabulary, and text vocabulary. A word cloud visualization of the textual annotations is provided.
Benchmark Baselines
The performance of several SLT algorithms is reported as baselines for future comparisons, including gloss-based and gloss-free methods.
Results and Analysis
Dataset and Evaluation Metric
Experiments are conducted on four event-based SLT datasets: Event-CSL, EvSign, PHOENIX-2014T-Event, and CSL-Daily-Event. BLEU and ROUGE-L metrics are used to assess performance.
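As a reference point, the snippet below shows one common way to compute these scores; sacrebleu's Chinese tokenizer and the simplified character-level ROUGE-L are assumptions and may differ from the authors' exact evaluation scripts.

```python
import sacrebleu

def rouge_l_f1(reference, hypothesis):
    """Character-level ROUGE-L F1 via longest common subsequence (illustrative)."""
    r = [c for c in reference if not c.isspace()]
    h = [c for c in hypothesis if not c.isspace()]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            dp[i][j] = dp[i-1][j-1] + 1 if r[i-1] == h[j-1] else max(dp[i-1][j], dp[i][j-1])
    lcs = dp[len(r)][len(h)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(h), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

# Hypothetical example pair; real evaluation uses the datasets' reference translations.
hypotheses = ["今天天气很好"]
references = ["今天天气真好"]

# Corpus BLEU; tokenize="zh" applies sacrebleu's Chinese tokenizer.
bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="zh")
print(f"BLEU: {bleu.score:.2f}")

rouge = sum(rouge_l_f1(r, h) for r, h in zip(references, hypotheses)) / len(references)
print(f"ROUGE-L (char-level F1): {rouge * 100:.2f}")
```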
Implementation Details
The model uses the SGD optimizer with a cosine annealing scheduler, implemented in PyTorch and tested on A800 GPUs.
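A minimal sketch of this optimization setup; the learning rate, momentum, weight decay, and epoch count are placeholders rather than the paper's reported values.

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the full SLT network (backbone + decoder).
model = nn.Linear(512, 512)

# SGD with cosine annealing, as described in the implementation details.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

num_epochs = 100
for epoch in range(num_epochs):
    # ... run one training epoch over the event-frame batches here ...
    scheduler.step()  # anneal the learning rate along a cosine curve each epoch
```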
Comparison on Public SLT Datasets
The proposed method outperforms state-of-the-art methods on the Event-CSL, EvSign, PHOENIX-2014T-Event, and CSL-Daily-Event datasets, demonstrating its effectiveness.
Ablation Study
Component analysis on the Event-CSL dataset shows the impact of each module. The hybrid CNN-Mamba backbone significantly improves performance by modeling long-range temporal relations.
Efficiency Analysis
The proposed method is efficient in terms of parameters, GPU memory, and speed compared to Transformer-based architectures.
Visualization
Qualitative results show the generated sentences compared to reference sentences, demonstrating the model’s reliability in capturing the meaning of sign language videos.
Limitation Analysis
Future improvements include designing architectures that fully exploit high-definition event streams for SLT and adopting larger language models for better performance.
Overall Conclusion
This study introduces a novel approach to SLT using high-definition event streams, addressing limitations of traditional methods. The Event-CSL dataset and the proposed baseline method demonstrate significant advancements in SLT, paving the way for future research and development in this field.
Code:
https://github.com/event-ahu/openesl