Authors:
Jintao Cheng, Xingming Chen, Jinxin Liang, Xiaoyu Tang, Xieyuanli Chen, Dachuan Li
Paper:
https://arxiv.org/abs/2408.10602
Introduction
The accurate identification of moving objects in 3D point cloud data is a critical task for autonomous driving and robotics. This task, known as Moving Object Segmentation (MOS), involves distinguishing moving objects from static entities in the environment. Traditional methods for MOS can be broadly classified into 3D voxel-based and 2D projection-based approaches. However, these methods face significant challenges, such as high computational demands and information loss during 3D-to-2D projection. To address these issues, the paper proposes a novel multi-view MOS model (MV-MOS) that fuses motion-semantic features from different 2D representations of point clouds.
Related Work
Non-Learning-Based Methods
Non-learning-based methods for MOS, such as occupancy-map and visibility-based approaches, do not require complex data preprocessing or long training times. However, their segmentation accuracy is generally lower than that of deep learning-based methods.
Learning-Based Methods
State-of-the-art learning-based methods include:
– Cylinder3D: Uses cylindrical voxelization and sparse 3D convolution.
– InsMOS and 4DMOS: Incorporate the temporal dimension into the 3D representation and apply 4D convolutions.
– LMNet: Uses 2D projection in the range view to construct residual maps.
– MotionSeg3D and MF-MOS: Build upon 2D representations and incorporate semantic branches.
– MotionBEV: Uses the BEV perspective for 2D projection.
These methods, while effective, still face challenges such as high computational burden and information loss in single-view 2D representations.
Research Methodology
Data Preprocessing
The original LiDAR point cloud is projected into range-view and BEV representations to derive the 2D mappings. Range-view and BEV residual maps are then constructed with the corresponding formulas to capture motion information from the two perspectives.
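As a concrete illustration, the sketch below shows how a range-view projection and a residual map between two aligned scans are commonly computed in range-image MOS pipelines. The image size, vertical field of view, and the exact residual formula here are illustrative assumptions and may differ from the values used in the paper.

```python
import numpy as np

def range_projection(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) LiDAR point cloud onto a 2D range image (spherical projection)."""
    fov_up_rad = np.radians(fov_up)
    fov_down_rad = np.radians(fov_down)
    fov = abs(fov_up_rad) + abs(fov_down_rad)

    depth = np.linalg.norm(points[:, :3], axis=1)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]

    yaw = -np.arctan2(y, x)                          # horizontal angle
    pitch = np.arcsin(z / np.maximum(depth, 1e-8))   # vertical angle

    u = 0.5 * (yaw / np.pi + 1.0) * W                        # column index in [0, W)
    v = (1.0 - (pitch + abs(fov_down_rad)) / fov) * H        # row index in [0, H)

    u = np.clip(np.floor(u), 0, W - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int32)

    range_image = np.full((H, W), -1.0, dtype=np.float32)
    range_image[v, u] = depth
    return range_image

def residual_map(range_cur, range_prev, eps=1e-8):
    """Normalized per-pixel range difference between two range images.

    Assumes the previous scan has already been transformed into the current
    scan's coordinate frame before projection.
    """
    valid = (range_cur > 0) & (range_prev > 0)
    res = np.zeros_like(range_cur)
    res[valid] = np.abs(range_cur[valid] - range_prev[valid]) / (range_cur[valid] + eps)
    return res
```

A BEV residual map can be built analogously by discretizing the ground plane (Cartesian or polar) instead of the spherical angles.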
Network Structure
The proposed MV-MOS model features a dual-view and multi-branch structure:
1. Motion Branch: Combines motion features from BEV and range view representations.
2. Semantic Branch: Provides supplementary semantic features and guides the motion branch.
3. Mamba-Based Feature Fusion Module: Fuses semantic and motion features to generate synthesized features for accurate segmentation.
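The following PyTorch-style skeleton sketches how such a dual-view, multi-branch network could be wired together. The module names, channel sizes, and the simple placeholder used in place of the actual Mamba block are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Simple 2D conv block standing in for the paper's encoder layers."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.net(x)

class MambaFusion(nn.Module):
    """Placeholder for the Mamba-based fusion module: flattens spatial features
    into a sequence, applies a sequence model, and reshapes back."""
    def __init__(self, channels):
        super().__init__()
        # A real implementation would use a Mamba (state-space) block here;
        # a linear layer keeps this sketch self-contained.
        self.seq_model = nn.Linear(channels, channels)
    def forward(self, sem_feat, mot_feat):
        fused = sem_feat + mot_feat                       # (B, C, H, W)
        b, c, h, w = fused.shape
        seq = fused.flatten(2).transpose(1, 2)            # (B, H*W, C)
        seq = self.seq_model(seq)
        return seq.transpose(1, 2).reshape(b, c, h, w)

class MVMOSSketch(nn.Module):
    """Dual-view, multi-branch skeleton: BEV + range-view motion features,
    a semantic branch, and a fusion head predicting moving/static per pixel."""
    def __init__(self, bev_ch, rv_ch, sem_ch, hidden=64, num_classes=2):
        super().__init__()
        self.bev_motion = ConvBlock(bev_ch, hidden)
        self.rv_motion = ConvBlock(rv_ch, hidden)
        self.semantic = ConvBlock(sem_ch, hidden)
        self.fusion = MambaFusion(hidden)
        self.head = nn.Conv2d(hidden, num_classes, 1)
    def forward(self, bev_residuals, rv_residuals, rv_inputs):
        # Motion branch: combine BEV and range-view motion features
        # (assumes BEV features have been aligned to the range-view grid).
        mot = self.bev_motion(bev_residuals) + self.rv_motion(rv_residuals)
        # Semantic branch supplements and guides the motion features.
        sem = self.semantic(rv_inputs)
        fused = self.fusion(sem, mot)
        return self.head(fused)                           # per-pixel moving/static logits
```

The key design point this sketch tries to capture is that the semantic branch output is fused with the combined BEV/range-view motion features before the segmentation head, rather than being predicted in isolation.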
Experimental Design
Experiment Setup
The MV-MOS model is trained and evaluated on the SemanticKITTI dataset, which includes semantic labels for various objects in real-world driving scenarios. The experiments are conducted using NVIDIA RTX 4090 and Tesla V100 GPUs, with specific training parameters such as learning rate, batch size, and optimizer settings.
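For orientation, a minimal training setup might look like the snippet below. The stand-in model, optimizer choice, learning rate, and schedule are placeholder assumptions and do not reproduce the paper's actual hyperparameters.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for MV-MOS; all hyperparameter values below are illustrative only.
model = nn.Conv2d(in_channels=8, out_channels=2, kernel_size=1)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
criterion = nn.CrossEntropyLoss(ignore_index=255)  # ignore unlabeled pixels

def train_step(batch_inputs, batch_labels):
    """One optimization step on a batch of projected inputs and per-pixel labels."""
    optimizer.zero_grad()
    logits = model(batch_inputs)             # (B, num_classes, H, W)
    loss = criterion(logits, batch_labels)   # labels: (B, H, W) with class indices
    loss.backward()
    optimizer.step()
    return loss.item()
```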
Evaluation Metrics
The Intersection over Union (IoU) metric is used to quantify the performance of the proposed approach in all experiments.
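Concretely, MOS evaluation on SemanticKITTI typically reports the IoU of the moving class, computed as TP / (TP + FP + FN) over all points. The sketch below illustrates that computation; the label convention (1 = moving, 0 = static) is a hypothetical example.

```python
import numpy as np

def moving_iou(pred, gt, moving_label=1):
    """IoU of the moving class: TP / (TP + FP + FN), computed over all points."""
    pred_m = (pred == moving_label)
    gt_m = (gt == moving_label)
    tp = np.logical_and(pred_m, gt_m).sum()
    fp = np.logical_and(pred_m, ~gt_m).sum()
    fn = np.logical_and(~pred_m, gt_m).sum()
    return tp / max(tp + fp + fn, 1)

# Example: per-point predictions vs. ground truth
pred = np.array([1, 0, 1, 1, 0])
gt   = np.array([1, 0, 0, 1, 1])
print(moving_iou(pred, gt))  # 0.5
```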
Results and Analysis
Evaluation Results and Comparisons
The proposed MV-MOS model achieves the highest IoU of 78.5% on the validation set and 80.6% on the test set of the SemanticKITTI-MOS benchmark, outperforming state-of-the-art models.
Ablation Studies
Ablation experiments demonstrate the effectiveness of the individual structures and modules in the MV-MOS model. The results show that the combination of the semantic branch, BR-Motion-Branch, and Mamba Block significantly improves segmentation accuracy.
Qualitative Analysis
Qualitative analysis through visualization shows that the MV-MOS model correctly classifies more moving points and produces more complete segmentation of objects than competing models.
Computational Efficiency
The MV-MOS model demonstrates competitive inference time, making it suitable for real-time processing in practical applications.
Overall Conclusion
The MV-MOS model introduces a novel approach to 3D moving object segmentation by effectively fusing motion and semantic features from multiple 2D representations. The dual-view and multi-branch structure, combined with the Mamba-based feature fusion module, enhances the model’s capability to capture rich and complementary information. Comprehensive experiments validate the effectiveness and generalization of the proposed model, making it a promising solution for autonomous driving and robotics applications.