Authors:
Zhiyang Lu, Qinghan Chen, Zhimin Yuan, Ming Cheng
Paper:
https://arxiv.org/abs/2408.07825
Introduction
Scene flow estimation is a critical task in dynamic scene perception: given two consecutive point cloud frames, it provides a 3D motion vector for each point of the source frame. It is a foundational component for downstream tasks such as object tracking, point cloud label propagation, and pose estimation. Traditional methods often rely on stereo or RGB-D images, but recent advances in deep learning have produced end-to-end algorithms designed specifically for scene flow prediction. These methods, however, still struggle with global flow embedding, handling non-rigid deformations, and generalizing from synthetic to real-world data.
Methodology
Problem Definition
The scene flow task aims to estimate point-wise 3D motion information between two consecutive point cloud frames. The input consists of the source frame ( S = \{s_i\}_{i=1}^N = \{x_i, f_i\}_{i=1}^N ) and the target frame ( T = \{t_j\}_{j=1}^M = \{y_j, g_j\}_{j=1}^M ), where ( x_i, y_j \in \mathbb{R}^3 ) are 3D coordinates and ( f_i, g_j \in \mathbb{R}^d ) are point features. The goal is to predict the 3D motion vector ( SF = \{sf_i \in \mathbb{R}^3\}_{i=1}^N ) for each point of the source frame.
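For concreteness, here is a minimal sketch of the tensor shapes implied by this definition; the sizes N, M, and d are arbitrary placeholders, not values from the paper.

```python
# Hypothetical tensor shapes for the scene flow task (not the authors' code).
import torch

N, M, d = 8192, 8192, 3          # points per frame and raw feature dim (assumed)
xyz_src = torch.randn(N, 3)      # source coordinates x_i
feat_src = torch.randn(N, d)     # source features f_i
xyz_tgt = torch.randn(M, 3)      # target coordinates y_j
feat_tgt = torch.randn(M, d)     # target features g_j

flow = torch.zeros(N, 3)         # predicted motion vector sf_i for every source point
assert flow.shape == xyz_src.shape
```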
Hierarchical Feature Extraction
The proposed network utilizes PointConv as the feature extraction backbone to build a pyramid network. The process involves Farthest Point Sampling (FPS) to extract center points, K-Nearest Neighbor (KNN) to group neighbor points, and PointConv to aggregate local features, resulting in higher-level semantic features.
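As a rough illustration of one pyramid level, the sketch below runs FPS and KNN grouping and substitutes a plain shared MLP with max-pooling for the actual PointConv aggregation; all point counts and layer widths are assumptions, not the authors' implementation.

```python
# Illustrative pyramid level: FPS picks centers, KNN groups neighbors, a shared MLP
# aggregates local features (standing in for PointConv). Sizes are assumptions.
import torch
import torch.nn as nn

def farthest_point_sampling(xyz, n_samples):
    """xyz: (N, 3) -> indices of n_samples points chosen greedily by distance."""
    N = xyz.shape[0]
    idx = torch.zeros(n_samples, dtype=torch.long)
    dist = torch.full((N,), float("inf"))
    farthest = torch.randint(0, N, (1,)).item()
    for i in range(n_samples):
        idx[i] = farthest
        dist = torch.minimum(dist, ((xyz - xyz[farthest]) ** 2).sum(-1))
        farthest = torch.argmax(dist).item()
    return idx

def knn_group(xyz, centers, k):
    """Return indices (n_centers, k) of the k nearest points to each center."""
    d = torch.cdist(centers, xyz)                 # (n_centers, N)
    return d.topk(k, largest=False).indices

xyz = torch.randn(4096, 3)
feat = torch.randn(4096, 32)

center_idx = farthest_point_sampling(xyz, 1024)
centers = xyz[center_idx]
group_idx = knn_group(xyz, centers, k=16)         # (1024, 16)

grouped = feat[group_idx]                         # (1024, 16, 32)
mlp = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
level_feat = mlp(grouped).max(dim=1).values       # (1024, 64) higher-level features
```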
Global Fusion Flow Embedding
The Global Fusion (GF) module captures the global relation between consecutive frames during flow initialization. After multi-level feature extraction, the highest-level features of the source and target frames are used for global fusion flow embedding in both the semantic context space and the Euclidean space. The Dual Cross Attentive (DCA) Fusion module merges the semantic contexts of the two frames, enhancing their mutual understanding before embedding.
The DCA module uses a cross-attentive mechanism to merge the semantic contexts, yielding an attentive weight map for the subsequent global aggregation. The initial global flow embedding ( GFE ) is constructed from both the fused semantic context and the Euclidean space, and the final global fusion flow embedding ( GFFE ) is aggregated using the attentive weights.
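A hedged sketch of what such a dual cross-attentive fusion could look like, built from standard multi-head cross-attention: each frame attends to the other, and the returned weight map could feed the global aggregation. The dimensions, head count, and residual fusion rule are assumptions rather than the paper's exact formulation.

```python
# Dual cross-attention fusion sketch (assumed structure, not the authors' DCA module).
import torch
import torch.nn as nn

class CrossAttentiveFusion(nn.Module):
    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.attn_s2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_t2s = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_s, feat_t):
        # feat_s: (B, N, dim) source context, feat_t: (B, M, dim) target context
        fused_s, w_s = self.attn_s2t(feat_s, feat_t, feat_t)  # source attends to target
        fused_t, w_t = self.attn_t2s(feat_t, feat_s, feat_s)  # target attends to source
        # residual fusion (assumed); w_s could serve as the attentive weight map
        return fused_s + feat_s, fused_t + feat_t, w_s

fuser = CrossAttentiveFusion()
s = torch.randn(2, 256, 128)
t = torch.randn(2, 256, 128)
fused_s, fused_t, attn_weights = fuser(s, t)
```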
Warping Layer
The warping layer upsamples the coarse sparse scene flow from the previous level into a coarse dense scene flow at the current level. This dense flow is then added to the source frame coordinates to produce the warped source frame, bringing the source and target frames closer for the subsequent residual flow estimation.
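One possible realization of this step uses inverse-distance-weighted 3-NN interpolation to upsample the coarse flow before adding it to the source coordinates; the interpolation scheme is an assumption.

```python
# Warping sketch: upsample sparse flow to the dense level, then warp the source frame.
import torch

def upsample_flow(dense_xyz, sparse_xyz, sparse_flow, k=3, eps=1e-8):
    """dense_xyz: (N, 3), sparse_xyz: (n, 3), sparse_flow: (n, 3) -> (N, 3)."""
    dist = torch.cdist(dense_xyz, sparse_xyz)             # (N, n)
    knn_dist, knn_idx = dist.topk(k, largest=False)
    w = 1.0 / (knn_dist + eps)
    w = w / w.sum(dim=1, keepdim=True)                    # (N, k) normalized weights
    return (sparse_flow[knn_idx] * w.unsqueeze(-1)).sum(dim=1)

dense_xyz = torch.randn(2048, 3)
sparse_xyz = torch.randn(512, 3)
sparse_flow = torch.randn(512, 3)

coarse_dense_flow = upsample_flow(dense_xyz, sparse_xyz, sparse_flow)
warped_src = dense_xyz + coarse_dense_flow                # warped source frame
```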
Spatial Temporal Re-embedding
After the warping layer, the spatiotemporal relation between consecutive frames changes. The Spatial Temporal Re-embedding (STR) module re-embeds temporal features between the warped source frame and target frame, along with spatial features within the warped source frame. This re-embedding is performed in a patch-to-patch manner.
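A rough sketch of patch-to-patch re-embedding under these assumptions: each warped source point gathers a KNN patch from the target frame (temporal) and from the warped source itself (spatial), and an MLP summarizes the coordinate offsets and features. The actual STR module differs in its details; layer sizes here are placeholders.

```python
# Patch-to-patch re-embedding sketch (assumed structure, not the authors' STR module).
import torch
import torch.nn as nn

def gather_patch(query_xyz, ref_xyz, ref_feat, k=16):
    idx = torch.cdist(query_xyz, ref_xyz).topk(k, largest=False).indices   # (N, k)
    rel_xyz = ref_xyz[idx] - query_xyz.unsqueeze(1)                        # (N, k, 3)
    return torch.cat([rel_xyz, ref_feat[idx]], dim=-1)                     # (N, k, 3+C)

warped_src = torch.randn(1024, 3)
src_feat = torch.randn(1024, 64)
tgt_xyz = torch.randn(1024, 3)
tgt_feat = torch.randn(1024, 64)

temporal_patch = gather_patch(warped_src, tgt_xyz, tgt_feat)     # warped source -> target
spatial_patch = gather_patch(warped_src, warped_src, src_feat)   # within warped source

mlp = nn.Sequential(nn.Linear(67, 128), nn.ReLU(), nn.Linear(128, 128))
re_embedded = torch.cat([mlp(temporal_patch).max(1).values,
                         mlp(spatial_patch).max(1).values], dim=-1)  # (1024, 256)
```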
Flow Prediction
The Flow Prediction (FP) module combines PointConv, MLP, and a Fully Connected (FC) layer. For each point in the source frame, its local flow embedding feature, warped coordinates, and re-embedded features are input into the module. The final output is the scene flow, regressed through the FC layer.
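A minimal stand-in for such a prediction head concatenates the three per-point inputs and regresses a 3D flow through an MLP plus a final FC layer; the channel widths are assumptions and the PointConv aggregation is omitted.

```python
# Flow prediction head sketch (simplified; widths assumed, PointConv omitted).
import torch
import torch.nn as nn

flow_embed = torch.randn(1024, 128)    # local flow embedding feature
warped_xyz = torch.randn(1024, 3)      # warped coordinates
re_embedded = torch.randn(1024, 256)   # re-embedded STR features

head = nn.Sequential(
    nn.Linear(128 + 3 + 256, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 3),                 # final FC layer regresses the scene flow
)
scene_flow = head(torch.cat([flow_embed, warped_xyz, re_embedded], dim=-1))  # (1024, 3)
```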
Training Losses
Hierarchical Supervised Loss
A supervised loss is applied directly against the ground-truth (GT) scene flow, using multi-level loss functions to supervise and optimize the model at each pyramid level.
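A sketch of what such a multi-level loss can look like: per-level endpoint errors against the (downsampled) GT flow, summed with level weights. The weights and the choice of norm are assumptions, not the paper's values.

```python
# Multi-level supervised loss sketch (level weights and norm are assumed).
import torch

def hierarchical_loss(pred_flows, gt_flows, level_weights):
    """pred_flows / gt_flows: lists of (N_l, 3) tensors, one per pyramid level."""
    total = 0.0
    for pred, gt, w in zip(pred_flows, gt_flows, level_weights):
        total = total + w * torch.norm(pred - gt, dim=-1).mean()
    return total

preds = [torch.randn(256, 3), torch.randn(1024, 3), torch.randn(4096, 3)]
gts   = [torch.randn(256, 3), torch.randn(1024, 3), torch.randn(4096, 3)]
loss_sup = hierarchical_loss(preds, gts, level_weights=[0.2, 0.4, 0.8])  # assumed weights
```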
Domain Adaptive Losses
Local Flow Consistency (LFC) Loss
Dynamic objects in real-world scenes typically undergo locally rigid motion, which manifests as consistency of the local flow. The LFC loss measures the difference between each point's predicted flow and the predicted flows of its KNN-within-radius neighbor group in the source frame.
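A simplified version of such a consistency term compares each point's predicted flow with those of its K nearest neighbors; the radius constraint mentioned above is omitted here for brevity.

```python
# Local flow consistency sketch (KNN only; the paper's radius filter is omitted).
import torch

def local_flow_consistency(src_xyz, pred_flow, k=8):
    """src_xyz: (N, 3), pred_flow: (N, 3) -> scalar loss."""
    idx = torch.cdist(src_xyz, src_xyz).topk(k + 1, largest=False).indices[:, 1:]  # drop self
    neighbor_flow = pred_flow[idx]                                                 # (N, k, 3)
    diff = neighbor_flow - pred_flow.unsqueeze(1)
    return torch.norm(diff, dim=-1).mean()

src_xyz = torch.randn(2048, 3)
pred_flow = torch.randn(2048, 3)
loss_lfc = local_flow_consistency(src_xyz, pred_flow)
```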
Cross-frame Feature Similarity (CFS) Loss
The semantic features of points in the warped source frame should be similar to those in the surrounding target frame. The CFS loss penalizes points that exhibit a similarity lower than a specified threshold.
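One plausible form of this term computes the cosine similarity between each warped source point's feature and that of its nearest target point, penalizing only points whose similarity falls below a threshold; the threshold value and the nearest-neighbor matching are assumptions.

```python
# Cross-frame feature similarity sketch (threshold and matching rule assumed).
import torch
import torch.nn.functional as F

def cross_frame_similarity_loss(warped_xyz, src_feat, tgt_xyz, tgt_feat, tau=0.7):
    nn_idx = torch.cdist(warped_xyz, tgt_xyz).argmin(dim=1)           # nearest target point
    sim = F.cosine_similarity(src_feat, tgt_feat[nn_idx], dim=-1)     # (N,)
    return F.relu(tau - sim).mean()                                   # penalize sim < tau

warped_xyz = torch.randn(2048, 3)
src_feat = torch.randn(2048, 64)
tgt_xyz = torch.randn(2048, 3)
tgt_feat = torch.randn(2048, 64)
loss_cfs = cross_frame_similarity_loss(warped_xyz, src_feat, tgt_xyz, tgt_feat)
```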
The final loss of the model is a combination of the supervised loss, LFC loss, and CFS loss.
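Putting the three terms together amounts to a weighted sum; the scalar values and weights below are placeholders, not reported numbers.

```python
# Combined objective sketch (loss values and weights are placeholders).
import torch

loss_sup = torch.tensor(1.0)        # hierarchical supervised loss
loss_lfc = torch.tensor(0.2)        # local flow consistency term
loss_cfs = torch.tensor(0.3)        # cross-frame feature similarity term
lambda_lfc, lambda_cfs = 0.1, 0.1   # assumed weights
loss_total = loss_sup + lambda_lfc * loss_lfc + lambda_cfs * loss_cfs
```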
Experiments
Datasets and Data Preprocessing
Experiments were performed on four datasets: the synthetic FlyingThings3D (FT3D) dataset and three real-world datasets, Stereo-KITTI, SF-KITTI, and LiDAR-KITTI. The datasets were preprocessed with mask labels to either remove non-corresponding points or retain occluded points.
Experimental Settings
The model was implemented with PyTorch 1.9 and trained on an NVIDIA RTX 3090 GPU. The AdamW optimizer was used with an initial learning rate of 0.001, halved every 80 epochs. The model was trained for 900 epochs with a batch size of 8. Cross-attention was used with 8 heads and ( d_a = 128 ).
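These hyperparameters translate into a straightforward optimizer and scheduler setup; the model below is only a placeholder for the SSRFlow network, and the training loop body is elided.

```python
# Optimizer/scheduler setup matching the reported hyperparameters (model is a stand-in).
import torch
import torch.nn as nn

model = nn.Linear(3, 3)  # placeholder for the SSRFlow network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# learning rate halved every 80 epochs; 900 epochs total with batch size 8
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=80, gamma=0.5)

for epoch in range(900):
    # ... run one training epoch with batch size 8 ...
    optimizer.step()      # placeholder for the per-batch update loop
    scheduler.step()
```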
Results and Analysis
The method exhibits strong generalization across a variety of scenarios, outperforming recent state-of-the-art methods on the FT3Ds and KITTIs benchmarks, and it achieves significant improvements on the real-world datasets.
Ablation Study
Ablation experiments were conducted to investigate the distinct impacts of the GF, STR, and DA Losses modules. The results highlight the importance of the DCA Fusion, external position encoder, and the effectiveness of the LFC and CFS losses.
Conclusion
The SSRFlow network estimates scene flow accurately and robustly by performing global semantic feature fusion and attentive flow embedding in both Euclidean and context spaces. The Spatial Temporal Re-embedding module effectively re-embeds deformed spatiotemporal features during local refinement, and the Domain Adaptive Losses improve the generalization of SSRFlow across datasets with differing patterns. Experiments show that the method achieves state-of-the-art performance on multiple distinct datasets.