Authors:
Junwei You, Haotian Shi, Zhuoyu Jiang, Zilin Huang, Rui Gan, Keshu Wu, Xi Cheng, Xiaopeng Li, Bin Ran
Paper:
https://arxiv.org/abs/2408.09251
Introduction
Background
The field of autonomous driving has seen significant advances, particularly with the development of end-to-end (E2E) systems. These systems handle the entire driving pipeline, from environmental perception to vehicle control, using integrated machine learning models that process complex sensor data in real time. The integration of large-scale machine learning models, such as large language models (LLMs) and vision-language models (VLMs), has further enhanced the capabilities of these systems.
Problem Statement
Despite this progress, a notable gap remains in applying foundation models such as VLMs to E2E cooperative autonomous driving. Traditional AI models struggle to integrate complex, multimodal data from heterogeneous sources. This study introduces V2X-VLM, an innovative E2E vehicle-infrastructure cooperative autonomous driving (VICAD) framework that leverages large VLMs to enhance situational awareness, decision-making, and overall driving performance.
Related Work
LLM Enhanced E2E Autonomous Driving
Recent research has highlighted the transformative potential of LLMs in E2E autonomous driving. For instance, DriveGPT4 integrates multi-frame video inputs and textual queries to predict vehicle actions, enhancing transparency and user trust. Similarly, LMDrive utilizes multimodal sensor data and natural language instructions to generate control signals, facilitating real-time interaction with complex environments.
VLM Advancement in E2E Autonomous Driving
The development of VLMs has expanded the capabilities of autonomous systems by integrating visual and linguistic processing. DriveVLM, for example, combines scene description, analysis, and hierarchical planning modules to tackle challenges in urban driving scenarios. VLP leverages VLMs to integrate common-sense reasoning into autonomous driving systems, enhancing the system’s ability to generalize across diverse urban environments.
V2X Enabled Cooperative Autonomous Driving
V2X communication systems enable vehicles to communicate with each other and with infrastructure elements, providing real-time updates and a broader context of the driving environment. Studies have demonstrated the potential of V2X systems to improve situational awareness, maneuvering capabilities, and overall traffic safety and efficiency.
Research Methodology
Problem Formulation
The goal of the proposed E2E framework, V2X-VLM, is to plan an optimal trajectory for the ego vehicle by integrating and processing multimodal data from multiple sources. The inputs include images from vehicle-mounted cameras and infrastructure cameras, together with textual information indicating the ego vehicle's current position.
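To make this input-output contract concrete, the following is a minimal Python sketch of the formulation. The class and field names (V2XInput, PlannedTrajectory, the waypoint format) are illustrative assumptions, not the paper's exact interfaces.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class V2XInput:
    """One multimodal sample for the planner (hypothetical structure)."""
    vehicle_image: np.ndarray         # RGB frame from the vehicle-mounted camera
    infrastructure_image: np.ndarray  # RGB frame from the roadside (infrastructure) camera
    ego_position_text: str            # text prompt stating the ego vehicle's current position


@dataclass
class PlannedTrajectory:
    """Planner output: future (x, y) waypoints for the ego vehicle."""
    waypoints: List[Tuple[float, float]]


def plan_trajectory(sample: V2XInput) -> PlannedTrajectory:
    """Stands in for the E2E mapping learned by the VLM in V2X-VLM."""
    raise NotImplementedError
```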
Overall V2X-VLM Framework Architecture
The V2X-VLM framework integrates data from vehicle-mounted cameras, infrastructure sensors, and textual information to form a comprehensive E2E system for cooperative autonomous driving. The large VLM synthesizes and analyzes diverse input types to provide a holistic understanding of the driving environment.
Scene Understanding and Interpretation
The VLM plays a crucial role in understanding and interpreting perception images captured from both the vehicle and infrastructure sides. It identifies essential elements such as the types of nearby vehicles, road signs, traffic signals, weather conditions, and broader traffic patterns, providing a comprehensive understanding of the environment.
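As a rough illustration of this scene-description step, the snippet below queries the publicly released Florence-2 checkpoint (the VLM used later in the paper) with its detailed-caption task prompt. The usage follows the public Hugging Face model card rather than the authors' actual pipeline, and the image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("infrastructure_view.jpg")  # placeholder path for a roadside camera frame
task = "<DETAILED_CAPTION>"                    # captioning task token from the Florence-2 model card
inputs = processor(text=task, images=image, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=256,
    )

raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
scene_description = processor.post_process_generation(raw_text, task=task, image_size=image.size)
print(scene_description)  # textual description of vehicles, signals, and other scene elements
```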
E2E VICAD Multimodal Input Paradigm for VLMs
The proposed paradigm for handling multimodal data emphasizes simplicity and effectiveness. Data from vehicle-mounted and infrastructure cameras are combined into pairs, with each pair embedded with a text prompt indicating the ego vehicle’s current position. This approach reduces computational overhead and redundancy, making real-time processing more feasible.
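A minimal sketch of this paired-input idea is shown below: one vehicle-side frame and one infrastructure-side frame are combined into a single visual input, and a short text prompt states the ego vehicle's current position. The side-by-side concatenation and prompt wording are assumptions for illustration, not necessarily the paper's exact recipe.

```python
from typing import Tuple

import numpy as np


def build_v2x_sample(vehicle_img: np.ndarray,
                     infra_img: np.ndarray,
                     ego_xy: Tuple[float, float]) -> Tuple[np.ndarray, str]:
    """Pair a vehicle-side and an infrastructure-side RGB frame and attach
    a position prompt, producing one multimodal sample for the VLM."""
    # Crop both frames to a common height, then place them side by side.
    h = min(vehicle_img.shape[0], infra_img.shape[0])
    paired_image = np.concatenate([vehicle_img[:h], infra_img[:h]], axis=1)

    # Text prompt embedding the ego vehicle's current position (wording is illustrative).
    prompt = f"The ego vehicle is currently located at ({ego_xy[0]:.2f}, {ego_xy[1]:.2f})."
    return paired_image, prompt
```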
Experimental Design
Dataset
The DAIR-V2X dataset is used for evaluation. It includes extensive data from vehicle-mounted and infrastructure sensors, capturing RGB images and LiDAR data across diverse traffic scenarios. The dataset features detailed annotations, facilitating the development of advanced V2X systems.
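For orientation, a loading loop over cooperative vehicle-infrastructure frame pairs might look like the sketch below. The index file name and record fields are hypothetical placeholders, not the actual DAIR-V2X schema; consult the dataset's documentation for the real layout.

```python
import json
from pathlib import Path

dataset_root = Path("/data/DAIR-V2X")  # assumed local path to the dataset

# Hypothetical index of cooperative frame pairs; field names are placeholders.
index = json.loads((dataset_root / "cooperative_index.json").read_text())

for record in index:
    vehicle_image_path = dataset_root / record["vehicle_image"]        # vehicle-side RGB frame
    infra_image_path = dataset_root / record["infrastructure_image"]   # infrastructure-side RGB frame
    ego_position = record["ego_position"]                              # ego vehicle position annotation
    # ...load the two frames and build the paired multimodal input described above
```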
Setups
V2X-VLM builds on the pre-trained VLM Florence-2. The model's vision tower is kept frozen during fine-tuning to preserve its robust visual feature representations. The experiments are conducted on a single NVIDIA RTX 4090 GPU.
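A minimal sketch of this setup, assuming the public Hugging Face release of Florence-2, is shown below. The `vision_tower` attribute name follows that release and may differ from the authors' exact code; the learning rate is an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Freeze the pre-trained vision tower so only the language-side weights are fine-tuned.
for param in model.vision_tower.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")

# Optimizer over the remaining trainable parameters (learning rate is illustrative).
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
```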
Baselines
The performance of V2X-VLM is evaluated against several baseline methods, including No Fusion, Vanilla, V2VNet, CooperNaut, and UniV2X. Metrics such as L2 Error and Transmission Cost are used to assess the effectiveness, safety, and communication efficiency of the proposed framework.
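For reference, L2 error in open-loop planning is commonly computed as the average Euclidean distance between predicted and ground-truth waypoints over the planning horizon. A small self-contained sketch, with made-up numbers rather than results from the paper:

```python
import numpy as np


def average_l2_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average Euclidean distance between predicted and ground-truth
    waypoints, each of shape (T, 2) for a horizon of T steps."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


# Toy usage with made-up waypoints (not results from the paper):
pred = np.array([[1.0, 0.0], [2.0, 0.1], [3.1, 0.3]])
gt = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.2]])
print(f"Average L2 error: {average_l2_error(pred, gt):.3f} m")
```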
Results and Analysis
Results Evaluation
The V2X-VLM framework achieves the lowest average L2 error of 1.40 meters, significantly outperforming all baseline methods. Its transmission cost, while higher than that of some baselines, is justified by the markedly improved accuracy and safety outcomes.
Results Visualization
Figure 4 visualizes the planned trajectories from the V2X-VLM framework, demonstrating its ability to produce high-quality trajectory outputs across various driving scenarios.
Overall Conclusion
This study presents V2X-VLM, an innovative framework that advances the field of VICAD by leveraging the capabilities of large VLMs. V2X-VLM excels in integrating and processing multimodal data, resulting in precise and efficient trajectory planning. Future research will focus on diversifying the model’s output, optimizing data transmission efficiency, and training on more diverse datasets to improve the system’s adaptability and robustness.
Author Contributions
Junwei You, Haotian Shi, Zhuoyu Jiang, Zilin Huang, Rui Gan, Keshu Wu, Xi Cheng, Xiaopeng Li, and Bin Ran collectively developed the research concept and methodology. Junwei You, Haotian Shi, Zhuoyu Jiang, Zilin Huang, Rui Gan, Keshu Wu, and Xi Cheng contributed to the implementation and experiments. Xiaopeng Li and Bin Ran provided overall guidance and supervision. All authors reviewed and approved the final manuscript.