RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving

About

Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite these advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting their overall capabilities and ability to generalize. To bridge these gaps, we propose RoboTron-Drive, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD datasets to finetune the model, resulting in an all-in-one LMM for autonomous driving. To assess its general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on three unseen datasets, where RoboTron-Drive achieves state-of-the-art performance across all tasks. We hope RoboTron-Drive can serve as a promising solution for AD in the real world. Project page with code: https://github.com/zhijian11/RoboTron-Drive.

Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, Lin Ma • 2024

Related benchmarks

Task	Dataset	Result
Visual Question Answering	InfoVQA	Accuracy 42.6	264
Open-loop planning	nuScenes	L2 Error (Avg) 0.33	130
Open-loop planning	NuScenes v1.0 (test)	L2 Error (1s) 0.14	84
Visual Question Answering	TallyQA	Accuracy 63.4	49
Autonomous Driving Reasoning	DriveLMM-o1	--	42
Temporal Autonomous Driving Understanding	TAD 1.0 (test)	EA Action Recognition 43.63	32
Autonomous driving reasoning (cross-view risk object perception, action prediction, and planning)	DriveLM	Accuracy 81	25
Autonomous Driving (Perception, Prediction & Planning)	MME RealWorld	Overall Score (P+P+P) 41.3	25
End-to-end Planning	nuScenes (open-loop)	L2 Error (1s) 0.14	24
Structured Occlusion Reasoning	nuScenes PKL-guided v1.0 (val)	Agent Class Accuracy 0.4	18

Showing 10 of 54 rows
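
Several rows above report an L2 error for open-loop planning at a fixed horizon (1s) or averaged over horizons. As a rough illustration of what such a number measures, the sketch below computes the Euclidean distance between a predicted and a ground-truth waypoint at each horizon and then averages them. It is not the evaluation code behind the table: the function name `l2_errors`, the 2 Hz waypoint sampling, and the 1s/2s/3s horizons are assumptions made for this example, and published evaluations may differ in detail (for instance, by averaging the error over every step up to the horizon rather than taking the error at the horizon itself).

```python
import math


def l2_errors(pred, gt, hz=2, horizons_s=(1, 2, 3)):
    """Displacement error (metres) at each horizon, plus their mean.

    pred, gt: equal-length lists of (x, y) waypoints in metres, sampled at
    `hz` waypoints per second, the first one lying 1/hz seconds ahead.
    All defaults are assumptions for this sketch, not a benchmark spec.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of waypoints")

    errors = {}
    for t in horizons_s:
        idx = t * hz - 1  # index of the waypoint that lies t seconds ahead
        if idx >= len(pred):
            raise ValueError(f"trajectory too short for a {t}s horizon")
        (px, py), (gx, gy) = pred[idx], gt[idx]
        errors[f"{t}s"] = math.hypot(px - gx, py - gy)

    # "avg" is the mean of the per-horizon errors computed above.
    errors["avg"] = sum(errors[f"{t}s"] for t in horizons_s) / len(horizons_s)
    return errors


if __name__ == "__main__":
    # Toy data: ground truth drives straight ahead; the prediction is
    # shifted by 0.3 m laterally and 0.4 m longitudinally at every step.
    gt = [(0.0, float(i)) for i in range(1, 7)]
    pred = [(x + 0.3, y + 0.4) for x, y in gt]
    print(l2_errors(pred, gt))
```

Lower values are better: an L2 error of 0.14 at 1s means the predicted position one second ahead is, on average over the evaluation set, about 0.14 m from the recorded one.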
