Doe-1: Closed-Loop Autonomous Driving with Large World Model

About

End-to-end autonomous driving has received increasing attention due to its potential to learn from large amounts of data. However, most existing methods are still open-loop and suffer from weak scalability, lack of high-order interactions, and inefficient decision-making. In this paper, we explore a closed-loop framework for autonomous driving and propose a large Driving wOrld modEl (Doe-1) for unified perception, prediction, and planning. We formulate autonomous driving as a next-token generation problem and use multi-modal tokens to accomplish different tasks. Specifically, we use free-form texts (i.e., scene descriptions) for perception and generate future predictions directly in the RGB space with image tokens. For planning, we employ a position-aware tokenizer to effectively encode action into discrete tokens. We train a multi-modal transformer to autoregressively generate perception, prediction, and planning tokens in an end-to-end and unified manner. Experiments on the widely used nuScenes dataset demonstrate the effectiveness of Doe-1 in various tasks including visual question-answering, action-conditioned video generation, and motion planning. Code: https://github.com/wzzheng/Doe.
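The abstract does not spell out how the position-aware tokenizer works, so the following is only a minimal sketch of the general idea of turning a continuous action into discrete tokens. It uses plain uniform binning of a planar displacement, which is a stand-in technique and not necessarily the paper's method. The names `NUM_BINS`, `RANGE_M`, `encode`, `decode`, and `tokenize_action`, as well as the bin count and range, are all illustrative assumptions.

```python
from typing import Tuple

# Illustrative assumptions, not values taken from the paper.
NUM_BINS = 256            # size of the hypothetical action vocabulary per axis
RANGE_M = (-10.0, 10.0)   # assumed displacement range in metres


def encode(value: float) -> int:
    """Map a continuous displacement to a bin index in [0, NUM_BINS - 1]."""
    lo, hi = RANGE_M
    clipped = min(max(value, lo), hi)
    idx = int((clipped - lo) / (hi - lo) * NUM_BINS)
    # The upper boundary would land on NUM_BINS, so clamp it into the last bin.
    return min(idx, NUM_BINS - 1)


def decode(token: int) -> float:
    """Map a bin index back to the centre of its bin."""
    lo, hi = RANGE_M
    width = (hi - lo) / NUM_BINS
    return lo + (token + 0.5) * width


def tokenize_action(dx: float, dy: float) -> Tuple[int, int]:
    """Encode a planar displacement (dx, dy) as two discrete tokens."""
    return encode(dx), encode(dy)


if __name__ == "__main__":
    tokens = tokenize_action(1.234, -0.5)
    print(tokens, [decode(t) for t in tokens])
```

With uniform bins, the round-trip error of any in-range value is bounded by half a bin width, so the bin count trades vocabulary size against planning precision.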

Wenzhao Zheng, Zetian Xia, Yuanhui Huang, Sicheng Zuo, Jie Zhou, Jiwen Lu • 2024

Related benchmarks

Task	Dataset	Result
Open-loop planning	nuScenes	L2 Error (Avg): 1.26	121
Trajectory Planning	nuScenes	L2 Error (m) (1s): 0.37	58
Open-loop planning	nuScenes v1.0 (test)	L2 Error (1s): 0.5	50
Motion Planning	nuScenes v1.0 (val)	L2 Error (3s): 2.11	29
End-to-end Motion Planning	nuScenes	L2 Displacement Error (1s): 0.5	22
Frame prediction	nuScenes	FID: 15.9	16
Motion Planning	nuScenes v1.0 (test)	L2 Error (1s): 0.5	9
Future frames generation	Bench2Drive (test)	FID: 18.6	8
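For readers unfamiliar with the planning metric in the table, here is a minimal sketch of how an L2 displacement error at a time horizon can be computed from predicted and ground-truth waypoints. It assumes 2D waypoints sampled at 2 Hz starting 0.5 s into the future. The function names and that sampling rate are assumptions made for illustration, and evaluation protocols can differ on whether the error is taken at the horizon or averaged over all steps up to it, so both variants are shown.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def l2_at_horizon(pred: Sequence[Point], gt: Sequence[Point],
                  horizon_s: float, hz: float = 2.0) -> float:
    """Euclidean error at the single waypoint located at horizon_s seconds."""
    idx = int(round(horizon_s * hz)) - 1  # waypoint 0 is at 1 / hz seconds
    (px, py), (gx, gy) = pred[idx], gt[idx]
    return math.hypot(px - gx, py - gy)


def l2_avg_up_to(pred: Sequence[Point], gt: Sequence[Point],
                 horizon_s: float, hz: float = 2.0) -> float:
    """Mean Euclidean error over all waypoints up to and including horizon_s."""
    n = int(round(horizon_s * hz))
    errors = [math.hypot(p[0] - g[0], p[1] - g[1])
              for p, g in zip(pred[:n], gt[:n])]
    return sum(errors) / n


if __name__ == "__main__":
    # Toy trajectories: the prediction drifts sideways by 0.1 m per step.
    gt = [(0.5 * i, 0.0) for i in range(1, 7)]
    pred = [(0.5 * i, 0.1 * i) for i in range(1, 7)]
    print(l2_at_horizon(pred, gt, 1.0), l2_avg_up_to(pred, gt, 1.0))
```

The two variants give different numbers for the same trajectories, which is one reason rows from different benchmarks in the table are not directly comparable.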

