LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents
About
LangDriveCTRL is a natural-language-controllable framework for editing real-world driving videos to synthesize diverse traffic scenarios. It represents each video as an explicit 3D scene graph, decomposing the scene into a static background and dynamic object nodes.

To enable fine-grained editing and realism, it introduces a feedback-driven agentic pipeline. An Orchestrator converts user instructions into executable graphs that coordinate specialized multi-modal agents and tools. An Object Grounding Agent aligns free-form text with target object nodes in the scene graph; a Behavior Editing Agent generates multi-object trajectories from language instructions; and a Behavior Reviewer Agent iteratively reviews and refines the generated trajectories. The edited scene graph is rendered and harmonized using a video diffusion tool, then further refined by a Video Reviewer Agent to ensure photorealism and appearance alignment.

LangDriveCTRL supports both object node editing (removal, insertion, and replacement) and multi-object behavior editing from natural-language instructions. Quantitatively, it achieves nearly $2\times$ higher instruction alignment than the previous SoTA, with superior photorealism, structural preservation, and traffic realism. The project page is available at: https://yunhe24.github.io/langdrivectrl/.
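The object node editing operations described above (removal, insertion, and replacement on a scene graph of a static background plus dynamic object nodes) can be sketched as a small data structure. This is a minimal illustrative sketch, not the paper's actual API: the names `SceneGraph` and `ObjectNode`, and the per-frame waypoint format, are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    """A dynamic object node (hypothetical schema)."""
    node_id: str
    category: str          # e.g. "car", "pedestrian"
    trajectory: list       # assumed: per-frame (x, y, heading) waypoints

@dataclass
class SceneGraph:
    """Static background plus dynamic object nodes."""
    background: str                          # handle to the static background
    objects: dict = field(default_factory=dict)

    # Object node editing: insertion, removal, replacement.
    def insert(self, node: ObjectNode) -> None:
        self.objects[node.node_id] = node

    def remove(self, node_id: str) -> ObjectNode:
        return self.objects.pop(node_id)

    def replace(self, node_id: str, new_node: ObjectNode) -> None:
        self.remove(node_id)
        self.insert(new_node)

# Minimal usage: replace a grounded target node with a new object,
# e.g. after an Object Grounding Agent has resolved "the car ahead"
# to node "veh_0" (identifiers here are made up for illustration).
scene = SceneGraph(background="static_bg_handle")
scene.insert(ObjectNode("veh_0", "car", [(0.0, 0.0, 0.0)]))
scene.replace("veh_0", ObjectNode("veh_0", "truck", [(0.0, 0.0, 0.0)]))
print(scene.objects["veh_0"].category)  # -> truck
```

In the full pipeline, a Behavior Editing Agent would rewrite the `trajectory` field of the affected nodes before the edited graph is rendered and harmonized.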
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Instruction-based Driving Video Editing | Waymo Open Dataset (test) | FID 32.85 | 3 |
| Object Grounding | Custom 5-scene, 50-query set (test) | Accuracy 84 | 3 |