
VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding

About

Current Video Large Language Models (Video LLMs) typically encode frames via a vision encoder and employ an autoregressive (AR) LLM for understanding and generation. However, this AR paradigm inevitably faces a dual efficiency bottleneck: strictly unidirectional attention compromises understanding efficiency by hindering global spatiotemporal aggregation, while serial decoding restricts generation efficiency. To address this, we propose VidLaDA, a Video LLM based on Diffusion Language Models (DLMs) that leverages bidirectional attention to unlock comprehensive spatiotemporal modeling and decode tokens in parallel. To further mitigate the computational overhead of diffusion decoding, we introduce MARS-Cache, an acceleration strategy that prunes redundancy by combining asynchronous visual cache refreshing with frame-wise chunk attention. Experiments show VidLaDA rivals state-of-the-art AR baselines (e.g., Qwen2.5-VL and LLaVA-Video) and outperforms DLM baselines, with MARS-Cache delivering over 12x speedup without compromising accuracy. Code and checkpoints are open-sourced at https://github.com/ziHoHe/VidLaDA.
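To make the efficiency contrast concrete, the sketch below illustrates the general idea of diffusion-style parallel decoding that the abstract describes: all positions start masked, a bidirectional model scores every masked position at once, and the most confident predictions are committed in parallel each step. This is a minimal toy illustration, not the authors' implementation; the `toy_model` scoring function and the top-k commit rule are assumptions for demonstration only.

```python
# Toy sketch of parallel diffusion-style decoding (illustrative only, not
# the VidLaDA implementation). An AR decoder would need one step per token;
# a diffusion decoder commits several tokens per step.
MASK = -1

def toy_model(seq):
    """Hypothetical stand-in for a bidirectional DLM: for each masked
    position i, predict token i with a confidence that decays with i."""
    preds = {}
    for i, tok in enumerate(seq):
        if tok == MASK:
            preds[i] = (i, 1.0 / (1 + i))  # (predicted token, confidence)
    return preds

def diffusion_decode(length, tokens_per_step=2):
    """Iteratively unmask the sequence, committing the top-k most
    confident predictions in parallel at each step."""
    seq = [MASK] * length
    steps = 0
    while MASK in seq:
        preds = toy_model(seq)
        ranked = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _conf) in ranked[:tokens_per_step]:
            seq[i] = tok
        steps += 1
    return seq, steps

seq, steps = diffusion_decode(6, tokens_per_step=2)
print(seq, steps)  # 6 tokens decoded in 3 parallel steps, vs 6 AR steps
```

With `tokens_per_step > 1`, the number of decoding steps shrinks proportionally; MARS-Cache attacks the remaining cost per step by reusing visual key/value state across steps rather than recomputing it.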

Zhihao He, Tieyuan Chen, Kangyu Wang, Ziran Qin, Yang Shao, Chaofan Gan, Shijie Li, Zuxuan Wu, Weiyao Lin • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Video Understanding | MVBench | Accuracy | 59.4 | 247
Video Understanding | VideoMME | -- | -- | 192
Long-form Video Understanding | LongVideoBench | Accuracy | 61.4 | 82
Long Video Understanding | MLVU (test) | Average Score | 53.4 | 41
Egocentric Video Understanding | EgoSchema | -- | -- | 39
Long Video Understanding | MLVU (dev) | -- | -- | 31
Video Understanding | LVBench | Overall Accuracy | 44.7 | 23
Video Understanding | Video-MMMU | Accuracy | 46.6 | 23
