
VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding

About

Current Video Large Language Models (Video LLMs) typically encode frames via a vision encoder and employ an autoregressive (AR) LLM for understanding and generation. However, this AR paradigm inevitably faces a dual efficiency bottleneck: strictly unidirectional attention compromises understanding efficiency by hindering global spatiotemporal aggregation, while serial decoding restricts generation efficiency. To address this, we propose VidLaDA, a Video LLM based on Diffusion Language Models (DLMs) that leverages bidirectional attention to unlock comprehensive spatiotemporal modeling and decode tokens in parallel. To further mitigate the computational overhead of diffusion decoding, we introduce MARS-Cache, an acceleration strategy that prunes redundancy by combining asynchronous visual cache refreshing with frame-wise chunk attention. Experiments show VidLaDA rivals state-of-the-art AR baselines (e.g., Qwen2.5-VL and LLaVA-Video) and outperforms DLM baselines, with MARS-Cache delivering over 12x speedup without compromising accuracy. Code and checkpoints are open-sourced at https://github.com/ziHoHe/VidLaDA.
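To make the efficiency contrast concrete, the sketch below illustrates the general idea of diffusion-style parallel decoding that the abstract describes: all positions start masked, a bidirectional model scores every masked position at once, and the most confident predictions are committed in parallel each step. This is a minimal toy illustration, not the authors' implementation; the `toy_model` scoring function and the top-k commit rule are assumptions for demonstration only.

```python
# Toy sketch of parallel diffusion-style decoding (illustrative only, not
# the VidLaDA implementation). An AR decoder would need one step per token;
# a diffusion decoder commits several tokens per step.
MASK = -1

def toy_model(seq):
    """Hypothetical stand-in for a bidirectional DLM: for each masked
    position i, predict token i with a confidence that decays with i."""
    preds = {}
    for i, tok in enumerate(seq):
        if tok == MASK:
            preds[i] = (i, 1.0 / (1 + i))  # (predicted token, confidence)
    return preds

def diffusion_decode(length, tokens_per_step=2):
    """Iteratively unmask the sequence, committing the top-k most
    confident predictions in parallel at each step."""
    seq = [MASK] * length
    steps = 0
    while MASK in seq:
        preds = toy_model(seq)
        ranked = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _conf) in ranked[:tokens_per_step]:
            seq[i] = tok
        steps += 1
    return seq, steps

seq, steps = diffusion_decode(6, tokens_per_step=2)
print(seq, steps)  # 6 tokens decoded in 3 parallel steps, vs 6 AR steps
```

With `tokens_per_step > 1`, the number of decoding steps shrinks proportionally; MARS-Cache attacks the remaining cost per step by reusing visual key/value state across steps rather than recomputing it.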

Zhihao He, Tieyuan Chen, Kangyu Wang, Ziran Qin, Yang Shao, Chaofan Gan, Shijie Li, Zuxuan Wu, Weiyao Lin • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Video Understanding | MVBench | Accuracy | 59.4 | 247
Video Understanding | VideoMME | -- | -- | 192
Long-form Video Understanding | LongVideoBench | Accuracy | 61.4 | 82
Long Video Understanding | MLVU (test) | Average Score | 53.4 | 41
Egocentric Video Understanding | EgoSchema | -- | -- | 39
Long Video Understanding | MLVU (dev) | -- | -- | 31
Video Understanding | LVBench | Overall Accuracy | 44.7 | 23
Video Understanding | Video-MMMU | Accuracy | 46.6 | 23
