PAVE: Patching and Adapting Video Large Language Models

About

Pre-trained video large language models (Video LLMs) exhibit remarkable reasoning capabilities, yet adapting these models to new tasks involving additional modalities or data types (e.g., audio or 3D information) remains challenging. In this paper, we present PAVE, a flexible framework for adapting pre-trained Video LLMs to downstream tasks with side-channel signals, such as audio, 3D cues, or multi-view videos. PAVE introduces lightweight adapters, referred to as "patches," which add a small number of parameters and operations to a base model without modifying its architecture or pre-trained weights. In doing so, PAVE can effectively adapt the pre-trained base model to support diverse downstream tasks, including audio-visual question answering, 3D reasoning, multi-view video recognition, and high frame rate video understanding. Across these tasks, PAVE significantly enhances the performance of the base model, surpassing state-of-the-art task-specific models while incurring a minor cost of ~0.1% additional FLOPs and parameters. Further, PAVE supports multi-task learning and generalizes well across different Video LLMs. Our code is available at https://github.com/dragonlzm/PAVE.
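
The general idea of a "patch" (a small trainable add-on next to weights that stay frozen) can be illustrated with a toy sketch. This is not PAVE's architecture or code. Every name here (base_model, patched_model, overhead_percent) and every number is hypothetical, and the "model" is only a dot product, chosen to show that the base output is left unchanged when the patch contributes nothing and that the added parameter count can be expressed as a percentage of the base.

```python
from typing import List

Vector = List[float]


def base_model(x: Vector, frozen_w: Vector) -> float:
    """Toy stand-in for a frozen pre-trained model: a fixed dot product."""
    return sum(xi * wi for xi, wi in zip(x, frozen_w))


def patched_model(x: Vector, side: Vector, frozen_w: Vector, patch_w: Vector) -> float:
    """Base output plus a small correction computed from a side-channel signal.

    Only patch_w would be trained; frozen_w is read but never modified.
    (Assumption for illustration: the patch is a simple additive term.)
    """
    correction = sum(si * pi for si, pi in zip(side, patch_w))
    return base_model(x, frozen_w) + correction


def overhead_percent(base_params: int, patch_params: int) -> float:
    """Extra parameters introduced by the patch, as a percentage of the base."""
    return 100 * patch_params / base_params


# Hypothetical values, for illustration only.
frozen_w = [0.5, -1.0, 2.0]   # frozen base weights
x = [1.0, 2.0, 3.0]           # main (video) input features
side = [4.0]                  # side-channel feature, e.g. derived from audio

unpatched = base_model(x, frozen_w)
with_zero_patch = patched_model(x, side, frozen_w, [0.0])
with_patch = patched_model(x, side, frozen_w, [0.25])

# Hypothetical model sizes: 7 billion base parameters, 7 million patch parameters.
extra = overhead_percent(7_000_000_000, 7_000_000)
```

With a zero-valued patch the patched output equals the base output, so the add-on can only change behavior through its own parameters. The hypothetical sizes above were picked so that the overhead lands at the ~0.1% scale quoted in the abstract; they are not figures from the paper.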

Zhuoming Liu, Yiquan Li, Khoi Duc Nguyen, Yiwu Zhong, Yin Li • 2025

Related benchmarks

Task | Dataset | Result | Rank
Video Question Answering | ActivityNet-QA | Accuracy 57.1 | 319
Video Question Answering | NEXT-QA | Overall Accuracy 79.6 | 105
Video Question Answering | VideoMME | -- | 99
Video Question Answering | EgoSchema | Accuracy 57.4 | 88
Video Question Answering | MLVU | Accuracy 67 | 53
Video Question Answering | PerceptionTest | Accuracy 56 | 31
3D Question Answering | ScanQA v1.0 (test) | ROUGE 49 | 26
Video Question Answering | VideoMME with subtitles | Acc (Overall) 62.9 | 15
3D Question Answering | SQA3D v1.0 (test) | EM@1 59 | 8
Multi-view video understanding | Ego-Exo4D Demonstrator Proficiency | Accuracy 44.2 | 7
Showing 10 of 14 rows
