
AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding

About

Multimodal Large Language Models (MLLMs) have revolutionized video understanding, yet are still limited by context length when processing long videos. Recent methods compress videos by leveraging visual redundancy uniformly, yielding promising results. Nevertheless, our quantitative analysis shows that redundancy varies significantly across time and model layers, necessitating a more flexible compression strategy. We propose AdaReTaKe, a training-free method that flexibly reduces visual redundancy by allocating compression ratios among time and layers with theoretical guarantees. Integrated into state-of-the-art MLLMs, AdaReTaKe improves processing capacity from 256 to 2048 frames while preserving critical information. Experiments on VideoMME, MLVU, LongVideoBench, and LVBench datasets demonstrate that AdaReTaKe outperforms existing methods by 2.3% and 2.8% for 7B and 72B models, respectively, with even greater improvements of 5.9% and 6.0% on the longest LVBench. Our code is available at https://github.com/SCZwangxiao/video-FlexReduc.git.
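The core idea of adaptive allocation can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it is a minimal, hypothetical example of the general principle the abstract describes: given per-segment redundancy estimates, assign higher compression (lower token-keep ratio) to more redundant segments while respecting an overall token budget. The function name, the inverse-redundancy weighting, and the `total_keep_ratio` parameter are all illustrative assumptions.

```python
# Hypothetical sketch of budget-constrained adaptive compression allocation.
# NOT the AdaReTaKe algorithm itself; just the budgeting principle it builds on.

def allocate_keep_ratios(redundancy, total_keep_ratio):
    """Assign a token-keep ratio to each video segment.

    redundancy: list of scores in [0, 1], higher = more redundant.
    total_keep_ratio: average fraction of tokens to keep across all segments.

    More redundant segments receive lower keep ratios (heavier compression),
    while the mean keep ratio over segments equals total_keep_ratio.
    A real implementation would also clip each ratio to [0, 1].
    """
    informativeness = [1.0 - r for r in redundancy]  # inverse redundancy
    total = sum(informativeness)
    n = len(redundancy)
    # Distribute the keep budget proportionally to informativeness.
    return [total_keep_ratio * n * w / total for w in informativeness]

# Example: three segments, the first highly redundant, the last barely so.
ratios = allocate_keep_ratios([0.9, 0.5, 0.1], total_keep_ratio=0.25)
```

With these inputs the redundant first segment keeps the fewest tokens and the informative last segment keeps the most, while the average keep ratio stays at the 0.25 budget.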

Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, Liqiang Nie • 2025

Related benchmarks

Task                        Dataset               Result               Rank
Long Video Understanding    LongVideoBench        Score 67             248
Video Understanding         VideoMME              Overall Score 73.5   222
Long Video Understanding    LongVideoBench (val)  Accuracy 67          210
Video Question Answering    LongVideoBench        --                   180
Long Video Understanding    MLVU                  Score 78.1           154
Long Video Understanding    LVBench               Accuracy 53.3        133
Video Question Answering    LVBench               Accuracy 53.3        108
Long Video Understanding    MLVU (dev)            Score 78.1           63
Long Video Understanding    MLVU (test)           --                   60
Video Question Answering    LongVideoBench (val)  Accuracy 67          55
Showing 10 of 24 rows

Other info

Code: https://github.com/SCZwangxiao/video-FlexReduc.git