HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models

About

Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long contexts, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Our code is available at https://github.com/hrlics/HoPE.
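For readers unfamiliar with the background, the following is a minimal sketch of vanilla RoPE and of the general idea of allocating RoPE frequencies across temporal and spatial coordinates (t, x, y). It is not HoPE's implementation: the function names, the `chunked_allocation` helper, and the contiguous-thirds split are hypothetical choices made for illustration only. The authors' actual method is in the linked repository.

```python
import math

def rope_frequencies(dim, base=10000.0):
    """Standard RoPE frequencies theta_i = base^(-2i/dim), one per 2-D pair."""
    assert dim % 2 == 0, "RoPE rotates pairs, so dim must be even"
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def rotate(vec, angles):
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by angles[i]."""
    out = []
    for i, a in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(a), math.sin(a)
        out.extend([x * c - y * s, x * s + y * c])
    return out

def rope_1d(vec, pos, freqs):
    """Vanilla RoPE: every pair is driven by the same scalar position."""
    return rotate(vec, [pos * f for f in freqs])

def chunked_allocation(num_pairs):
    """Hypothetical allocation: contiguous thirds of the pairs go to t, x, y.
    This is an illustrative heuristic, not the allocation proposed in the paper."""
    return [3 * i // num_pairs for i in range(num_pairs)]

def rope_3d(vec, pos_txy, freqs, axis_of_pair):
    """Multimodal RoPE sketch: pair i is driven by coordinate axis_of_pair[i]
    of pos_txy = (t, x, y), where 0 = t, 1 = x, 2 = y."""
    return rotate(vec, [pos_txy[axis_of_pair[i]] * f for i, f in enumerate(freqs)])

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

if __name__ == "__main__":
    dim = 12
    freqs = rope_frequencies(dim)
    alloc = chunked_allocation(dim // 2)
    q = [0.1 * (i + 1) for i in range(dim)]
    k = [0.05 * (dim - i) for i in range(dim)]

    # The attention score depends only on the relative offset between positions.
    s1 = dot(rope_3d(q, (4, 2, 1), freqs, alloc), rope_3d(k, (1, 0, 0), freqs, alloc))
    s2 = dot(rope_3d(q, (9, 7, 6), freqs, alloc), rope_3d(k, (6, 5, 5), freqs, alloc))
    print(abs(s1 - s2) < 1e-9)
```

Which frequencies are assigned to which coordinate is exactly the design choice the abstract refers to as an allocation strategy. The sketch only shows where that choice enters the computation, through `axis_of_pair`.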

Haoran Li, Yingjie Qin, Baoyuan Ou, Lai Xu, Ruiwen Xu • 2025

Related benchmarks

Task                                  Dataset             Result                 Rank
Video Understanding                   MVBench             --                     425
Chart Question Answering              ChartQA             --                     356
Document Visual Question Answering    DocVQA              --                     263
Video Understanding                   VideoMME            --                     248
Optical Character Recognition         OCRBench            Score 66.6             232
Diagram Question Answering            AI2D                --                     232
Video Understanding                   VideoMME            Overall Score 59.52    222
Video Understanding                   MLVU                Score 61.72            221
Visual Grounding                      RefCOCO+ (val)      Accuracy 69.61         212
Visual Grounding                      RefCOCO+ (testA)    Accuracy 74.55         206
Showing 10 of 52 rows
