Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

About

Rotary Position Embedding (RoPE) is widely adopted in large language models, but when applied to vision-language models (VLMs) it couples text and image position indices and can introduce spurious cross-modal relative-position bias. We propose Per-Token Distance (PTD) to quantify cross-modal positional disentanglement, and prove that PTD = 0 is a sufficient condition to eliminate the geometric attention bias induced by RoPE. Guided by this criterion, we introduce Circle-RoPE, which remaps 2D image-token coordinates onto an annulus orthogonal to the text position axis, yielding a cone-like geometry where each text token is equidistant to all image tokens while preserving intra-image spatial structure. We further propose Alternating Geometry Encoding (AGE) to combine complementary geometric priors by alternating the decoupled geometry of Circle-RoPE and the grid-based prior of standard RoPE across layers. This design enables cross-modal positional disentanglement while preserving fine-grained intra-image spatial structure. Experiments on diverse VLM backbones and multimodal benchmarks show consistent gains in spatial grounding and visual reasoning. The code is available at https://github.com/lose4578/CircleRoPE.
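The cone-like geometry described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes text tokens sit on the x-axis, collapses the annulus to a single circle in the plane x = 0, and places image tokens on that circle in raster order. The names `circle_positions`, `text_positions`, and `distance_spread` are hypothetical, and plain Euclidean distance stands in for whatever distance the paper's PTD metric uses.

```python
import math

def circle_positions(height, width, radius=1.0):
    """Place height*width image tokens on a circle in the plane x = 0.

    Assumption for illustration: tokens are spread evenly by raster index.
    The paper's actual coordinate remapping is not reproduced here.
    """
    n = height * width
    positions = []
    for k in range(n):
        angle = 2.0 * math.pi * k / n
        positions.append((0.0, radius * math.cos(angle), radius * math.sin(angle)))
    return positions

def text_positions(num_tokens, start=1.0):
    """Place text tokens along the x-axis, orthogonal to the image circle."""
    return [(start + t, 0.0, 0.0) for t in range(num_tokens)]

def distance_spread(text_pos, image_pos):
    """Return max minus min distance from one text token to all image tokens."""
    distances = [math.dist(text_pos, p) for p in image_pos]
    return max(distances) - min(distances)

image = circle_positions(4, 4, radius=2.0)
text = text_positions(3)

# Each text token forms the apex of a cone over the image circle,
# so its distance to every image token is sqrt(x^2 + radius^2).
spreads = [distance_spread(t, image) for t in text]
print(all(s < 1e-9 for s in spreads))
```

Because the circle is centered on the text axis and lies in a plane orthogonal to it, the spread is zero up to floating-point error for every text token, which is the equidistance property the abstract describes. With image tokens on a flat 1D index sequence, the same spread would grow with the image size.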

Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han • 2025

Related benchmarks

Task                                Dataset           Result              Rank
Video Understanding                 MVBench           --                  563
Optical Character Recognition       OCRBench          Score 70.6          433
Diagram Question Answering          AI2D              --                  387
Chart Question Answering            ChartQA           --                  371
Video Understanding                 VideoMME          --                  357
Document Visual Question Answering  DocVQA            --                  301
Visual Grounding                    RefCOCO+ (val)    Accuracy 70.19      253
Visual Grounding                    RefCOCO+ (testA)  Accuracy 76.77      245
Visual Perception                   BLINK             --                  241
Video Understanding                 VideoMME          Overall Score 57.7  222
Showing 10 of 57 rows
