Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models
About
Rotary Position Embedding (RoPE) is widely adopted in large language models, but when applied to vision-language models (VLMs) it couples text and image position indices and can introduce spurious cross-modal relative-position bias. We propose Per-Token Distance (PTD) to quantify cross-modal positional disentanglement, and prove that PTD = 0 is a sufficient condition to eliminate the geometric attention bias induced by RoPE. Guided by this criterion, we introduce Circle-RoPE, which remaps 2D image-token coordinates onto an annulus orthogonal to the text position axis, yielding a cone-like geometry where each text token is equidistant to all image tokens while preserving intra-image spatial structure. We further propose Alternating Geometry Encoding (AGE) to combine complementary geometric priors by alternating the decoupled geometry of Circle-RoPE and the grid-based prior of standard RoPE across layers. This design enables cross-modal positional disentanglement while preserving fine-grained intra-image spatial structure. Experiments on diverse VLM backbones and multimodal benchmarks show consistent gains in spatial grounding and visual reasoning. The code is available at https://github.com/lose4578/CircleRoPE.
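The equidistance property described above can be checked with a small geometric sketch. This is not the authors' implementation (see the linked repository for that). It assumes a simplified setup in which text positions lie on the z axis and image tokens are placed at equal angles on a single circle in the orthogonal plane z = 0, rather than on the annulus described in the abstract. The function names `circle_positions`, `text_position`, and `geometry_for_layer`, the radius value, and the even/odd layer schedule are all hypothetical choices made for illustration.

```python
import math

def circle_positions(height, width, radius=1.0):
    """Place H*W image tokens on a circle in the plane z = 0.

    Assumption for illustration: tokens are spread at equal angles in
    row-major order. The real remapping in Circle-RoPE may differ.
    """
    n = height * width
    points = []
    for k in range(n):
        theta = 2.0 * math.pi * k / n
        points.append((radius * math.cos(theta), radius * math.sin(theta), 0.0))
    return points

def text_position(t):
    """Text token with index t sits on the z axis, orthogonal to the image plane."""
    return (0.0, 0.0, float(t))

def distances(text_pt, image_pts):
    """Euclidean distance from one text token to every image token."""
    return [math.dist(text_pt, p) for p in image_pts]

def geometry_for_layer(layer_index):
    """Hypothetical AGE-style schedule: alternate the two geometries by layer parity."""
    return "circle" if layer_index % 2 == 0 else "grid"

image_pts = circle_positions(height=3, width=4, radius=2.0)
d = distances(text_position(5), image_pts)
spread = max(d) - min(d)  # near zero: the text token is equidistant to all image tokens
print(f"distance = {d[0]:.4f}, spread = {spread:.2e}")
print([geometry_for_layer(i) for i in range(4)])
```

In this toy setup every image token lies at distance sqrt(radius^2 + t^2) from the text token at index t, which is the cone-like geometry the abstract refers to: the text token is the apex and the image tokens form the base circle.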
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Understanding | MVBench | -- | 563 |
| Optical Character Recognition | OCRBench | Score 70.6 | 433 |
| Diagram Question Answering | AI2D | -- | 387 |
| Chart Question Answering | ChartQA | -- | 371 |
| Video Understanding | VideoMME | -- | 357 |
| Document Visual Question Answering | DocVQA | -- | 301 |
| Visual Grounding | RefCOCO+ (val) | Accuracy 70.19 | 253 |
| Visual Grounding | RefCOCO+ (testA) | Accuracy 76.77 | 245 |
| Visual Perception | BLINK | -- | 241 |
| Video Understanding | VideoMME | Overall Score 57.7 | 222 |