Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models
About
Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models (LLMs). However, when extended to vision-language models (VLMs), RoPE and its variants enforce relative positional dependencies separately within text and image tokens, introducing unintended cross-modal positional biases. For example, image tokens depicting semantically consistent content are assigned distinct positional encodings solely due to spatial location variations. As a result, such tokens exhibit entirely different relative positional relationships with their corresponding text tokens, ultimately leading to misaligned cross-modal representations. To address this, we propose Per-Token Distance, a simple yet effective metric for quantifying the independence of positional encodings across modalities. Informed by this analysis, we introduce Circle-RoPE, a novel encoding scheme designed to eliminate spurious cross-modal biases. Our key idea is to project image token indices onto a *ring* that is orthogonal to the linear axis of text token indices, thereby forming a cone-like structure in the positional encoding space. In this configuration, each text token (point on the linear text axis) becomes the apex of a cone and maintains an equal distance to all image tokens (points on the circular image *ring*), reducing artificial cross-modal biases while preserving intra-image spatial information. To further enhance performance, we propose a staggered strategy that applies different RoPE variants across layers. Extensive experiments demonstrate that our method effectively preserves spatial information from images while reducing relative positional bias, offering a more robust and flexible positional encoding framework for VLMs. The code is available at https://github.com/lose4578/CircleRoPE.
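The cone-like geometry described above can be sketched numerically. The snippet below is an illustrative toy example, not the paper's implementation: text token positions lie on a linear axis, image token indices are mapped onto a circle in the plane orthogonal to that axis (the ring's `center` and `radius` here are hypothetical free parameters), and each text position then sits at an equal distance from every image position, like a cone's apex relative to its base.

```python
import numpy as np

def text_positions(num_text, dim=3):
    """Text tokens at integer positions along the first (text) axis."""
    pos = np.zeros((num_text, dim))
    pos[:, 0] = np.arange(num_text)
    return pos

def image_ring_positions(num_image, center=0.0, radius=1.0, dim=3):
    """Image token indices spread uniformly on a circle orthogonal to the
    text axis; `center` and `radius` are illustrative parameters."""
    theta = 2 * np.pi * np.arange(num_image) / num_image
    pos = np.zeros((num_image, dim))
    pos[:, 0] = center                  # shared coordinate on the text axis
    pos[:, 1] = radius * np.cos(theta)  # ring in the orthogonal plane
    pos[:, 2] = radius * np.sin(theta)
    return pos

txt = text_positions(4)
img = image_ring_positions(8, center=1.5, radius=2.0)

# Pairwise distances from each text token to every image token.
# Each row is constant: sqrt((t - center)^2 + radius^2) for text index t,
# so no image token is positionally "closer" to a given text token.
d = np.linalg.norm(txt[:, None, :] - img[None, :, :], axis=-1)
print(np.allclose(d, d[:, :1]))  # True: equal distances per text token
```

Intra-image spatial structure survives in this picture because distinct image tokens still occupy distinct angles on the ring; only the text-to-image distances are equalized.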
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Understanding | MVBench | -- | 425 |
| Chart Question Answering | ChartQA | -- | 356 |
| Document Visual Question Answering | DocVQA | -- | 263 |
| Video Understanding | VideoMME | -- | 248 |
| Optical Character Recognition | OCRBench | Score: 70.6 | 232 |
| Diagram Question Answering | AI2D | -- | 232 |
| Video Understanding | VideoMME | Overall Score: 57.7 | 222 |
| Video Understanding | MLVU | Score: 59.4 | 221 |
| Visual Grounding | RefCOCO+ (val) | Accuracy: 70.19 | 212 |
| Visual Grounding | RefCOCO+ (testA) | Accuracy: 76.77 | 206 |