
Efficient Matrix Implementation for Rotary Position Embedding

About

Rotary Position Embedding (RoPE) has become a core component of modern Transformer architectures across language, vision, and 3D domains. However, existing implementations rely on vector-level split and merge operations that introduce non-negligible computational overhead, which is often overlooked in attention optimization. The problem is further amplified in multi-dimensional settings (e.g., 2D and 3D RoPE), where additional vector operations and uneven feature partitions degrade hardware utilization. To overcome these limitations, we propose RoME (Rotary Matrix position Embedding), a mathematically equivalent yet computationally efficient reformulation of RoPE that replaces vector operations with unified matrix transformations. RoME eliminates dimension-specific operations, simplifies implementation, and enables fused parallel execution across Cube and Vector units on modern NPUs. Experiments show that RoME delivers substantial acceleration at both the operator and full-model levels. The implementation is available at https://gitcode.com/cann/ops-transformer/blob/master/experimental/posembedding/rope_matrix/README.md.
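To make the reformulation concrete, the sketch below contrasts the conventional "rotate-half" RoPE, which splits and re-concatenates the feature vector, with a mathematically equivalent single matrix multiply against a precomputed rotation matrix. This is an illustrative NumPy sketch under assumed conventions (rotate-half pairing, one shared position); the helper names and the exact matrix layout are my assumptions, not the paper's implementation, and the real gains come from how such a matmul maps onto NPU Cube units rather than from NumPy itself.

```python
import numpy as np

def rope_rotate_half(x, cos, sin):
    # Conventional RoPE: split the feature axis, negate-and-swap the halves,
    # then merge — the split/concatenate traffic is the overhead RoME removes.
    x1, x2 = np.split(x, 2, axis=-1)
    rotated = np.concatenate([-x2, x1], axis=-1)
    return x * np.concatenate([cos, cos], axis=-1) + rotated * np.concatenate([sin, sin], axis=-1)

def build_rotation_matrix(cos, sin):
    # Hypothetical helper: build the d x d rotation matrix that reproduces
    # rotate-half RoPE as one linear map (layouts vary between codebases).
    half = len(cos)
    d = 2 * half
    R = np.zeros((d, d))
    for i in range(half):
        R[i, i] = cos[i]
        R[i, i + half] = -sin[i]
        R[i + half, i] = sin[i]
        R[i + half, i + half] = cos[i]
    return R

def rope_matrix(x, R):
    # Matrix form: a single matmul, no dimension-specific split/merge.
    return x @ R.T
```

Applying both forms to the same random inputs yields identical outputs (up to floating-point tolerance), which is the "mathematically equivalent" claim in miniature; the matrix form additionally extends to 2D/3D RoPE by composing block-diagonal rotations instead of juggling per-axis feature partitions.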

Minqi Chen, Zhongqi Yue, Shihao Zhang, Yun Xu, Peng Wu, Kaixiang Xu, Zeyi Huang, Hanwang Zhang • 2026

Related benchmarks

Task                                Dataset                            Result              Rank
Text-to-Image Generation            Qwen-Image                         Latency (s): 1.401  25
Text-to-Video Generation Inference  HunyuanVideo, Wan2.2 (T2V models)  Latency (ms): 199   12
Position Embedding Application      Positional Embedding Operator      Latency (ms): 1.6   11
Training                            Wan2.2 1.3B                        Step Time: 1.22     4
Image Editing Inference             FLUX.1-Kontext                     Latency (ms): 978   2
Large Language Model Inference      Llama 3.1                          Latency (ms): 48.1  2
Vision Language Model Inference     InternVL 3                         Latency (ms): 166   2
