Efficient Matrix Implementation for Rotary Position Embedding
About
Rotary Position Embedding (RoPE) has become a core component of modern Transformer architectures across language, vision, and 3D domains. However, existing implementations rely on vector-level split and merge operations that introduce non-negligible computational overhead, which is often overlooked in attention optimization. The problem is further amplified in multi-dimensional settings (e.g., 2D and 3D RoPE), where additional vector operations and uneven feature partitions degrade hardware utilization. To overcome these limitations, we propose RoME (Rotary Matrix position Embedding), a mathematically equivalent yet computationally efficient reformulation of RoPE that replaces vector operations with unified matrix transformations. RoME eliminates dimension-specific operations, simplifies implementation, and enables fused parallel execution across Cube and Vector units on modern NPUs. Experiments show that RoME delivers substantial acceleration at both the operator and full-model levels. The implementation is available at https://gitcode.com/cann/ops-transformer/blob/master/experimental/posembedding/rope_matrix/README.md.
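The equivalence claimed above can be illustrated with a minimal NumPy sketch. This is not the RoME kernel itself (which is a fused NPU implementation); the function names and the half-split RoPE convention below are illustrative assumptions. The point is that the usual split/negate/concatenate formulation of RoPE produces the same result as a single matrix-vector product with a fixed rotation matrix:

```python
import numpy as np

def rope_rotate_half(x, theta):
    # Vector-level RoPE (half-split convention): split the feature vector,
    # negate one half, concatenate, then combine with cos/sin terms.
    h = x.shape[-1] // 2
    cos = np.concatenate([np.cos(theta), np.cos(theta)])
    sin = np.concatenate([np.sin(theta), np.sin(theta)])
    rotated = np.concatenate([-x[h:], x[:h]])
    return x * cos + rotated * sin

def rope_matrix(x, theta):
    # Matrix reformulation: fold the split/merge into one rotation matrix R,
    # so applying RoPE is a single matrix-vector product (no vector ops).
    h = x.shape[-1] // 2
    c, s = np.cos(theta), np.sin(theta)
    idx = np.arange(h)
    R = np.zeros((2 * h, 2 * h))
    R[idx, idx] = c          # top-left block:  cos
    R[idx, idx + h] = -s     # top-right block: -sin
    R[idx + h, idx] = s      # bottom-left:      sin
    R[idx + h, idx + h] = c  # bottom-right:     cos
    return R @ x

# Check the two formulations agree for one position and head dimension.
rng = np.random.default_rng(0)
d, pos = 8, 3
x = rng.standard_normal(d)
theta = pos * (10000.0 ** (-np.arange(d // 2) / (d // 2)))
assert np.allclose(rope_rotate_half(x, theta), rope_matrix(x, theta))
```

In practice the matrix form lets the rotation be executed as a plain matmul on matrix hardware (the Cube unit), while the elementwise formulation is bound to vector units and the intermediate split/concat tensors.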
Related benchmarks
| Task | Dataset / Model | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | Qwen-Image | Latency (s) | 1.401 | 25 |
| Text-to-Video Generation Inference | HunyuanVideo, Wan2.2 (T2V) | Latency (ms) | 199 | 12 |
| Position Embedding Application | Positional Embedding Operator | Latency (ms) | 1.6 | 11 |
| Training | Wan2.2 1.3B | Step Time | 1.22 | 4 |
| Image Editing Inference | FLUX.1-Kontext | Latency (ms) | 978 | 2 |
| Large Language Model Inference | Llama 3.1 | Latency (ms) | 48.1 | 2 |
| Vision Language Model Inference | InternVL 3 | Latency (ms) | 166 | 2 |