Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
About
In this technical report, we present the Ring-linear model series, comprising Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 has 16B total parameters with 957M activated, while Ring-flash-linear-2.0 has 104B total parameters with 6.1B activated. Both models adopt a hybrid architecture that integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference. Compared to a 32-billion-parameter dense model, this series reduces inference cost to 1/10; compared to the original Ring series, the cost is reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging linghe, our self-developed high-performance FP8 operator library, overall training efficiency has been improved by 50%. Benefiting from the close alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.
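To put the model sizes in perspective, the short sketch below computes what fraction of each model's parameters is used per token. It treats the two figures quoted for each model as total and per-token activated parameter counts, which is an interpretation of the abstract rather than something it spells out. The names `MODELS` and `activation_ratio` are illustrative only.

```python
# Parameter counts as quoted in the abstract: (total, activated per token).
# Reading the second figure as "activated parameters" is an assumption.
MODELS = {
    "Ring-mini-linear-2.0": (16e9, 957e6),
    "Ring-flash-linear-2.0": (104e9, 6.1e9),
}


def activation_ratio(total_params: float, active_params: float) -> float:
    """Return the fraction of parameters that are active for one token."""
    return active_params / total_params


ratios = {name: activation_ratio(total, active)
          for name, (total, active) in MODELS.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%} of parameters active per token")
```

Under this reading, both models activate roughly 6% of their parameters per token.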
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematics | GSM8K | GSM8K Score | 72.44 | 87 |
| General Knowledge | MMLU-Pro | MMLU-Pro General Knowledge Score | 38.83 | 55 |
| General Knowledge | CMMLU | Accuracy | 68.41 | 50 |
| Long-context retrieval | RULER 16k | Score | 78.06 | 28 |
| General Knowledge | MMLU | General Score | 64.33 | 25 |
| Mathematics | CMath | Score | 79.09 | 22 |
| Long-context retrieval | RULER 64K context | Accuracy | 73.5 | 19 |
| General Knowledge | CEval | Score | 67.42 | 19 |
| Long-context retrieval | RULER 128k | Score | 67.98 | 12 |
| Long-context retrieval | RULER 4k | Score | 86.39 | 12 |