SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining

About

While FP8 attention has shown substantial promise in work such as FlashAttention-3, integrating it into the decoding phase of the DeepSeek Multi-head Latent Attention (MLA) architecture presents notable challenges: numerical heterogeneity arising from the decoupling of positional embeddings, misalignment of quantization scales in the FP8 PV GEMM, and the need for optimized system-level support. In this paper, we introduce SnapMLA, an FP8 MLA decoding framework that improves long-context efficiency through the following hardware-aware algorithm-kernel co-optimization techniques: (i) RoPE-Aware Per-Token KV Quantization: Motivated by our analysis of the heterogeneous quantization sensitivity inherent to the MLA KV cache, this approach preserves the RoPE part in high precision and quantizes at per-token granularity, which aligns with the autoregressive decoding process and maintains quantization accuracy. (ii) Quantized PV Computation Pipeline Reconstruction: Addresses the misalignment of quantization scales in FP8 PV computation caused by the shared KV structure of MLA. (iii) End-to-End Dataflow Optimization: Establishes an efficient data read-and-write workflow using specialized kernels, ensuring streamlined data flow and improved performance. Extensive experiments on state-of-the-art MLA LLMs show that SnapMLA achieves up to a 1.91x improvement in throughput on long-output decoding workloads while maintaining near-parity benchmark quality compared with the BF16 baseline on the evaluated reasoning and code-generation benchmarks. Code is available at https://github.com/meituan-longcat/SGLang-FluentLLM.
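The first two techniques can be illustrated with a minimal pure-Python sketch. It is not the SnapMLA kernel: `fake_fp8_e4m3` is a hypothetical software simulation of FP8 E4M3 rounding, the vector sizes are toy values, and folding the per-token scale into the attention probabilities before the PV product is an assumption about one way the scale misalignment could be handled, which the abstract does not confirm. The underlying observation is that a per-token scale lies on the reduction axis of the PV product, so it cannot be applied once after the GEMM the way a per-column scale can.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def fake_fp8_e4m3(x):
    """Simulate rounding a float to the nearest FP8 E4M3 value (hypothetical helper)."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    _, e = math.frexp(abs(x))   # abs(x) = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)        # clamp to the smallest normal exponent
    step = 2.0 ** (exp - 3)     # 3 mantissa bits -> 8 steps per binade
    return round(x / step) * step


def quantize_kv_token(latent, rope):
    """Quantize one cached token: the latent part gets its own scale,
    the RoPE part is kept untouched in high precision."""
    amax = max(abs(v) for v in latent)
    scale = amax / FP8_E4M3_MAX if amax > 0.0 else 1.0
    latent_q = [fake_fp8_e4m3(v / scale) for v in latent]
    return latent_q, scale, list(rope)


def pv_with_folded_scales(probs, latents_q, scales):
    """Compute sum_j (p_j * s_j) * latent_q[j] for one query row.

    Assumption: the per-token scale s_j is folded into the probability p_j
    before the product, because it varies along the reduction axis.
    A real FP8 GEMM would also quantize the folded probabilities; that
    step is omitted here.
    """
    folded = [p * s for p, s in zip(probs, scales)]
    dim = len(latents_q[0])
    return [sum(w * row[d] for w, row in zip(folded, latents_q)) for d in range(dim)]


# Two cached tokens whose latent parts differ widely in magnitude (toy data).
latents = [[0.5, -2.0, 0.03, 1.25], [40.0, -8.0, 16.0, 3.0]]
ropes = [[100.0, -50.0], [75.0, 20.0]]
cache = [quantize_kv_token(l, r) for l, r in zip(latents, ropes)]
latents_q = [c[0] for c in cache]
scales = [c[1] for c in cache]
probs = [0.7, 0.3]
out = pv_with_folded_scales(probs, latents_q, scales)
```

Because each token carries its own scale, the small-magnitude token still uses the full FP8 range, and the RoPE components never pass through the quantizer at all.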

Yifan Zhang, Zunhai Su, Shuhao Hu, Rui Yang, Wei Wu, Yulei Qian, Yuchen Xie, Xunliang Cai • 2026

Related benchmarks

Task	Dataset	Metric	Result
Mathematical Reasoning	MATH 500	Mean@1	0.988	55
Mathematical Reasoning	AIME 25	Mean@32	88.44	30
Alignment	IFEval strict prompt	pass@1	87.8	26
Mathematical Reasoning	AIME 24	Avg@32 Accuracy	93.65	23
General QA	MMLU-Redux	Exact Match	90.89	7
Alignment	Arena Hard	Hard Prompt Gemini Score	70.4	4
Coding	LiveCodeBench (LCB) 24.08-25.05	Mean@4	79.74	4
General QA	MMLU-Pro	Accuracy	84.43	4
General Reasoning	GPQA Diamond	Mean@16	82.57	4
General Reasoning	ZebraLogic	Mean@1	96	4

