
QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm

About

The attention operator remains a critical performance bottleneck in large language models (LLMs), particularly in long-context scenarios. While FlashAttention is the most widely used and effective GPU-aware acceleration algorithm, it requires time-consuming, hardware-specific manual implementation, limiting its adaptability across GPU architectures. Existing LLMs show considerable promise on code-generation tasks but struggle to generate high-performance attention code: they cannot comprehend the complex data flow and computation of the attention operator, nor exploit low-level GPU primitives for performance. To address this challenge, we propose an LLM-friendly Thinking Language (LLM-TL) that helps LLMs decouple high-level optimization logic from low-level GPU implementation and deepens their understanding of the attention operator. Combined with a two-stage reasoning workflow, TL-Code generation and translation, LLMs can automatically generate FlashAttention implementations on diverse GPUs, establishing a self-optimizing paradigm for producing high-performance attention operators in attention-centric algorithms. Verified on A100, RTX8000, and T4 GPUs, our method significantly outperforms vanilla LLMs, achieving a speed-up of up to 35.16x. It also surpasses human-optimized libraries (cuDNN and the official library) in most scenarios, extends support to previously unsupported hardware and data types, and reduces development time from months to minutes compared with human experts.
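To make the abstract's subject concrete: the operator FlashAttention accelerates is softmax(QKᵀ/√d)V, computed in tiles with an online softmax so the full N×N score matrix is never materialized. Below is a minimal NumPy sketch of that tiling idea, not the paper's generated GPU code; all names (`flash_attention`, `block`, etc.) are illustrative assumptions.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference: softmax(Q K^T / sqrt(d)) V, materializing all scores."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def flash_attention(Q, K, V, block=16):
    """Tiled attention with online softmax: K/V are streamed in blocks,
    keeping only running row-max and denominator statistics per query."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    m = np.full((N, 1), -np.inf)   # running row max
    l = np.zeros((N, 1))           # running softmax denominator
    for j in range(0, N, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale                          # scores for this tile only
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        alpha = np.exp(m - m_new)                       # rescale old accumulators
        P = np.exp(S - m_new)
        l = alpha * l + P.sum(axis=-1, keepdims=True)
        O = alpha * O + P @ Vj
        m = m_new
    return O / l

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 32)) for _ in range(3))
print(np.allclose(flash_attention(Q, K, V), naive_attention(Q, K, V)))  # True
```

The online rescaling (`alpha`) is mathematically exact, so the tiled result matches the naive computation up to floating-point error; on a GPU this structure is what lets the kernel stay in fast on-chip memory, which is the implementation the paper's LLM-TL workflow generates automatically.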

Qirui Zhou, Shaohui Peng, Weiqiang Xiong, Haixin Chen, Yuanbo Wen, Haochen Li, Ling Li, Qi Guo, Yongwei Zhao, Ke Gao, Ruizhi Chen, Yanjun Wu, Chen Zhao, Yunji Chen • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Attention Operator Throughput | Llama2 7B (32 Q-heads / 32 KV-heads / 128 head-dimension) | Attention TFLOPS: 202.7 | 30 |
| Attention Operator Throughput | Qwen2.5 72B (64 Q-heads / 8 KV-heads / 128 head-dimension) | Attention Throughput (TFLOPS): 205.1 | 29 |
| Attention Operator Throughput | Llama 3.1 405B (128 Q-heads / 8 KV-heads / 128 head-dimension) | TFLOPS: 206.9 | 28 |
| Latency Evaluation | NSA workload, head dimension 128 | Latency (s): 0.67 | 12 |
| Multi-Head Attention (MHA) | MHA, causal mask, head dimension 128, FP8 on NVIDIA L40S GPU (test) | Performance (TFLOPS): 257.9 | 6 |
| Masked Multi-Head Attention | T4 GPU synthetic performance benchmark | Performance (TFLOPS): 19.07 | 5 |
| Multi-Head Attention (MHA) | NVIDIA A100 GPU | TFLOPS: 207.2 | 5 |
| Attention Operator Performance | MLA, head dimension 128, sequence length 512 | TFLOPS: 50.6 | 4 |
| Attention Operator Performance | MLA, head dimension 128, sequence length 1k | TFLOPS: 78.6 | 4 |
| Attention Operator Performance | MLA, head dimension 128, sequence length 2k | TFLOPS: 108.2 | 4 |

Showing 10 of 22 rows.
