CuBridge: An LLM-Based Framework for Understanding and Reconstructing High-Performance Attention Kernels

About

Efficient CUDA implementations of attention mechanisms are critical to modern deep learning systems, yet supporting diverse and evolving attention variants remains challenging. Existing frameworks and compilers trade performance for flexibility, while expert-written kernels achieve high efficiency but are difficult to adapt. Recent work explores large language models (LLMs) for GPU kernel generation, but prior studies report unstable correctness and significant performance gaps for complex operators such as attention. We present CuBridge, an LLM-based framework that adapts expert-written attention kernels through a structured lift-transfer-lower workflow. CuBridge starts from expert-written CUDA attention kernels and lifts them into an executable intermediate representation that makes execution orchestration explicit while abstracting low-level CUDA syntax. Given a user-provided PyTorch specification, CuBridge generates and verifies a target IR program, then reconstructs optimized CUDA code via reference-guided lowering. Across diverse attention variants and GPU platforms, CuBridge consistently produces correct kernels and substantially outperforms general frameworks, compiler-based approaches, and prior LLM-based methods.
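To make the idea of a user-provided specification concrete, here is a minimal sketch of how an attention variant can be written as a mask predicate plus a slow reference implementation. It is an illustration only, not CuBridge's input format: the abstract says the specification is written in PyTorch, whereas this sketch uses plain Python so it runs without dependencies. All function names (`causal_mask`, `prefix_lm_mask`, `sliding_window_mask`, `reference_attention`) are hypothetical.

```python
import math

# Hypothetical mask predicates: return True when query position q_idx
# is allowed to attend to key/value position kv_idx.

def causal_mask(q_idx, kv_idx):
    return kv_idx <= q_idx

def prefix_lm_mask(prefix_len):
    # Every query sees the whole prefix; the rest of the sequence is causal.
    def mask(q_idx, kv_idx):
        return kv_idx < prefix_len or kv_idx <= q_idx
    return mask

def sliding_window_mask(window):
    # Causal attention restricted to the most recent `window` positions.
    def mask(q_idx, kv_idx):
        return kv_idx <= q_idx and q_idx - kv_idx < window
    return mask

def reference_attention(q, k, v, mask):
    """Single-head softmax attention over nested lists.

    q: [Lq][d], k: [Lk][d], v: [Lk][dv]. Assumes every query row has at
    least one unmasked key (true for the masks above when Lq == Lk).
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, kj)) * scale if mask(i, j) else -math.inf
            for j, kj in enumerate(k)
        ]
        row_max = max(scores)  # subtract the row max for numerical stability
        weights = [math.exp(s - row_max) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v)) / total
            for c in range(len(v[0]))
        ])
    return out
```

A reference like this is only useful as a correctness oracle: a generated kernel can be checked against it on small inputs, while the performance comes from the optimized CUDA code.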

Xing Ma, Yangjie Zhou, Wu Sun, Zihan Liu, Jingwen Leng, Yun Lin, Shixuan Sun, Minyi Guo, Jin Song Dong • 2026

Related benchmarks

Task	Dataset	Result
Attention Operator Throughput	Llama 3.1 405B (128 Q-heads/8 KV-heads/128 Head-dimension)	615.4 TFLOPS	62
Attention Operator Throughput	Llama2 7B (32 Q-heads/32 KV-heads/128 Head-dimension)	145.7 TFLOPS	42
Attention Operator Throughput	Qwen2.5 72B (64 Q-heads/8 KV-heads/128 Head-dimension)	--	29
Causal Blockwise Mask Attention	Llama2-7b q=32, k=32 (1k)	35.12 TFLOPS	4
Global Sliding Window Attention	Llama2-7b q=32, k=32 (1k)	67.36 TFLOPS	4
PrefixLM Attention	Llama2-7b (q=32, k=32) (8k)	163.7 TFLOPS	4
PrefixLM Attention	Qwen2.5 72B (q=64, k=8) (1k)	103.6 TFLOPS	4
PrefixLM Attention	Llama3.1 405B (q=128, k=8) (1k)	122.2 TFLOPS	4
Share Question Mask Attention	Llama2-7b q=32, k=32 (1k)	39.81 TFLOPS	4
Relative Pos. Attention	Llama2-7b q=32, k=32 (1k)	105.8 TFLOPS	4

Showing 10 of 13 rows
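
The listing does not say how the TFLOPS figures are computed. As a rough guide to reading them, the sketch below uses a convention common in attention benchmarks: the forward pass is counted as two matrix multiplications (QK^T and PV), giving 4 * batch * heads * seq_len^2 * head_dim floating-point operations, optionally scaled down by the fraction of the score matrix that the mask removes. Whether the results above follow this convention is an assumption, and the function names and the example timing are hypothetical.

```python
def attention_forward_flops(batch, q_heads, seq_len, head_dim, masked_fraction=0.0):
    """Approximate forward-pass FLOPs for attention (assumed convention).

    Two matmuls (Q @ K^T and P @ V), each costing 2 * seq_len^2 * head_dim
    FLOPs per head. The count uses query heads, since grouped-query
    attention shares K/V heads but still computes scores for every Q head.
    masked_fraction is the share of the score matrix skipped by the mask
    (about 0.5 for a causal mask on long sequences).
    """
    dense = 4 * batch * q_heads * seq_len * seq_len * head_dim
    return dense * (1.0 - masked_fraction)

def tflops(flops, seconds):
    """Convert an operation count and a wall-clock time to TFLOPS."""
    return flops / seconds / 1e12

# Hypothetical example: 32 query heads, head dimension 128, 1k tokens,
# with a made-up kernel time of 1 millisecond.
example_flops = attention_forward_flops(batch=1, q_heads=32, seq_len=1024, head_dim=128)
example_tflops = tflops(example_flops, seconds=1e-3)
```

Because masked variants skip part of the score matrix, throughput numbers for different masks are only comparable when they count operations the same way.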
