KernelBench: Can LLMs Write Efficient GPU Kernels?

About

Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce KernelBench, an open-source framework for evaluating LMs' ability to write fast and correct kernels on a suite of 250 carefully selected PyTorch ML workloads. KernelBench represents a real-world engineering environment and making progress on the introduced benchmark directly translates to faster practical kernels. We introduce a new evaluation metric fast_p, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold p over baseline. Our experiments across various state-of-the-art models and test-time methods show that frontier reasoning models perform the best out of the box but still fall short overall, matching the PyTorch baseline in less than 20% of the cases. While we show that results can improve by leveraging execution and profiling feedback during iterative refinement, KernelBench remains a challenging benchmark, with its difficulty increasing as we raise speedup threshold p.
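To make the fast_p definition concrete, here is a minimal sketch of how the metric could be computed from per-task results. It assumes that speedup means baseline runtime divided by generated-kernel runtime. The names `KernelResult`, `baseline_time`, and `kernel_time` are hypothetical and chosen for illustration; they are not taken from the KernelBench codebase.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class KernelResult:
    """Outcome for one generated kernel (hypothetical record layout)."""
    correct: bool         # passed the functional-correctness check
    baseline_time: float  # runtime of the PyTorch baseline
    kernel_time: float    # runtime of the generated kernel, same units


def fast_p(results: Sequence[KernelResult], p: float) -> float:
    """Fraction of kernels that are correct AND have speedup > p.

    Speedup is assumed to be baseline_time / kernel_time.
    Multiply the return value by 100 to express it as a percentage.
    """
    if not results:
        raise ValueError("results must not be empty")
    hits = sum(
        1
        for r in results
        if r.correct and (r.baseline_time / r.kernel_time) > p
    )
    return hits / len(results)


# Illustrative, made-up timings (not benchmark data).
results = [
    KernelResult(correct=True, baseline_time=2.0, kernel_time=1.0),   # 2.0x
    KernelResult(correct=True, baseline_time=1.0, kernel_time=2.0),   # 0.5x
    KernelResult(correct=False, baseline_time=4.0, kernel_time=1.0),  # fast but wrong
    KernelResult(correct=True, baseline_time=1.5, kernel_time=1.0),   # 1.5x
]

fast_0 = fast_p(results, 0.0)  # correctness only
fast_1 = fast_p(results, 1.0)  # correct and faster than baseline
```

Under this reading, p = 0 reduces the metric to a plain correctness rate, and p = 1 counts only kernels that are correct and strictly faster than the baseline. Raising p can only keep the score the same or lower it, which is consistent with the benchmark getting harder at higher thresholds.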

Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, Azalia Mirhoseini • 2025

Related benchmarks

Task	Dataset	Result
Kernel Optimization	KernelBench 1.0 (test)	Latency (us): 0.683	27
Batched Cumsum	KernelBench LLM-augmented shapes	S1 Time (ms): 0.063	5
CUDA Kernel Generation	KernelBench Level 1 (single-op)	fast1 Performance (Native): 43	5
CUDA Kernel Generation	KernelBench fusion Level 2	Fast1 (Natural): 72	5

