
AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

About

We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from $49\%$ to $61\%$ on Trainium 1 and from $45\%$ to $59\%$ on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being $26\times$ cheaper. The code is open-sourced at https://github.com/zhang677/AccelOpt.

Genghan Zhang, Shaowei Zhu, Anjiang Wei, Zhenyu Song, Allen Nie, Zhen Jia, Nandita Vijaykumar, Yida Wang, Kunle Olukotun • 2025
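
The abstract describes iterative generation guided by an optimization memory of slow-fast kernel pairs. The toy sketch below illustrates that general idea only; it is not AccelOpt's implementation. Every name in it (`Candidate`, `OptimizationMemory`, `optimize`, `toy_propose`) is hypothetical, the curation rule (keep the pairs with the largest speedup) is an assumption, and `toy_propose` is a stand-in for what would in practice be an LLM call followed by profiling on the accelerator.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    source: str        # kernel source text (a plain string in this toy)
    latency_ms: float  # measured latency; lower is better


@dataclass
class OptimizationMemory:
    """Holds slow-fast pairs seen so far (hypothetical design, not from the paper)."""
    capacity: int = 4
    pairs: List[Tuple[Candidate, Candidate]] = field(default_factory=list)

    def add(self, slow: Candidate, fast: Candidate) -> None:
        if fast.latency_ms >= slow.latency_ms:
            return  # not an improvement, so there is nothing to record
        self.pairs.append((slow, fast))
        # Assumed curation rule: keep only the pairs with the largest speedup.
        self.pairs.sort(key=lambda p: p[0].latency_ms / p[1].latency_ms, reverse=True)
        del self.pairs[self.capacity:]


Proposer = Callable[[Candidate, OptimizationMemory], Candidate]


def optimize(baseline: Candidate, propose: Proposer,
             iterations: int, memory: OptimizationMemory) -> Candidate:
    """Repeatedly propose a new candidate, record the pair, and keep the fastest."""
    best = baseline
    for _ in range(iterations):
        candidate = propose(best, memory)
        memory.add(best, candidate)
        if candidate.latency_ms < best.latency_ms:
            best = candidate
    return best


def toy_propose(current: Candidate, memory: OptimizationMemory) -> Candidate:
    # Stand-in for "LLM rewrites the kernel, then the kernel is profiled".
    # The toy pretends that having remembered pairs yields better proposals.
    factor = 0.9 if not memory.pairs else 0.8
    return Candidate(current.source + "+opt", current.latency_ms * factor)


if __name__ == "__main__":
    memory = OptimizationMemory()
    result = optimize(Candidate("kernel", 1.0), toy_propose, iterations=3, memory=memory)
    print(result.source, result.latency_ms, len(memory.pairs))
```

The proposer receives the memory on every call, which is the point of the design: later iterations can draw on earlier slow-fast pairs rather than starting from scratch each time.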

Related benchmarks

Task                      Dataset  Result                      Rank
Conv2d kernel generation  SM90     Latency (ms): 0.0658        4
GEMM kernel generation    SM90     Latency (ms): 0.319         4
Top-K kernel generation   SM90     Latency (ms): 0.1276        4
Conv2d kernel generation  SM120    Latency (ms): 0.0822        4
GEMM kernel generation    SM120    Latency (ms): 0.3933        4
FMHA kernel generation    SM90     Latency (ms): 5.1499        3
FMHA kernel generation    SM120    FMHA Latency (ms): 3.6366   3
Top-K kernel generation   SM120    Latency (ms): 0.0192        3
