Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration

About

We present CadLLM, a training-free method to improve the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields a 1.1-2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.
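The two ideas in the abstract, confidence-driven scheduling and a softmax over a vocabulary subset, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Schedule`, `adapt_schedule`, and `subset_softmax`, the linear interpolation rule, the direction of each adjustment (higher confidence gives larger blocks, fewer steps, and a lower threshold), and all default bounds are assumptions made for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class Schedule:
    block_size: int   # tokens generated per block
    steps: int        # denoising steps spent on the block
    threshold: float  # confidence needed to unmask a token


def lerp(lo: float, hi: float, t: float) -> float:
    """Linear interpolation between lo (t=0) and hi (t=1)."""
    return lo + (hi - lo) * t


def adapt_schedule(avg_conf: float,
                   min_block: int = 8, max_block: int = 64,
                   min_steps: int = 4, max_steps: int = 32,
                   min_thr: float = 0.7, max_thr: float = 0.95) -> Schedule:
    """Map the average confidence of unmasked tokens to a schedule.

    ASSUMPTION: a simple linear rule. Higher confidence yields a larger
    block, fewer steps, and a lower unmasking threshold. The paper's
    actual rule and bounds are not given in the abstract.
    """
    c = min(max(avg_conf, 0.0), 1.0)  # clamp to [0, 1]
    return Schedule(
        block_size=round(lerp(min_block, max_block, c)),
        steps=round(lerp(max_steps, min_steps, c)),
        threshold=lerp(max_thr, min_thr, c),
    )


def subset_softmax(logits: list[float], k: int) -> dict[int, float]:
    """Softmax restricted to the k highest-logit vocabulary entries.

    ASSUMPTION: the subset is chosen by top-k logit. Returns a mapping
    from token index to probability; entries outside the subset are
    omitted (treated as zero probability).
    """
    k = max(1, min(k, len(logits)))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}
```

In a decoding loop, one might call `adapt_schedule` once per block with the mean confidence of the tokens just unmasked, and shrink or grow `k` in `subset_softmax` in the same way. How CadLLM actually couples these controls is not specified in the abstract.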

Jucheng Shen, Gaurav Sarkar, Yeonju Ro, Sharath Nittur Sridhar, Zhangyang Wang, Aditya Akella, Souvik Kundu • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8k 5-shot	Accuracy 78.01	54
Code Generation	HumanEval 0-shot (test)	Accuracy 43.29	23
Code Generation	MBPP 3-shot pass@1	Accuracy 24	6
