FlashSampling: Fast and Memory-Efficient Exact Sampling

About

Sampling from a categorical distribution is mathematically simple, but in large-vocabulary decoding, it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that fuses sampling into the LM-head matmul and never materializes the logits tensor in HBM. The method is simple: compute logits tile-by-tile on chip, add Gumbel noise, keep only one maximizer per row and per vocabulary tile, and finish with a small reduction over tiles. In tensor-parallel decoding, FlashSampling replaces the all-gather of logits with streaming peer-to-peer writes: This overlaps GPU-to-GPU communication with computation and HBM loads across up to 8 GPUs, with near-ideal scaling at large batch sizes. Our kernel is exact because argmax decomposes over partitions; grouped variants for online and tensor-parallel settings are exact by hierarchical factorization of the categorical distribution. FlashSampling demonstrates kernel-level speedups on decode workloads across 4 different datacenter GPUs (H100, H200, B200, B300), and in end-to-end vLLM experiments, it reduces time per output token by up to $10\%$ on the models we test. These results show that exact sampling, with no approximation, can be integrated into the matmul itself, consolidating the bandwidth-bound sampling step in an efficient epilogue.

Tomas Ruiz, Zhen Qin, Yifan Zhang, Xuyang Shen, Yiran Zhong, Mengdi Wang• 2026

Related benchmarks

Task	Dataset	Result
Kernel Speedup	Synthetic Large Configuration (D=8192, V=128k)	Speedup1.88	108
Fused Matmul and Sampling	Synthetic D=4096, V=151k	Speedup vs Multinomial Sampling1.98	36
LLM Inference Performance	Synthetic Poisson Process Requests	TPOT Speedup (%)18.7	28

Showing 3 of 3 rows

Other info

GitHub

Follow for update

@wizwand_team Discord