AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth
About
Test-time scaling via recurrent/iterative Transformers lets language models spend more computation at inference, but most pretrained recurrent LMs run a fixed number of iterations, wasting compute on easy tokens and lacking token-wise adaptivity. Following the core ideas of Adaptive Computation Time (ACT) and Early Exit (EE), we propose AdaPonderLM, a self-supervised recurrent language model that learns token-wise early exiting during pretraining, without manually tuned per-token or per-layer pruning ratios. AdaPonderLM uses iteration-specific MLP gates with a monotonic halting mask to decide when each token stops recurring, and introduces a KV-reuse mechanism that serves cached key/value states for halted tokens, ensuring train-test consistency and practical acceleration. Across Pythia backbones from 70M to 410M (pretraining) and up to 2.8B (continued pretraining), AdaPonderLM reduces inference compute by about 10% while maintaining comparable language-modeling perplexity and competitive downstream accuracy. Our analysis shows that the learned gates allocate more computation to high-NLL (hard) tokens, exhibiting adaptive-computation-time behavior in a fully self-supervised setting. Moreover, under iso-FLOPs, the learned halting policy consistently outperforms fixed pruning, showing that AdaPonderLM allocates compute to the right tokens rather than merely reducing average depth.
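The core loop can be illustrated with a minimal sketch: each iteration recurs only on still-active tokens, an iteration-specific gate emits a halting probability, and a monotonic mask ensures halted tokens never resume (their cached state stands in for KV reuse). This is an illustrative NumPy toy, not the released implementation; `ponder_forward`, the 0.5 halting threshold, and the gate parameterization are assumptions for exposition.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ponder_forward(h, layer_fn, gate_params, max_iters):
    """Token-wise adaptive depth with a monotonic halting mask (toy sketch).

    h           : (seq_len, d) token hidden states
    layer_fn    : shared recurrent block applied at each iteration
    gate_params : list of (W, b), one iteration-specific MLP gate per step
    Returns final states and per-token iteration counts.
    """
    seq_len, _ = h.shape
    halted = np.zeros(seq_len, dtype=bool)  # monotonic: once True, stays True
    iters = np.zeros(seq_len, dtype=int)
    cache = h.copy()                        # stands in for cached KV states
    for t in range(max_iters):
        active = ~halted
        if not active.any():
            break
        # Recur only on active tokens; halted tokens reuse their cached state.
        h_new = h.copy()
        h_new[active] = layer_fn(h[active])
        h = np.where(active[:, None], h_new, cache)
        cache = h
        iters[active] += 1
        # Iteration-specific gate decides which active tokens halt now.
        W, b = gate_params[t]
        p_halt = sigmoid(h @ W + b).squeeze(-1)
        halted = halted | (active & (p_halt > 0.5))  # monotonic OR-update
    return h, iters

# Tiny demo with a tanh block and random gates.
rng = np.random.default_rng(0)
d, seq, T = 8, 4, 3
gates = [(rng.normal(size=(d, 1)), 0.0) for _ in range(T)]
out, iters = ponder_forward(rng.normal(size=(seq, d)), np.tanh, gates, T)
```

Every token runs at least one iteration and at most `max_iters`; in training, a halting-probability loss term would shape how early the gates fire, which this sketch omits.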
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | WinoGrande | Accuracy | 63.6 | 1085 |
| Question Answering | ARC-E | Accuracy | 70.9 | 416 |
| Question Answering | PIQA | Accuracy | 75.9 | 374 |
| Question Answering | SciQ | -- | -- | 283 |
| Sentence Completion | HellaSwag | Accuracy | 48.9 | 276 |
| Language Modeling | Lambada OpenAI | Accuracy | 68.3 | 127 |
| Reading Comprehension | RACE | Accuracy | 38.5 | 70 |
| Question Answering | ARC-C | Accuracy | 35.2 | 46 |
| Language Modeling | Lambada Standard | Accuracy | 59.8 | 36 |
| Mean Performance Evaluation | Downstream Tasks Summary | Average Accuracy | 61.1 | 36 |