AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth
About
Test-time scaling via recurrent/iterative Transformers lets language models spend more computation at inference, but most pretrained recurrent LMs run a fixed number of iterations, wasting compute on easy tokens and lacking token-wise adaptivity. Following the core ideas of Adaptive Computation Time (ACT) and Early Exit (EE), we propose AdaPonderLM, a self-supervised recurrent language model that learns token-wise early exiting during pretraining, without manually tuned per-token or per-layer pruning ratios. AdaPonderLM uses iteration-specific MLP gates with a monotonic halting mask to decide when each token stops recurring, and introduces a KV-reuse mechanism that serves cached key/value states for halted tokens, ensuring train-test consistency and practical acceleration. Across Pythia backbones from 70M to 410M (pretraining) and up to 2.8B (continued pretraining), AdaPonderLM reduces inference compute by about 10% while maintaining comparable language-modeling perplexity and competitive downstream accuracy. Our analysis shows that the learned gates allocate more computation to high-NLL (hard) tokens, exhibiting adaptive-computation-time behavior in a fully self-supervised setting. Moreover, under iso-FLOPs, the learned halting policy consistently outperforms fixed pruning, showing that AdaPonderLM allocates compute to the right tokens rather than merely reducing average depth.
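The core loop can be illustrated with a minimal sketch: each iteration recurs only on still-active tokens, an iteration-specific gate emits a halting probability, and a monotonic mask ensures halted tokens never resume (their cached state stands in for KV reuse). This is an illustrative NumPy toy, not the released implementation; `ponder_forward`, the 0.5 halting threshold, and the gate parameterization are assumptions for exposition.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ponder_forward(h, layer_fn, gate_params, max_iters):
    """Token-wise adaptive depth with a monotonic halting mask (toy sketch).

    h           : (seq_len, d) token hidden states
    layer_fn    : shared recurrent block applied at each iteration
    gate_params : list of (W, b), one iteration-specific MLP gate per step
    Returns final states and per-token iteration counts.
    """
    seq_len, _ = h.shape
    halted = np.zeros(seq_len, dtype=bool)  # monotonic: once True, stays True
    iters = np.zeros(seq_len, dtype=int)
    cache = h.copy()                        # stands in for cached KV states
    for t in range(max_iters):
        active = ~halted
        if not active.any():
            break
        # Recur only on active tokens; halted tokens reuse their cached state.
        h_new = h.copy()
        h_new[active] = layer_fn(h[active])
        h = np.where(active[:, None], h_new, cache)
        cache = h
        iters[active] += 1
        # Iteration-specific gate decides which active tokens halt now.
        W, b = gate_params[t]
        p_halt = sigmoid(h @ W + b).squeeze(-1)
        halted = halted | (active & (p_halt > 0.5))  # monotonic OR-update
    return h, iters

# Tiny demo with a tanh block and random gates.
rng = np.random.default_rng(0)
d, seq, T = 8, 4, 3
gates = [(rng.normal(size=(d, 1)), 0.0) for _ in range(T)]
out, iters = ponder_forward(rng.normal(size=(seq, d)), np.tanh, gates, T)
```

Every token runs at least one iteration and at most `max_iters`; in training, a halting-probability loss term would shape how early the gates fire, which this sketch omits.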
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | WinoGrande | Accuracy | 63.6 | 1085 |
| Question Answering | ARC-E | Accuracy | 70.9 | 416 |
| Question Answering | PIQA | Accuracy | 75.9 | 374 |
| Question Answering | SciQ | -- | -- | 283 |
| Sentence Completion | HellaSwag | Accuracy | 48.9 | 276 |
| Language Modeling | Lambada OpenAI | Accuracy | 68.3 | 127 |
| Reading Comprehension | RACE | Accuracy | 38.5 | 70 |
| Question Answering | ARC-C | Accuracy | 35.2 | 46 |
| Language Modeling | Lambada Standard | Accuracy | 59.8 | 36 |
| Mean Performance Evaluation | Downstream Tasks Summary | Average Accuracy | 61.1 | 36 |