Thinking into the Future: Latent Lookahead Training for Transformers

About

Autoregressive language models trained with next-token prediction generate text by sampling one discrete token at a time. Although very scalable, this objective forces the model to commit at every step, preventing it from exploring or reflecting upon multiple plausible continuations. Furthermore, the compute allocation across tokens is uniform: every token is produced by a single forward pass, potentially limiting the model's expressiveness in cases where difficult tokens inherently require more compute. To address these limitations, we introduce latent lookahead, a training strategy that enables models to "think" before generating: at selected positions in the sequence, before committing to the next token, the model performs a multi-step lookahead in latent space. More precisely, instead of sampling future tokens, we leverage the network's latent space by recursively feeding its hidden states back into the context for $\tau$ steps, investing more compute in predicting that token. This produces $\tau$ latent predictions that are supervised against the next $\tau$ ground-truth tokens, encouraging the model to "look ahead" and refine its prediction. We show that latent lookahead substantially outperforms both autoregressive and non-autoregressive baselines on planning tasks such as maze solving, Sudoku, and ProsQA, where foresight is essential.
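The rollout described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy "model" (a mean over inputs followed by tanh), the tied embedding/output table `EMBED`, the averaging of the loss over the $\tau$ steps, and all function names are assumptions made for illustration. It shows only the core idea: hidden states, rather than sampled tokens, are appended to the context for $\tau$ steps, and step $k$ is supervised against the $k$-th future ground-truth token.

```python
import math
import random

VOCAB, DIM = 8, 4
rng = random.Random(0)
# Hypothetical toy parameters: one embedding table, reused as the output head.
EMBED = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]

def last_hidden(inputs):
    """Stand-in for a causal transformer: hidden state at the last position.

    Assumption: a mean over all input vectors followed by tanh. A real model
    would use attention and MLP layers here.
    """
    n = len(inputs)
    return [math.tanh(sum(vec[j] for vec in inputs) / n) for j in range(DIM)]

def cross_entropy(hidden, target):
    """Negative log-likelihood of `target` under softmax(hidden @ EMBED^T)."""
    logits = [sum(h * e for h, e in zip(hidden, row)) for row in EMBED]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def latent_lookahead_loss(context, future, tau):
    """Roll out tau latent steps; supervise step k against future[k]."""
    inputs = [EMBED[t] for t in context]  # ordinary token embeddings
    latents, loss = [], 0.0
    for k in range(tau):
        h = last_hidden(inputs)
        loss += cross_entropy(h, future[k])
        latents.append(h)
        inputs.append(h)  # feed the hidden state back; no token is sampled
    return loss / tau, latents  # averaging over steps is an assumption

loss, latents = latent_lookahead_loss(context=[1, 5, 2], future=[3, 0, 7], tau=3)
```

In this sketch the context grows by one latent vector per step, so later latent predictions can condition on earlier ones. How the positions for lookahead are selected, and how the latent steps are combined with the ordinary next-token loss, is not specified in the abstract and is left out here.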

Lorenzo Noci, Gregor Bachmann, Seyed-Mohsen Moosavi-Dezfooli, Moin Nabi • 2026

Related benchmarks

Task	Dataset	Result
Math Reasoning	GSM8K	Accuracy: 39	254
Logical reasoning	BBH	Accuracy: 18.7	249
Math Reasoning	AQUA	Accuracy: 26.8	188
Logical reasoning	Sudoku	Accuracy: 11	142
Pathfinding	Maze	Accuracy: 21.5	3
Question Answering	ProsQA	Accuracy: 91.8	3
Sudoku Solving	Mini 4x4 Sudoku	Accuracy: 93.5	3
Sudoku Solving	Sudoku Full 9x9	Accuracy: 35.5	3

