Dynamic Latent Routing

About

We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDPs) with time-varying reward functions. We introduce General Dijkstra Search (GDS) and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning (SFT) across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.
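The abstract does not spell out how GDS works, so the following is only a point of reference: a minimal sketch of textbook Dijkstra shortest-path search, not the authors' GDS and not DLR. Mapping the "search, select, update" phrase onto the three commented steps of the loop is an assumption made here for illustration, and `toy_graph`, `dijkstra`, and `path_to` are hypothetical names. The toy example also shows the familiar shortest-path property that an optimal route to a goal is built from optimal routes to intermediate nodes, which is the flavor of composition the abstract describes.

```python
import heapq


def dijkstra(graph, source):
    """Textbook Dijkstra on a dict-of-dicts graph with non-negative edge costs.

    Illustrative only: this is NOT the paper's General Dijkstra Search.
    """
    dist = {source: 0.0}      # best known cost to each reached node
    parent = {source: None}   # back-pointers for path reconstruction
    frontier = [(0.0, source)]
    settled = set()

    while frontier:
        # "select" (assumed mapping): take the cheapest unsettled node
        cost, node = heapq.heappop(frontier)
        if node in settled:
            continue  # stale heap entry, a cheaper one was already settled
        settled.add(node)

        # "search" (assumed mapping): expand the node's outgoing edges
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            # "update" (assumed mapping): keep the cheaper route if found
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                parent[neighbor] = node
                heapq.heappush(frontier, (new_cost, neighbor))

    return dist, parent


def path_to(parent, goal):
    """Follow back-pointers from goal to the source and reverse them."""
    path = []
    node = goal
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


# Hypothetical toy graph: the best route to "g" reuses the best route to "b".
toy_graph = {
    "s": {"a": 1.0, "b": 4.0},
    "a": {"b": 2.0, "g": 6.0},
    "b": {"g": 1.0},
    "g": {},
}

dist, parent = dijkstra(toy_graph, "s")
print(path_to(parent, "g"), dist["g"])
```

In this toy graph the cheapest route to `g` passes through `b`, and its prefix is itself the cheapest route to `b`; the paper's setting (time-varying rewards in an MDP) is more general than this static-cost example.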

Fangyuan Yu, Xin Su, Amir Abdullah • 2026

Related benchmarks

Task	Dataset	Result
Question Answering	StrategyQA (test)	Task Accuracy 72.9	74
Question Answering	CSQA (test)	Accuracy 87.2	68
Question Answering	GSM8K (test)	Accuracy 84.3	35
Question Answering	SciQA (test)	Accuracy 80.6	30

