Dynamic Latent Routing

About

We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDPs) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning (SFT) across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.

Fangyuan Yu, Xin Su, Amir Abdullah • 2026
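
For readers unfamiliar with the "search, select, update" loop the abstract refers to, the sketch below shows the classical Dijkstra shortest-path algorithm, which follows the same three-step pattern. It is not the paper's GDS, which targets MDPs with time-varying rewards; the graph, node names, and the helper names `dijkstra` and `path_to` are assumptions made purely for illustration. The point it illustrates is the optimal-substructure idea: the best route to the goal is assembled from best routes to intermediate nodes.

```python
import heapq


def dijkstra(graph, source):
    """Classical Dijkstra on a dict-of-dicts graph with non-negative edge costs.

    Illustrative only: this is the textbook algorithm, not the paper's GDS.
    """
    dist = {source: 0.0}
    parent = {source: None}
    frontier = [(0.0, source)]
    settled = set()
    while frontier:
        # Select: take the cheapest node that has not been settled yet.
        cost, node = heapq.heappop(frontier)
        if node in settled:
            continue
        settled.add(node)
        # Search: expand the outgoing edges of the selected node.
        for nxt, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            # Update: keep the cheaper route and remember how we got there.
            if new_cost < dist.get(nxt, float("inf")):
                dist[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(frontier, (new_cost, nxt))
    return dist, parent


def path_to(parent, goal):
    """Rebuild the route to `goal` by following parent pointers backwards."""
    path = []
    while goal is not None:
        path.append(goal)
        goal = parent[goal]
    return path[::-1]


# Hypothetical toy graph: s is the start, g is the goal.
toy_graph = {
    "s": {"a": 1, "b": 4},
    "a": {"b": 2, "g": 6},
    "b": {"g": 1},
    "g": {},
}
dist, parent = dijkstra(toy_graph, "s")
route = path_to(parent, "g")
```

In this toy graph the cheapest route to `g` passes through `b`, and the prefix of that route ending at `b` is itself the cheapest route to `b`. That composition of optimal intermediate segments is the property the abstract says GDS extends to sub-policies.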

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Question Answering | StrategyQA (test) | Task Accuracy 72.9 | 74 |
| Question Answering | CSQA (test) | Accuracy 87.2 | 68 |
| Question Answering | GSM8K (test) | Accuracy 84.3 | 35 |
| Question Answering | SciQA (test) | Accuracy 80.6 | 30 |
