
Selective Off-Policy Reference Tuning with Plan Guidance

About

Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where all sampled rollouts fail. SORT adds a repair update for those failures without changing rollout generation: it derives a plan from the reference solution, compares token probabilities with and without that plan, and gives higher weight to tokens that become more predictable under plan conditioning. This turns all-wrong prompts into selective, structure-aware learning signals instead of uniform imitation. Across three backbones and eight reasoning benchmarks, SORT improves over GRPO and guidance baselines, with the largest gains on weaker models.
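The weighting idea in the summary above can be illustrated with a small sketch. This is not the authors' implementation: the softmax over log-probability gains, the `temperature` parameter, the weighted negative log-likelihood, and all function names are assumptions chosen for illustration, since the summary does not specify the exact weighting function or loss. The first function shows why a group-normalized advantage (as in GRPO-style methods) carries no signal when every rollout gets the same reward.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def all_rollouts_failed(rewards):
    """True when every sampled rollout received zero reward."""
    return len(rewards) > 0 and all(r == 0 for r in rewards)

def plan_gain_weights(logp_with_plan, logp_without_plan, temperature=1.0):
    """Hypothetical token weights: softmax over per-token log-prob gains.

    A token whose log-probability rises more under plan conditioning
    receives a larger weight. The softmax form is an assumption.
    """
    if len(logp_with_plan) != len(logp_without_plan):
        raise ValueError("log-prob lists must have the same length")
    gains = [a - b for a, b in zip(logp_with_plan, logp_without_plan)]
    top = max(gains)  # subtract the max for numerical stability
    exps = [math.exp((g - top) / temperature) for g in gains]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_nll(token_logps, weights):
    """Weighted negative log-likelihood over reference-solution tokens."""
    return -sum(w * lp for w, lp in zip(weights, token_logps))

# Toy per-token log-probs for a three-token reference solution (made-up numbers).
logp_plan = [-0.1, -0.5, -2.0]   # conditioned on the derived plan
logp_base = [-0.2, -2.5, -2.1]   # without the plan

rewards = [0, 0, 0]              # every rollout failed
if all_rollouts_failed(rewards):
    weights = plan_gain_weights(logp_plan, logp_base)
    loss = weighted_nll(logp_base, weights)
```

In this toy case the second token gains the most from plan conditioning, so it dominates the weighted loss, whereas uniform imitation would weight all three tokens equally.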

Duc Anh Le, Tien-Phat Nguyen, Thien Huu Nguyen, Linh Ngo Van, Trung Le • 2026

Related benchmarks

Task | Dataset | Result | Rank
Mathematical Reasoning | Competition-level Math Benchmarks (AIME24, AIME25, AMC23, MATH500, Olympiad, Minerva) | AIME 24 Score: 62.5 | 52
General Reasoning | GPQA-Diamond & MMLU-Pro | Accuracy: 72.1 | 35
