
Meta-Reinforcement Learning with Self-Reflection for Agentic Search

About

This paper introduces MR-Search, an in-context meta-reinforcement learning (RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single independent episode with sparse rewards, MR-Search trains a policy that conditions on past episodes and adapts its search strategy across episodes. In effect, MR-Search learns to learn a search strategy, allowing search agents to improve in-context exploration at test time. Specifically, MR-Search performs cross-episode exploration by generating an explicit self-reflection after each episode and using it as additional context to guide subsequent attempts. We further introduce a multi-turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine-grained credit assignment on each episode. Empirical results demonstrate the advantages of MR-Search over RL-based baselines, showing strong generalization and relative improvements of 9.2% to 19.3% across eight benchmarks. Our code and data are available at https://github.com/tengxiao1/MR-Search.

Teng Xiao, Yige Yuan, Hamish Ivison, Huaisheng Zhu, Faeze Brahman, Nathan Lambert, Pradeep Dasigi, Noah A. Smith, Hannaneh Hajishirzi • 2026
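
The two ideas in the abstract, cross-episode self-reflection and turn-level relative advantages, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the prompt layout, and the toy callables are hypothetical. The advantage function uses a plain group-relative normalization (in the style of GRPO) as a stand-in, because the abstract does not specify the exact estimator. It also assumes every rollout in a group has the same number of turns.

```python
from typing import Callable, List


def run_meta_episodes(
    question: str,
    policy: Callable[[str], str],
    reflect: Callable[[str, str], str],
    num_episodes: int = 3,
) -> List[str]:
    """Run several episodes, carrying each self-reflection into the next one.

    `policy` maps the current context to an answer (one search episode) and
    `reflect` maps (context, answer) to a textual self-reflection. Both are
    hypothetical stand-ins for calls to a language model.
    """
    context = question
    answers = []
    for _ in range(num_episodes):
        answer = policy(context)
        answers.append(answer)
        reflection = reflect(context, answer)
        # Assumed prompt layout: append the attempt and its reflection so the
        # next episode is conditioned on everything tried so far.
        context = f"{context}\n[previous answer] {answer}\n[reflection] {reflection}"
    return answers


def turn_level_relative_advantages(
    rewards: List[List[float]], eps: float = 1e-8
) -> List[List[float]]:
    """Normalize rewards across a group of rollouts, separately for each turn.

    rewards[g][t] is the reward of rollout g at turn t. Assumption: a
    GRPO-style mean/std normalization, not the paper's exact estimator.
    """
    num_turns = len(rewards[0])
    advantages = [[0.0] * num_turns for _ in rewards]
    for t in range(num_turns):
        column = [rollout[t] for rollout in rewards]
        mean = sum(column) / len(column)
        std = (sum((x - mean) ** 2 for x in column) / len(column)) ** 0.5
        for g, x in enumerate(column):
            advantages[g][t] = (x - mean) / (std + eps)
    return advantages


# Toy usage: the policy only succeeds once a reflection is in its context.
toy_policy = lambda ctx: "right" if "[reflection]" in ctx else "wrong"
toy_reflect = lambda ctx, ans: "the first query was too narrow; rephrase it"
toy_answers = run_meta_episodes("Who wrote X?", toy_policy, toy_reflect)
toy_advantages = turn_level_relative_advantages([[0.0, 1.0], [1.0, 1.0]])
```

In the toy run, the first episode fails and later episodes succeed because the reflection is present in the context, which mirrors the test-time adaptation the abstract describes. Normalizing per turn gives every turn its own learning signal instead of a single sparse reward at the end of the episode.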

Related benchmarks

Task | Dataset | Result | Rank
Multi-hop Question Answering | HotpotQA 2018 Wikipedia dump (dev) | Accuracy 46.8 | 14
Multi-hop Question Answering | 2wiki 2018 Wikipedia dump (dev) | Accuracy (%) 43.6 | 14
Multi-hop Question Answering | Musique 2018 Wikipedia dump (dev) | Accuracy 22.1 | 14
Multi-hop Question Answering | Bamboogle 2018 Wikipedia dump (dev) | Accuracy 45.2 | 14
Single-hop Question Answering | NQ (Natural Questions) 2018 Wikipedia dump (dev) | Accuracy 50.2 | 14
Single-hop Question Answering | TriviaQA 2018 Wikipedia dump (dev) | Accuracy 66.6 | 14
Single-hop Question Answering | PopQA 2018 Wikipedia dump (dev) | Accuracy 47.2 | 14