Meta-Reinforcement Learning with Self-Reflection for Agentic Search

About

This paper introduces MR-Search, an in-context meta reinforcement learning (RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single independent episode with sparse rewards, MR-Search trains a policy that conditions on past episodes and adapts its search strategy across episodes. MR-Search learns to learn a search strategy with self-reflection, allowing search agents to improve in-context exploration at test time. Specifically, MR-Search performs cross-episode exploration by generating explicit self-reflections after each episode and using them as additional context to guide subsequent attempts. We further introduce a multi-turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine-grained credit assignment for each episode. Empirical results demonstrate the advantages of MR-Search over RL-based baselines, showing strong generalization and relative improvements of 9.2% to 19.3% across eight benchmarks. Our code and data are available at https://github.com/tengxiao1/MR-Search.
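The cross-episode loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: `run_with_reflection`, `Episode`, `toy_attempt`, and `toy_reflect` are hypothetical names, the prompt layout is an assumption, and the toy callables stand in for a real search policy and reflection step. It only shows how each episode's answer and self-reflection might be appended to the context for the next attempt.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Episode:
    answer: str
    reflection: str


def run_with_reflection(
    question: str,
    attempt: Callable[[str], str],       # hypothetical: context -> answer
    reflect: Callable[[str, str], str],  # hypothetical: (context, answer) -> reflection
    num_episodes: int = 3,
) -> List[Episode]:
    """Run several episodes, feeding each reflection into the next attempt."""
    context = question
    history: List[Episode] = []
    for _ in range(num_episodes):
        answer = attempt(context)
        reflection = reflect(context, answer)
        history.append(Episode(answer, reflection))
        # Assumed prompt layout: earlier attempts and reflections are
        # appended as plain text so the next episode can condition on them.
        context = (
            f"{context}\nPrevious answer: {answer}\nReflection: {reflection}"
        )
    return history


# Toy stand-ins (assumptions) so the sketch runs without a model.
def toy_attempt(context: str) -> str:
    # Answers with the number of reflections currently visible in the context.
    return str(context.count("Reflection:"))


def toy_reflect(context: str, answer: str) -> str:
    return f"attempt {answer} was checked"


history = run_with_reflection("Who wrote Dune?", toy_attempt, toy_reflect)
```

With the toy callables, each later episode sees one more reflection than the one before it, which is the in-context adaptation the abstract refers to. The turn-level advantage estimation used for training is not covered by this sketch.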

Teng Xiao, Yige Yuan, Hamish Ivison, Huaisheng Zhu, Faeze Brahman, Nathan Lambert, Pradeep Dasigi, Noah A. Smith, Hannaneh Hajishirzi • 2026

Related benchmarks

Task	Dataset	Result
Single-hop Question Answering	PopQA	EM 47.2	186
Single-hop Question Answering	TriviaQA	EM 66.6	133
Question Answering	NQ (test)	EM Accuracy 47.7	133
Question Answering	PopQA (test)	Accuracy 46	111
Question Answering	TriviaQA (test)	EM 63.5	80
Question Answering	MuSiQue (test)	EM 16.5	76
Multi-hop Question Answering	HotpotQA	Exact Match (EM) 46.8	66
Single-hop Question Answering	NQ	Exact Match (EM) 50.2	60
Multi-hop Question Answering	Bamboogle	Exact Match (EM) 45.2	55
Multi-hop Question Answering	MuSiQue	Exact Match (EM) 22.1	51

Showing 10 of 21 rows
