Meta-Reinforcement Learning with Self-Reflection for Agentic Search
About
This paper introduces MR-Search, an in-context meta reinforcement learning (RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single independent episode with sparse rewards, MR-Search trains a policy that conditions on past episodes and adapts its search strategy across episodes. MR-Search learns to learn a search strategy with self-reflection, allowing search agents to improve in-context exploration at test time. Specifically, MR-Search performs cross-episode exploration by generating explicit self-reflections after each episode and leveraging them as additional context to guide subsequent attempts. We further introduce a multi-turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine-grained credit assignment on each episode. Empirical results demonstrate the advantages of MR-Search over RL-based baselines, showing strong generalization and relative improvements of 9.2% to 19.3% across eight benchmarks. Our code and data are available at https://github.com/tengxiao1/MR-Search.
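The two ideas in the abstract, reflection-conditioned retries and a turn-level relative advantage, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `run_meta_episode`, `turn_level_relative_advantages`, `policy`, and `reflect` are hypothetical, and the group-normalized advantage (reward minus the group mean, divided by the group standard deviation, computed separately per turn) is an assumed stand-in, since the abstract does not give the exact estimator.

```python
from statistics import mean, pstdev
from typing import Callable, List, Tuple

# Hypothetical types: a context is a list of (answer, reflection) pairs
# accumulated from earlier episodes on the same question.
Context = List[Tuple[str, str]]


def run_meta_episode(
    question: str,
    policy: Callable[[str, Context], str],
    reflect: Callable[[str, str, Context], str],
    num_episodes: int = 3,
) -> Tuple[List[str], Context]:
    """Run several episodes on one question, feeding each episode's
    self-reflection back in as context for the next attempt."""
    context: Context = []
    answers: List[str] = []
    for _ in range(num_episodes):
        answer = policy(question, context)              # one search episode
        reflection = reflect(question, answer, context)  # explicit self-reflection
        context.append((answer, reflection))             # carried to the next episode
        answers.append(answer)
    return answers, context


def turn_level_relative_advantages(
    rewards: List[List[float]], eps: float = 1e-8
) -> List[List[float]]:
    """rewards[i][t] is the reward of rollout i at turn t.

    Assumed estimator: normalize each turn's rewards across the group of
    rollouts, so every turn receives its own dense, relative signal.
    """
    num_turns = len(rewards[0])
    advantages = [[0.0] * num_turns for _ in rewards]
    for t in range(num_turns):
        column = [r[t] for r in rewards]
        mu, sigma = mean(column), pstdev(column)
        for i, r in enumerate(column):
            advantages[i][t] = (r - mu) / (sigma + eps)
    return advantages
```

With a toy policy that answers correctly only once a reflection is available, the first attempt fails and later attempts succeed, which is the kind of cross-episode adaptation the abstract describes. In the advantage sketch, a turn on which all rollouts earn the same reward yields zero advantage for that turn, so the learning signal concentrates on turns where the rollouts actually differ.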
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Single-hop Question Answering | PopQA | EM | 47.2 | 186 |
| Single-hop Question Answering | TriviaQA | EM | 66.6 | 133 |
| Question Answering | NQ (test) | EM Accuracy | 47.7 | 133 |
| Question Answering | PopQA (test) | Accuracy | 46 | 111 |
| Question Answering | TriviaQA (test) | EM | 63.5 | 80 |
| Question Answering | MuSiQue (test) | EM | 16.5 | 76 |
| Multi-hop Question Answering | HotpotQA | Exact Match (EM) | 46.8 | 66 |
| Single-hop Question Answering | NQ | Exact Match (EM) | 50.2 | 60 |
| Multi-hop Question Answering | Bamboogle | Exact Match (EM) | 45.2 | 55 |
| Multi-hop Question Answering | MuSiQue | Exact Match (EM) | 22.1 | 51 |