$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

About

Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information: self-play can supply high-quality privileged information for self-distillation at low cost and at scale, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play ($\pi$-Play), a novel multi-agent self-evolution framework that combines self-play and self-distillation. In $\pi$-Play, an examiner generates tasks together with their QCPs, and a teacher uses each QCP as privileged context to densely supervise a student via self-distillation. This design transforms sparse-reward self-play into dense-feedback co-evolution. Extensive experiments show that data-free $\pi$-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2-3$\times$ over conventional self-play. Code is available at https://github.com/zhyaoch/pi-play.
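
The dense supervision described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the distillation objective, so the per-token KL divergence used here is an assumption, and the function names (`log_softmax`, `token_kl`, `dense_distillation_loss`) are hypothetical. The sketch assumes the teacher's logits were computed with the QCP in context and the student's logits without it, over the same token positions of one trajectory.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position (assumed objective)."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def dense_distillation_loss(teacher_logits_seq, student_logits_seq):
    """Mean per-token KL over a trajectory.

    teacher_logits_seq: logits from the model conditioned on question + QCP
                        (privileged context; hypothetical input here).
    student_logits_seq: logits from the model conditioned on the question only.
    Returns (mean_loss, per_token_losses), so every token carries a signal,
    in contrast to a single outcome reward per trajectory.
    """
    if len(teacher_logits_seq) != len(student_logits_seq):
        raise ValueError("teacher and student must cover the same token positions")
    if not teacher_logits_seq:
        raise ValueError("trajectory must contain at least one token")
    per_token = [
        token_kl(t, s) for t, s in zip(teacher_logits_seq, student_logits_seq)
    ]
    return sum(per_token) / len(per_token), per_token


if __name__ == "__main__":
    # Toy 3-word vocabulary, 2 token positions (made-up numbers for illustration).
    teacher = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
    student = [[0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
    mean_loss, per_token = dense_distillation_loss(teacher, student)
    print(mean_loss, per_token)
```

In this toy run the student already matches the teacher at the second position, so only the first position contributes to the loss; that per-position feedback is what the abstract contrasts with sparse outcome rewards.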

Yaocheng Zhang, Yuanheng Zhu, Wenyue Chong, Songjun Tu, Qichao Zhang, Jiajun Chai, Xiaohan Wang, Wei Lin, Guojun Yin, Dongbin Zhao • 2026

Related benchmarks

Task	Dataset	Result
Multi-hop Question Answering	2WikiMQA	--	161
General Question Answering	NQ	Exact Match (EM) 43	52
General Question Answering	TriviaQA	Score 64.6	16
Multi-hop Question Answering	MuSiQue	Score 13.4	16
Multi-hop Question Answering	Bamboogle	Score 44	16
Multi-hop Question Answering	HotpotQA	HotpotQA Score 38.9	15
Question Answering	Combined NQ, TriviaQA, PopQA, HotpotQA, 2WikiMQA, MuSiQue, Bamboogle	Total Score 280.3	15
