SPICE: Self-Play In Corpus Environments Improves Reasoning

About

Self-improving systems require environmental interaction for continuous adaptation. We introduce SPICE (Self-Play In Corpus Environments), a reinforcement learning framework where a single model acts in two roles: a Challenger that mines documents from a large corpus to generate diverse reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics, the Challenger creates an automatic curriculum at the frontier of the Reasoner's capability, while corpus grounding provides the rich, near-inexhaustible external signal necessary for sustained improvement. Unlike existing ungrounded self-play methods that offer more limited benefits, SPICE achieves consistent gains across mathematical (+8.9%) and general reasoning (+9.8%) benchmarks on multiple model families. Our analysis reveals how document grounding is a key ingredient in SPICE to continuously generate its own increasingly challenging goals and achieve them, enabling sustained self-improvement.
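As a rough illustration of what a curriculum "at the frontier of the Reasoner's capability" could mean, the sketch below scores a task by how close the Reasoner's pass rate is to 50%. This is an assumption made for illustration only: the abstract does not state the reward SPICE uses, and the names `frontier_reward` and `estimate_pass_rate` and the `solve` callable are hypothetical.

```python
from typing import Callable


def frontier_reward(pass_rate: float) -> float:
    """Hypothetical Challenger reward (not taken from the abstract).

    Returns 1.0 when the Reasoner solves the task half the time and
    falls linearly to 0.0 when the task is always or never solved.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be in [0, 1]")
    return 1.0 - abs(2.0 * pass_rate - 1.0)


def estimate_pass_rate(
    solve: Callable[[str], str], task: str, answer: str, attempts: int = 8
) -> float:
    """Fraction of attempts in which a stand-in Reasoner matches the reference answer."""
    correct = sum(1 for _ in range(attempts) if solve(task) == answer)
    return correct / attempts


# Stand-in Reasoner that always answers "4" (a placeholder, not a model).
always_four = lambda task: "4"

easy = estimate_pass_rate(always_four, "What is 2 + 2?", "4")
hard = estimate_pass_rate(always_four, "What is 17 * 23?", "391")
print(frontier_reward(easy), frontier_reward(hard))
```

Under this assumed reward, a task the Reasoner always solves and a task it never solves both score zero, so a Challenger trained on it would be pushed toward tasks of intermediate difficulty.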

Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, Jason Weston • 2025

Related benchmarks

Task                    Dataset    Result                Rank
Mathematical Reasoning  GSM8K      Accuracy 93.9         1398
Mathematical Reasoning  MATH 500   Accuracy (Acc) 79.4   543
Mathematical Reasoning  AIME 2024  Accuracy 18.4         479
Mathematical Reasoning  MATH 500   Accuracy 81.8         442
Mathematical Reasoning  AIME 2024  Accuracy 15.2         370
Mathematical Reasoning  AMC        Accuracy (%) 70       368
Mathematical Reasoning  GSM8K      Accuracy (Acc) 92.7   337
Mathematical Reasoning  AIME 2025  Accuracy 19.1         311
Mathematical Reasoning  AIME 2024  Pass@1 Accuracy 12.2  236
Mathematical Reasoning  AMC        Accuracy 70           221
Showing 10 of 35 rows
