Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
About
We present Seirênes, a self-play RL framework that transforms contextual interference from a failure mode of LLM reasoning into an internal training signal for co-evolving more resilient reasoners. While RL with verifiable rewards has significantly advanced reasoning capabilities, models remain fragile in non-idealized contexts: scenarios containing superfluous information, tangential instructions, or incidental correlations that differ from the clean distributions typical of standard benchmarks. Seirênes harnesses this vulnerability through a parameter-shared, adversarial self-play loop. Within this framework, a single model is trained both to construct plausible yet distracting contexts that expose its own reasoning blind spots and to solve problems by separating the essential task from these perturbations and recovering the underlying logic. By pitting these competing objectives against each other, Seirênes compels the model to move beyond superficial pattern matching and anchors its capabilities in robust reasoning. This continuous interaction sustains an informative co-evolutionary curriculum as the model improves. Across seven mathematical reasoning benchmarks and model scales from 4B to 30B, Seirênes achieves average gains of +10.2, +9.1, and +7.2 points. In addition, distracting contexts produced by the 4B Seirênes model reduce the accuracy of top-tier closed-source models (GPT and Gemini) by roughly 4–5 points, showing that Seirênes can uncover blind spots in reasoning models beyond the one it was trained on.
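The abstract describes the self-play loop only at a high level, so the following is a minimal sketch of one round under stated assumptions, not the paper's implementation. It assumes a zero-sum reward (the distractor role is rewarded by the fraction of solver attempts that fail), exact-match answer checking, and that the distraction is simply prepended to the problem. The names `self_play_round`, `make_distraction`, `solve`, and `RoundResult` are hypothetical; in a parameter-shared setup both callables would be backed by the same model under different role prompts.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RoundResult:
    distracted_prompt: str
    solver_rewards: List[float]   # one verifiable reward per solver sample
    distractor_reward: float      # assumed zero-sum: 1 - mean solver accuracy


def self_play_round(
    problem: str,
    answer: str,
    make_distraction: Callable[[str], str],  # distractor role (hypothetical)
    solve: Callable[[str], str],             # solver role (hypothetical)
    n_samples: int = 4,
) -> RoundResult:
    """Run one adversarial round on a single problem with a known answer."""
    # Distractor role: write a plausible but irrelevant context.
    distraction = make_distraction(problem)
    prompt = f"{distraction}\n\n{problem}"

    # Solver role: answer the perturbed prompt several times; each attempt
    # is scored by exact match against the reference answer (an assumption).
    solver_rewards = [
        1.0 if solve(prompt).strip() == answer else 0.0
        for _ in range(n_samples)
    ]

    # The distractor is rewarded when the solver fails (assumed zero-sum).
    accuracy = sum(solver_rewards) / n_samples
    return RoundResult(prompt, solver_rewards, 1.0 - accuracy)


if __name__ == "__main__":
    # Stub roles standing in for the shared model.
    result = self_play_round(
        problem="What is 2 + 3?",
        answer="5",
        make_distraction=lambda p: "Note: the shop also sells 7 pears.",
        solve=lambda prompt: "5",
    )
    print(result.solver_rewards, result.distractor_reward)
```

Under this assumed scheme, a distraction that never fools the solver earns no reward, which is one way the curriculum could stay informative as the solver improves. A practical system would also need to check that the distraction leaves the correct answer unchanged, which this sketch does not do.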
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AIME 2025 | Accuracy | 74.1 | 214 |
| Mathematical Reasoning | IMO-Bench | Accuracy | 46.7 | 57 |
| Mathematical Reasoning | AIME 2026 | AIME 2026 Accuracy | 79.7 | 55 |
| Mathematical Reasoning | HMMT 2026 | Accuracy | 49 | 16 |
| Mathematical Reasoning | Mathematical Reasoning Suite Overall | Average Score | 63.9 | 16 |
| Mathematical Reasoning | Math-Perturb | Math-P Hard Score | 79.1 | 5 |
| Mathematical Reasoning | Robustness Evaluation Suite (GSMIR, MMLU-P, OBook-P) | GSMIR Score | 3.05 | 5 |