
Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

About

We present Seirênes, a self-play RL framework that turns contextual interference, a failure mode of LLM reasoning, into an internal training signal for co-evolving more resilient reasoners. Although reinforcement learning with verifiable rewards has substantially advanced reasoning capabilities, models can still be fragile in non-idealized contexts: scenarios containing superfluous information, tangential instructions, or incidental correlations that depart from the clean distributions of standard benchmarks. Seirênes exploits this vulnerability through a parameter-shared, adversarial self-play loop. A single model is trained both to construct plausible yet distracting contexts that expose its own reasoning blind spots and to solve problems by separating the essential task from these perturbations and recovering the core underlying logic. Pitting these competing objectives against each other forces the model beyond superficial pattern matching and anchors its capabilities in robust reasoning. This continuous interaction sustains an informative co-evolutionary curriculum as the model improves. Across seven mathematical reasoning benchmarks and model scales from 4B to 30B, Seirênes achieves average gains of +10.2, +9.1, and +7.2 points. In addition, distracting contexts produced by the 4B Seirênes model reduce the accuracy of top-tier closed-source models (GPT and Gemini) by roughly 4 to 5 points, indicating that Seirênes can uncover the blind spots of reasoning models more generally.
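
To make the two-role loop described above more concrete, here is a minimal Python sketch of how rewards might be assigned when one parameter-shared policy both constructs distracting contexts and solves them. It is not the authors' implementation: the role strings, the `Rollout` type, the validity check, and the zero-sum constructor reward (one minus solver accuracy, granted only when the rewrite preserves the original answer) are all illustrative assumptions, and `toy_policy` is a hypothetical stand-in for an LLM.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical shared policy: one callable (standing in for one set of model
# weights) whose behavior depends on the role it is asked to play.
Policy = Callable[[str, str], str]


@dataclass
class Rollout:
    answer: str  # final answer extracted from one solver sample


def solver_rewards(rollouts: List[Rollout], reference: str) -> List[float]:
    """Verifiable reward: 1.0 if the extracted answer matches the reference."""
    return [1.0 if r.answer.strip() == reference.strip() else 0.0 for r in rollouts]


def constructor_reward(solver_scores: List[float], is_valid: bool) -> float:
    """Assumed adversarial reward for the context-construction role.

    The constructor earns credit for lowering solver accuracy, but only if the
    perturbed problem keeps the original verifiable answer; an invalid rewrite
    earns nothing, so the constructor cannot win by simply breaking the task.
    """
    if not is_valid:
        return 0.0
    return 1.0 - mean(solver_scores)


def self_play_round(
    policy: Policy,
    problem: str,
    reference: str,
    is_valid: Callable[[str, str], bool],
    n_samples: int = 4,
) -> Tuple[str, List[float], float]:
    """One round in which the same policy plays both roles."""
    distracted = policy("construct", problem)  # role 1: inject distractions
    rollouts = [Rollout(policy("solve", distracted)) for _ in range(n_samples)]  # role 2: solve
    scores = solver_rewards(rollouts, reference)
    return distracted, scores, constructor_reward(scores, is_valid(problem, distracted))


def toy_policy(role: str, text: str) -> str:
    # Toy stand-in for an LLM: the constructor appends an irrelevant number,
    # and the "solver" is fooled whenever that distractor is present.
    if role == "construct":
        return text + " The shop also sells 12 pencils."
    return "12" if "pencils" in text else "5"


problem = "Ann has 2 apples and buys 3 more. How many apples does she have?"
ctx, scores, r_con = self_play_round(
    toy_policy, problem, "5", is_valid=lambda clean, new: new.startswith(clean)
)
print(scores, r_con)  # [0.0, 0.0, 0.0, 0.0] 1.0
```

In a real RL setup both rewards would drive a policy-gradient update of the same shared weights, a step omitted here. The plain "one minus accuracy" constructor reward is only one possible design; a scheme that favors contexts of intermediate difficulty would be an equally plausible assumption.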

Chi Zhang, Haibo Qiu, Qiming Zhang, Yufei Xu, Xinbo Gao, Jing Zhang • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Mathematical Reasoning | AIME 2025 | Accuracy | 74.1 | 214
Mathematical Reasoning | IMO-Bench | Accuracy | 46.7 | 57
Mathematical Reasoning | AIME 2026 | Accuracy | 79.7 | 55
Mathematical Reasoning | HMMT 2026 | Accuracy | 49 | 16
Mathematical Reasoning | Mathematical Reasoning Suite Overall | Average Score | 63.9 | 16
Mathematical Reasoning | Math-Perturb | Math-P Hard Score | 79.1 | 5
Mathematical Reasoning | Robustness Evaluation Suite (GSMIR, MMLU-P, OBook-P) | GSMIR Score | 3.05 | 5
