Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution
About
We show that verifier-free evolution is bottlenecked by both diversity and efficiency: without external correction, repeated evolution accelerates collapse toward narrow modes, while uniformly relying on a high-cost model wastes compute and quickly becomes economically impractical. We introduce Squeeze Evolve, a unified multi-model orchestration framework for verifier-free evolutionary inference. Our approach is guided by a simple principle: allocate model capability where it has the highest marginal utility. Stronger models are reserved for high-impact stages, while cheaper models handle the remaining stages at much lower cost. This principle addresses diversity and cost-efficiency jointly while remaining lightweight. Squeeze Evolve naturally supports open-source, closed-source, and mixed-model deployments. Across AIME 2025, HMMT 2025, LiveCodeBench V6, GPQA-Diamond, ARC-AGI-V2, and multimodal vision benchmarks such as MMMU-Pro and BabyVision, Squeeze Evolve consistently improves the cost-capability frontier over single-model evolution and achieves new state-of-the-art results on several tasks. Empirically, Squeeze Evolve reduces API cost by up to $\sim$3$\times$ and increases fixed-budget serving throughput by up to $\sim$10$\times$. Moreover, on discovery tasks, Squeeze Evolve is the first verifier-free evolutionary method to match, and in some cases exceed, the performance of verifier-based evolutionary methods.
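
The allocation principle above can be made concrete as a per-stage routing table over an evolutionary loop. The sketch below is a minimal illustration, not the authors' implementation: the stage split (`propose`, `critique`, `refine`), the model names, the prices, and the `call_llm` stub are all hypothetical placeholders for whatever orchestration and API client a real deployment would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    usd_per_1k_tokens: float  # illustrative price, not a real quote


# Hypothetical tiers: reserve the strong model for high-impact stages,
# route high-volume stages to the cheap model.
STRONG = ModelSpec("strong-reasoner", 0.060)
CHEAP = ModelSpec("cheap-generator", 0.002)

# Stage -> model routing table (this stage split is an assumption for
# illustration; the paper's actual stages may differ).
ROUTING = {
    "propose": CHEAP,   # many diverse candidates: low marginal utility per call
    "critique": CHEAP,  # lightweight self-ranking, no external verifier
    "refine": STRONG,   # single high-impact rewrite of the survivor
}


def call_llm(model: ModelSpec, prompt: str) -> str:
    """Stand-in for a real chat-completion call; swap in your own client."""
    return f"[{model.name}] {prompt.splitlines()[0][:60]}"


def evolve(problem: str, generations: int = 3, population: int = 8) -> str:
    """Verifier-free evolutionary loop with per-stage model allocation."""
    survivor = problem
    for _ in range(generations):
        # Diversity comes from the cheap, high-volume proposal stage.
        candidates = [
            call_llm(ROUTING["propose"], f"Propose a solution:\n{survivor}")
            for _ in range(population)
        ]
        # Selection is also model-internal (verifier-free): the cheap model
        # picks the most promising candidate.
        best = call_llm(
            ROUTING["critique"],
            "Pick the single best candidate:\n" + "\n---\n".join(candidates),
        )
        # Spend the strong model only where its marginal utility is highest.
        survivor = call_llm(ROUTING["refine"], f"Improve this solution:\n{best}")
    return survivor


if __name__ == "__main__":
    print(evolve("Pack 26 circles in a unit square to maximize the sum of radii."))
```

Under a routing like this, per-generation spend on the strong model is a single call rather than a `population`-sized fan-out, which illustrates (under these assumed prices) how such a split could shift the cost profile in the direction of the reported savings.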
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reasoning | GPQA Diamond | Accuracy | 83.6 | 135 |
| Circle packing | Circle Packing (n=26) | Sum of Radii | 2.6359 | 9 |
| Code Generation | LCB v6 | Throughput (Req/s) | 31.42 | 6 |
| Mathematics | AIME 25 | Throughput (Req/s) | 39.43 | 6 |
| Mathematics | HMMT 2025 | Throughput (Req/s) | 16.83 | 6 |
| Question Answering | GPQA Diamond | Throughput (Req/s) | 86.3 | 6 |
| Reasoning | ARC-AGI public evaluation set V2 | Accuracy | 97.5 | 6 |
| Mathematical Reasoning | HMMT 2025 (test) | Accuracy | 93.1 | 4 |
| Multimodal Vision Evaluation | MMMU-Pro | Accuracy | 79.06 | 2 |
| Code Generation | LiveCodeBench v6 | Accuracy | 75.6 | 2 |