Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution

About

We show that verifier-free evolution is bottlenecked by both diversity and efficiency: without external correction, repeated evolution accelerates collapse toward narrow modes, while uniformly relying on a single high-cost model wastes compute and quickly becomes economically impractical. We introduce Squeeze Evolve, a unified multi-model orchestration framework for verifier-free evolutionary inference. Our approach is guided by a simple principle: allocate model capability where it has the highest marginal utility. Stronger models are reserved for high-impact stages, while cheaper models handle the remaining stages at much lower cost. This principle addresses diversity and cost-efficiency jointly while remaining lightweight. Squeeze Evolve naturally supports open-source, closed-source, and mixed-model deployments. Across AIME 2025, HMMT 2025, LiveCodeBench V6, GPQA-Diamond, ARC-AGI-V2, and multimodal vision benchmarks such as MMMU-Pro and BabyVision, Squeeze Evolve consistently improves the cost-capability frontier over single-model evolution and achieves new state-of-the-art results on several tasks. Empirically, Squeeze Evolve reduces API cost by up to ~3× and increases fixed-budget serving throughput by up to ~10×. Moreover, on discovery tasks, Squeeze Evolve is the first verifier-free evolutionary method to match, and in some cases exceed, the performance of verifier-based evolutionary methods.
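To make the allocation principle concrete, the sketch below shows one way a verifier-free evolutionary loop could route stages to models of different cost: a cheap model seeds and mutates the population, while the strong model is reserved for the high-impact ranking and refinement steps. This is a minimal illustrative sketch under assumed interfaces (call_cheap_model, call_strong_model, and the particular stage split are hypothetical); it is not the authors' released Squeeze Evolve implementation.

```python
# Hypothetical sketch: stage-wise model allocation in a verifier-free
# evolutionary loop. Function names and the stage split are illustrative
# assumptions, not the Squeeze Evolve codebase.
import random
from typing import Callable, List


def evolve(
    task_prompt: str,
    call_cheap_model: Callable[[str], str],   # low-cost LLM: bulk generation / mutation
    call_strong_model: Callable[[str], str],  # high-cost LLM: ranking and refinement
    population_size: int = 8,
    generations: int = 3,
) -> str:
    # Stage 1 (cheap): seed a diverse initial population.
    population: List[str] = [
        call_cheap_model(f"{task_prompt}\nPropose a distinct candidate solution #{i}.")
        for i in range(population_size)
    ]

    for _ in range(generations):
        # Stage 2 (cheap): mutate every candidate to keep diversity high.
        mutated = [
            call_cheap_model(f"{task_prompt}\nImprove this candidate:\n{c}")
            for c in population
        ]
        pool = population + mutated

        # Stage 3 (strong, verifier-free): the strong model picks the best
        # candidate by self-assessment instead of an external verifier, and is
        # only called once per generation to limit cost.
        ranking_prompt = (
            f"{task_prompt}\nPick the best candidate; reply with its index only.\n"
            + "\n".join(f"[{i}] {c}" for i, c in enumerate(pool))
        )
        try:
            best_idx = int(call_strong_model(ranking_prompt).strip().split()[0])
        except (ValueError, IndexError):
            best_idx = 0  # fall back if the reply is unparseable

        # Stage 4 (strong): one high-impact refinement of the chosen candidate.
        best = call_strong_model(
            f"{task_prompt}\nRefine this candidate into a stronger solution:\n{pool[best_idx]}"
        )

        # Rebuild the population around the refined winner plus cheap variants.
        population = [best] + random.sample(pool, k=population_size - 1)

    return population[0]


if __name__ == "__main__":
    # Stub models so the sketch runs without any API access.
    cheap = lambda p: f"cheap-answer({hash(p) % 1000})"
    strong = lambda p: "0" if "index" in p else f"strong-answer({hash(p) % 1000})"
    print(evolve("Maximize the sum of radii of 26 packed circles.", cheap, strong))
```

Under this split, the strong model is invoked roughly twice per generation regardless of population size, which is what drives the cost reduction relative to running every stage on the expensive model.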

Monishwaran Maheswaran, Leon Lakhani, Zhongzhu Zhou, Shijia Yang, Junxiong Wang, Coleman Hooper, Yuezhou Hu, Rishabh Tiwari, Jue Wang, Harman Singh, Qingyang Wu, Yuqing Jian, Ce Zhang, Kurt Keutzer, Tri Dao, Xiaoxia Wu, Ben Athiwaratkun, James Zou, Chenfeng Xu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Reasoning | GPQA Diamond | Accuracy | 83.6 | 135 |
| Circle packing | Circle Packing (n=26) | Sum of Radii | 2.6359 | 9 |
| Code Generation | LCB v6 | Throughput (Req/s) | 31.42 | 6 |
| Mathematics | AIME 25 | Throughput (Req/s) | 39.43 | 6 |
| Mathematics | HMMT25 | Throughput (Req/s) | 16.83 | 6 |
| Question Answering | GPQA Diamond | Throughput (Req/s) | 86.3 | 6 |
| Reasoning | ARC-AGI public evaluation set V2 | Accuracy | 97.5 | 6 |
| Mathematical Reasoning | HMMT 2025 (test) | Accuracy | 93.1 | 4 |
| Multimodal Vision Evaluation | MMMU-Pro | Accuracy | 79.06 | 2 |
| Code Generation | LiveCodeBench v6 | Accuracy | 75.6 | 2 |

Showing 10 of 11 rows.
