Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

OctoT2I: A Self-Evolving Agentic Text-to-Image Router

About

The explosive growth of Text-to-Image (T2I) models, from large-scale versions to lightweight, real-time ones, now faces diminishing marginal returns from single-model scaling. Agentic T2I methods emerged to alleviate this bottleneck by using multiple models. However, existing agentic T2I methods suffer from three key challenges: reliance on expensive handcrafted priors or human annotations, rigid single-path decision mechanisms, and a neglect of inference efficiency. To address these challenges, we introduce OctoT2I, a novel agentic framework that reformulates the T2I task as a joint optimization of generation quality and inference efficiency. OctoT2I implements a stateful, multi-round routing strategy that adaptively selects the most suitable tool based on its knowledge and memory. This strategy is enabled by a knowledge base built from scratch by our novel Self-Evolving Mechanism. This mechanism, which requires no human supervision, first autonomously defines foundational Conceptual Dimensions (eg, style, color, count) and then intelligently explores their combinations via an iterative" Propose--Solve--Evaluate--Learn"(PSEL) loop. The PSEL loop efficiently discovers each tool's capability frontier, driving continuous improvement without external guidance. Extensive experiments demonstrate that OctoT2I achieves competitive performance (0.96) on GenEval while delivering a 90.3% inference speedup and a 56.6% energy-efficiency gain over the leading baseline (Flow-GRPO), striking an exceptional balance between performance and efficiency. Code and models will be made available.

Xu Jiang, Bin Chen, Gehui Li, Yule Duan, Ronggang Wang, Jian Zhang• 2026

Related benchmarks

TaskDatasetResultRank
Text-to-Image GenerationGenEval
Overall Score96
704
Text-to-Image GenerationT2I-CompBench++
Color0.8344
95
Text-to-Image GenerationWISE
Culture Score0.51
51
Text-to-Image GenerationGenEval
Average Generation Time (s)10.02
6
Text-to-Image Generation30 Real-world user prompts
Total Votes634
2
Showing 5 of 5 rows

Other info

Follow for update