REVOLVE: Optimizing AI Systems by Tracking Response Evolution in Textual Optimization

About

Recent advancements in large language models (LLMs) have significantly enhanced the ability of LLM-based systems to perform complex tasks through natural language processing and tool interaction. However, optimizing these LLM-based systems for specific tasks remains challenging, often requiring manual interventions like prompt engineering and hyperparameter tuning. Existing automatic optimization methods, such as textual feedback-based techniques (e.g., TextGrad), tend to focus on immediate feedback, analogous to using immediate derivatives in traditional numerical gradient descent. However, relying solely on such feedback can be limited when the adjustments made in response to this feedback are either too small or fluctuate irregularly, potentially slowing down or even stalling the optimization process. To overcome these challenges, more adaptive methods are needed, especially in situations where the system's response is evolving slowly or unpredictably. In this paper, we introduce REVOLVE, an optimization method that tracks how "R"esponses "EVOLVE" across iterations in LLM systems. By focusing on the evolution of responses over time, REVOLVE enables more stable and effective optimization by making thoughtful, progressive adjustments at each step. Experimental results demonstrate that REVOLVE outperforms competitive baselines, achieving a 7.8% improvement in prompt optimization, a 20.72% gain in solution refinement, and a 29.17% increase in code optimization. Additionally, REVOLVE converges in fewer iterations, resulting in significant computational savings. Beyond its practical contributions, REVOLVE highlights a promising direction, where the rich knowledge from established optimization principles can be leveraged to enhance LLM systems, which paves the way for further advancements in this hybrid domain.
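The abstract compares immediate textual feedback to using only the current derivative in numerical gradient descent. The sketch below illustrates that numerical analogy only; it is not REVOLVE's textual procedure, and the objective f(x) = x**4, the learning rate, the step count, and both function names are assumptions chosen for illustration. It contrasts plain gradient descent, which stalls on a flat objective, with a secant-style update that also uses how the derivative changed between consecutive iterations.

```python
def grad(x):
    """Derivative of f(x) = x**4, which is very flat near its minimum at 0."""
    return 4.0 * x ** 3


def immediate_feedback(x0, lr=0.01, steps=50):
    """Plain gradient descent: each update uses only the current derivative."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


def evolution_aware(x0, lr=0.01, steps=50):
    """Secant-style update: also uses how the derivative changed since the last step."""
    x_prev, g_prev = x0, grad(x0)
    x = x0 - lr * g_prev  # the first step has no history, so use plain descent
    for _ in range(steps - 1):
        g = grad(x)
        dg = g - g_prev
        if dg == 0.0:
            # No observable change in the derivative: fall back to plain descent.
            x_next = x - lr * g
        else:
            # Scale the step by the observed change between the last two iterates.
            x_next = x - g * (x - x_prev) / dg
        x_prev, g_prev, x = x, g, x_next
    return x


x_plain = immediate_feedback(1.0)
x_aware = evolution_aware(1.0)
print(f"immediate feedback only: x = {x_plain:.4f}")
print(f"evolution-aware:         x = {x_aware:.2e}")
```

With the same budget of 50 updates, the history-aware variant ends much closer to the minimum at 0 than plain descent does, which mirrors the abstract's point that feedback from a single step can slow or stall optimization when adjustments are small.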

Peiyan Zhang, Haibo Jin, Leyang Hu, Xinnuo Li, Liying Kang, Man Luo, Yangqiu Song, Haohan Wang • 2024

Related benchmarks

Task	Dataset	Result
Arithmetic Reasoning	MultiArith (test)	Accuracy 96.2	136
Arithmetic Reasoning	SVAMP (test)	Accuracy 89.2	84
Logical Deduction	BBH Logical Deduction (Seven Objects) (test)	Accuracy 50.6	22
Tracking Shuffled Objects (Seven Objects)	BBH (test)	Accuracy 87.7	20
Tracking Shuffled Objects	Tracking Shuffled Objects 5 objects (test)	Accuracy (TSO 5-obj) 86.8	16
Logical Deduction	Logical Deduction 5 objects (test)	Accuracy 57.7	16
