
Textual Equilibrium Propagation for Deep Compound AI Systems

About

Large language models (LLMs) are increasingly deployed as part of compound AI systems that coordinate multiple modules (e.g., retrievers, tools, verifiers) over long-horizon workflows. Recent approaches that propagate textual feedback globally (e.g., TextGrad) make it feasible to optimize such pipelines, but we find that performance degrades as system depth grows. In particular, long-horizon agentic workflows exhibit two depth-scaling failure modes: 1) exploding textual gradients, where textual feedback grows exponentially with depth, producing prohibitively long messages and amplifying evaluation biases; and 2) vanishing textual gradients, where limited long-context ability causes models to overemphasize partial feedback, and compression of lengthy feedback causes downstream messages to gradually lose specificity as they propagate many hops upstream. To mitigate these issues, we introduce Textual Equilibrium Propagation (TEP), a local learning principle inspired by Equilibrium Propagation in energy-based models. TEP comprises two phases: 1) a free phase, in which local LLM critics iteratively refine prompts until reaching equilibrium (no further improvements are suggested); and 2) a nudged phase, which applies proximal prompt edits with bounded modification intensity, using task-level objectives that propagate via forward signaling rather than backward feedback chains. This design supports local prompt optimization followed by controlled adaptation toward global goals, without the computational burden and signal degradation of global textual backpropagation. Across long-horizon QA benchmarks and a multi-agent tool-use dataset, TEP consistently improves accuracy and efficiency over global propagation methods such as TextGrad. The gains grow with depth, while preserving the practicality of black-box LLM components in deep compound AI systems.
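The two-phase loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are assumptions, and the LLM critic is mocked with a rule-based function so the example runs without any model calls.

```python
def mock_critic(prompt):
    """Stand-in for a local LLM critic: returns a refined prompt,
    or None when it has no further improvement to suggest."""
    if "cite sources" not in prompt:
        return prompt + " Always cite sources."
    return None  # equilibrium reached: no further suggestions


def free_phase(prompt, critic, max_iters=10):
    """Free phase: iterate local critic refinements until equilibrium
    (the critic stops suggesting changes) or an iteration budget is hit."""
    for _ in range(max_iters):
        suggestion = critic(prompt)
        if suggestion is None:
            break
        prompt = suggestion
    return prompt


def nudged_phase(prompt, task_feedback, max_edit_chars=40):
    """Nudged phase: apply a proximal edit toward the task-level objective,
    with bounded modification intensity (here, a simple character budget
    standing in for a bounded edit distance)."""
    edit = task_feedback[:max_edit_chars]
    return prompt + " " + edit


# Task-level feedback arrives via forward signaling, not a backward chain.
prompt = "Answer the question concisely."
prompt = free_phase(prompt, mock_critic)
prompt = nudged_phase(prompt, "Prefer multi-hop reasoning.")
print(prompt)
```

In a real pipeline each module would run its own free phase locally, so no textual gradient has to traverse the full depth of the system; only the bounded nudge carries the global objective.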

Minghui Chen, Wenlong Deng, James Zou, Han Yu, Xiaoxiao Li • 2026

Related benchmarks

Task                  | Dataset                   | Metric    | Result | Rank
Question Answering    | PubMedQA (test)           | Accuracy  | 62.02  | 81
Complex Retrieval     | STARK-PRIME (test)        | MRR       | 42.72  | 6
RAG                   | HotpotQA (test)           | F1 Score  | 48.72  | 6
Verified Code Gen.    | BigCodeBench (test)       | Pass Rate | 38.97  | 6
Solution Optimization | GPQA                      | Accuracy  | 44.5   | 4
Solution Optimization | Object Counting (adapted) | Accuracy  | 81.6   | 4
