
TextGrad: Automatic "Differentiation" via Text

About

AI is undergoing a paradigm shift, with breakthroughs achieved by systems orchestrating multiple large language models (LLMs) and other complex components. As a result, developing principled and automated optimization methods for compound AI systems is one of the most important new challenges. Neural networks faced a similar challenge in their early days, until backpropagation and automatic differentiation transformed the field by making optimization turn-key. Inspired by this, we introduce TextGrad, a powerful framework that performs automatic "differentiation" via text. TextGrad backpropagates textual feedback provided by LLMs to improve individual components of a compound AI system. In our framework, LLMs provide rich, general, natural-language suggestions to optimize variables in computation graphs, ranging from code snippets to molecular structures. TextGrad follows PyTorch's syntax and abstraction and is flexible and easy to use. It works out of the box for a variety of tasks, where users provide only the objective function, without tuning components or prompts of the framework. We showcase TextGrad's effectiveness and generality across a diverse range of applications, from question answering and molecule optimization to radiotherapy treatment planning. Without modifying the framework, TextGrad improves the zero-shot accuracy of GPT-4o in Google-Proof Question Answering from 51% to 55%, yields a 20% relative performance gain in optimizing LeetCode-Hard coding problem solutions, improves prompts for reasoning, designs new druglike small molecules with desirable in silico binding, and designs radiation oncology treatment plans with high specificity. TextGrad lays a foundation to accelerate the development of the next generation of AI systems.
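The abstract describes a PyTorch-like abstraction in which variables accumulate textual "gradients" (LLM feedback) that an optimizer then applies. The sketch below illustrates that idea only; all names (`Variable`, `MockEngine`, `backward`, `step`) are hypothetical and do not reflect the actual TextGrad API, and the canned `MockEngine` stands in for a real LLM critique call.

```python
# Minimal sketch of textual "autograd": a Variable holds text, a critique
# engine produces textual feedback, and an optimizer step applies it.
# All names here are illustrative assumptions, NOT the TextGrad library API.

class Variable:
    """Holds a text value and accumulates textual 'gradients' (feedback)."""
    def __init__(self, value, role, requires_grad=True):
        self.value = value
        self.role = role
        self.requires_grad = requires_grad
        self.gradients = []

class MockEngine:
    """Stands in for an LLM: critiques a variable against an objective."""
    def feedback(self, variable, objective):
        return f"To better satisfy '{objective}', revise: '{variable.value}'"

def backward(variables, engine, objective):
    """Propagate textual feedback from the objective to each variable."""
    for v in variables:
        if v.requires_grad:
            v.gradients.append(engine.feedback(v, objective))

def step(variable):
    """'Optimizer' update: fold accumulated feedback into the value.
    (A real system would ask an LLM to rewrite the value instead.)"""
    applied = " | ".join(variable.gradients)
    variable.value = f"{variable.value} [revised per: {applied}]"
    variable.gradients.clear()

prompt = Variable("Answer concisely.", role="system prompt")
engine = MockEngine()
backward([prompt], engine, objective="maximize accuracy")
step(prompt)
print(prompt.value)
```

In the real framework, the mock feedback and string-concatenation update would both be LLM calls, but the control flow (forward, backward, optimizer step) follows the same PyTorch-style pattern the abstract points to.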

Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, James Zou • 2024

Related benchmarks

Task                                         Dataset                          Result                 Rank
Mathematical Reasoning                       GSM8K (test)                     Accuracy 74.8          751
Mathematical Reasoning                       GSM8K                            Accuracy 81.1          212
Natural Language Inference                   SNLI                             Accuracy 93            174
Commonsense Reasoning                        StrategyQA (test)                Accuracy 68.8          81
Safety Evaluation                            UnsafeBench                      F1 Score 78            24
Prompt Optimization                          Prompt Optimization Benchmark    Accuracy 41.3          24
Preference Classification                    Anthropic HH Harmless (test)     Accuracy 70.9          22
Mathematical Reasoning                       GSM8K                            Success Rate 82.4      16
Harmlessness preference labeling accuracy    SafeRLHF-RMB (test)              Bench Accuracy 70.3    15
Visual faithfulness evaluation               SeeTRUE                          F1 Score 80            15
Showing 10 of 46 rows

Other info

Code
