Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs
About
We study a class of optimization problems motivated by automating the design and update of AI systems like coding assistants, robots, and copilots. AutoDiff frameworks, like PyTorch, enable efficient end-to-end optimization of differentiable systems. However, general computational workflows can be non-differentiable and involve rich feedback (e.g., console output or users' responses), heterogeneous parameters (e.g., prompts, code), and intricate objectives (beyond maximizing a score). We investigate end-to-end generative optimization -- using generative models such as LLMs within the optimizer to automatically update general computational workflows. We discover that workflow execution traces are akin to back-propagated gradients in AutoDiff and can provide key information for interpreting feedback for efficient optimization. Formally, we frame a new mathematical setup, Optimization with Trace Oracle (OPTO). In OPTO, an optimizer receives an execution trace along with feedback on the computed output and updates parameters iteratively. We provide a Python library, Trace, that efficiently converts a workflow optimization problem into an OPTO instance using PyTorch-like syntax. Using Trace, we develop a general LLM-based generative optimizer called OptoPrime. In empirical studies, we find that OptoPrime is capable of first-order numerical optimization, prompt optimization, hyper-parameter tuning, robot controller design, code debugging, etc., and is often competitive with specialized optimizers for each domain. We envision Trace as an open research platform for devising novel generative optimizers and developing the next generation of interactive learning agents. Website: https://microsoft.github.io/Trace/.
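The OPTO loop described above can be sketched in plain Python. This is an illustrative toy, not the Trace library's actual API: the `Node` class, `trace_of`, and `opto_step` names are assumptions for exposition. It shows the key idea that each value records how it was computed (the execution trace), and the optimizer reads that trace plus output feedback to decide which parameters to update.

```python
# Toy sketch of the OPTO abstraction (NOT the Trace library's API):
# values flow through a recorded computation graph (the "trace"),
# and an optimizer uses the trace plus feedback to propose parameter updates.

class Node:
    """A value in the workflow; records how it was computed."""
    def __init__(self, value, op=None, parents=(), trainable=False):
        self.value = value
        self.op = op              # operation that produced this node
        self.parents = parents    # upstream nodes (edges of the trace DAG)
        self.trainable = trainable

def add(a, b):
    """An example traced operation."""
    return Node(a.value + b.value, op="add", parents=(a, b))

def trace_of(node):
    """Linearize the execution trace, analogous to the order AutoDiff
    visits nodes during back-propagation."""
    seen, order = set(), []
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for p in n.parents:
            visit(p)
        order.append(n)
    visit(node)
    return order

def opto_step(output, feedback):
    """A stand-in optimizer: inspect the trace and feedback, then update
    trainable parameters. A generative optimizer like OptoPrime would
    instead hand this trace-plus-feedback context to an LLM."""
    for n in trace_of(output):
        if n.trainable and feedback == "output too small":
            n.value += 1  # hardcoded here in place of an LLM-proposed update

# One OPTO iteration: execute the workflow, receive feedback, update.
theta = Node(1, trainable=True)
out = add(theta, Node(2))
opto_step(out, "output too small")
```

The point of the sketch is the interface: the optimizer never sees gradients, only the trace DAG and a rich (here, textual) feedback signal on the output, which is exactly the OPTO oracle the abstract formalizes.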
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Success Rate | 82.5 | 16 |
| Prompt Optimization | HotpotQA, IFBench, HoVer, PUPA, AIME, and LiveBench-Math 2018-2025 (test) | HotpotQA Score | 60.33 | 8 |
| LLM Workflow Optimization | BIG-Bench Hard (BBH) (test) | BBH Overall Accuracy | 78.6 | 6 |
| Question Answering | Google-proof QA | Success Rate | 59.6 | 4 |
| College Physics | MMLU College Physics 1.0 (test) | Success Rate | 94.1 | 4 |
| Counting | Big-Bench Hard Counting | Success Rate | 89.4 | 4 |
| Machine Learning | MMLU Machine Learning 1.0 (test) | Accuracy | 86.6 | 4 |
| Word Sorting | Big-Bench Hard Word Sorting | Success Rate | 71.6 | 4 |