Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs
About
We study a class of optimization problems motivated by automating the design and update of AI systems like coding assistants, robots, and copilots. AutoDiff frameworks, like PyTorch, enable efficient end-to-end optimization of differentiable systems. However, general computational workflows can be non-differentiable and involve rich feedback (e.g., console output or users' responses), heterogeneous parameters (e.g., prompts, code), and intricate objectives (beyond maximizing a score). We investigate end-to-end generative optimization -- using generative models such as LLMs within the optimizer to automatically update general computational workflows. We discover that workflow execution traces are akin to back-propagated gradients in AutoDiff and can provide key information for interpreting feedback for efficient optimization. Formally, we frame a new mathematical setup, Optimization with Trace Oracle (OPTO). In OPTO, an optimizer receives an execution trace along with feedback on the computed output and updates parameters iteratively. We provide a Python library, Trace, that efficiently converts a workflow optimization problem into an OPTO instance using PyTorch-like syntax. Using Trace, we develop a general LLM-based generative optimizer called OptoPrime. In empirical studies, we find that OptoPrime is capable of first-order numerical optimization, prompt optimization, hyper-parameter tuning, robot controller design, code debugging, etc., and is often competitive with specialized optimizers for each domain. We envision Trace as an open research platform for devising novel generative optimizers and developing the next generation of interactive learning agents. Website: https://microsoft.github.io/Trace/.
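The OPTO loop described above can be sketched in plain Python. This is an illustrative toy, not the Trace library's actual API: the `Node` class, `trace_of`, and `opto_step` names are assumptions for exposition. It shows the key idea that each value records how it was computed (the execution trace), and the optimizer reads that trace plus output feedback to decide which parameters to update.

```python
# Toy sketch of the OPTO abstraction (NOT the Trace library's API):
# values flow through a recorded computation graph (the "trace"),
# and an optimizer uses the trace plus feedback to propose parameter updates.

class Node:
    """A value in the workflow; records how it was computed."""
    def __init__(self, value, op=None, parents=(), trainable=False):
        self.value = value
        self.op = op              # operation that produced this node
        self.parents = parents    # upstream nodes (edges of the trace DAG)
        self.trainable = trainable

def add(a, b):
    """An example traced operation."""
    return Node(a.value + b.value, op="add", parents=(a, b))

def trace_of(node):
    """Linearize the execution trace, analogous to the order AutoDiff
    visits nodes during back-propagation."""
    seen, order = set(), []
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for p in n.parents:
            visit(p)
        order.append(n)
    visit(node)
    return order

def opto_step(output, feedback):
    """A stand-in optimizer: inspect the trace and feedback, then update
    trainable parameters. A generative optimizer like OptoPrime would
    instead hand this trace-plus-feedback context to an LLM."""
    for n in trace_of(output):
        if n.trainable and feedback == "output too small":
            n.value += 1  # hardcoded here in place of an LLM-proposed update

# One OPTO iteration: execute the workflow, receive feedback, update.
theta = Node(1, trainable=True)
out = add(theta, Node(2))
opto_step(out, "output too small")
```

The point of the sketch is the interface: the optimizer never sees gradients, only the trace DAG and a rich (here, textual) feedback signal on the output, which is exactly the OPTO oracle the abstract formalizes.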
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Success Rate | 82.5 | 16 |
| Prompt Optimization | HotpotQA, IFBench, HoVer, PUPA, AIME, and LiveBench-Math 2018-2025 (test) | HotpotQA Score | 60.33 | 8 |
| LLM Workflow Optimization | BIG-Bench Hard (BBH) (test) | BBH Overall Accuracy | 78.6 | 6 |
| Question Answering | Google-proof QA | Success Rate | 59.6 | 4 |
| College Physics | MMLU College Physics 1.0 (test) | Success Rate | 94.1 | 4 |
| Counting | Big-Bench Hard Counting | Success Rate | 89.4 | 4 |
| Machine Learning | MMLU Machine Learning 1.0 (test) | Accuracy | 86.6 | 4 |
| Word Sorting | Big-Bench Hard Word Sorting | Success Rate | 71.6 | 4 |