In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

About

Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.
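The credit-assignment idea described above (one trajectory-level outcome broadcast to every turn, with advantages normalized within a group of rollouts) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `broadcast_group_advantages`, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration; the abstract does not specify the normalization details.

```python
from statistics import mean, pstdev


def broadcast_group_advantages(outcomes, turns_per_rollout, eps=1e-8):
    """Return one advantage per turn for each rollout in a group.

    outcomes: final trajectory-level rewards, one per rollout (e.g. 1.0 if
        the final answer was verified correct, else 0.0).
    turns_per_rollout: number of planner turns taken in each rollout.
    eps: small constant to avoid division by zero (an assumption).
    """
    if len(outcomes) != len(turns_per_rollout):
        raise ValueError("need one turn count per outcome")

    # Group normalization: compare each rollout against its own group.
    mu = mean(outcomes)
    sigma = pstdev(outcomes)
    advantages = [(r - mu) / (sigma + eps) for r in outcomes]

    # Broadcast: every turn in a rollout shares that rollout's advantage,
    # so each turn can be treated as its own single-turn update.
    return [[a] * n for a, n in zip(advantages, turns_per_rollout)]


if __name__ == "__main__":
    # Hypothetical group of four rollouts for the same query.
    group = broadcast_group_advantages([1.0, 0.0, 0.0, 1.0], [3, 2, 4, 1])
    for per_turn in group:
        print(per_turn)
```

In this sketch a successful rollout with three turns yields three identical positive advantages, while a failed rollout yields identical negative ones. If every rollout in the group receives the same outcome, all advantages are zero and the group contributes no learning signal.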

Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multi-hop Question Answering | 2WikiMultihopQA | -- | 559 |
| Single-hop Question Answering | PopQA | -- | 186 |
| Single-hop Question Answering | TriviaQA | -- | 133 |
| Tool Use | ToolBench | Average Pass Rate: 48.5 | 53 |
| Travel Planning | TravelPlanner | Average Tokens Used: 16.2 | 46 |
| Code Generation | HumanEval OOD | Pass@1: 93.75 | 39 |
| Broad Information Seeking | WideSearch | Item F1 (Avg@4): 28.7 | 34 |
| Question Answering | HotpotQA In-Distribution | F1 Score: 90.11 | 23 |
| Question Answering | GAIA | Accuracy (Pass@4): 7.09 | 22 |
| Tool Use | Evaluation Dataset | Accuracy: 48.23 | 20 |
Showing 10 of 33 rows
