ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization
About
Recent research has leveraged large language model multi-agent systems for complex problem-solving while trying to reduce the manual effort required to build them, driving the development of automated agent workflow optimization methods. However, existing methods remain inflexible due to representational limitations, a lack of adaptability, and poor scalability when relying on discrete optimization techniques. We address these challenges with ScoreFlow, a simple yet high-performance framework that leverages efficient gradient-based optimization in a continuous space. ScoreFlow incorporates Score-DPO, a novel variant of the direct preference optimization method that accounts for quantitative feedback. Across six benchmarks spanning question answering, coding, and mathematical reasoning, ScoreFlow achieves an 8.2% improvement over existing baselines. Moreover, it empowers smaller models to outperform larger ones with lower inference costs. Project: https://github.com/Gen-Verse/ScoreFlow
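As a rough illustration of what "accounting for quantitative feedback" in a DPO-style objective could look like, the sketch below computes the standard DPO loss for one preference pair and then scales it by the gap between the two evaluation scores. The abstract does not say how Score-DPO uses the scores, so the weighting scheme, the function names, and the parameters `score_w` and `score_l` are assumptions made for illustration, not the paper's formulation.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (preferred, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and
    under the frozen reference model.
    """
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


def score_weighted_dpo_loss(policy_logp_w: float, policy_logp_l: float,
                            ref_logp_w: float, ref_logp_l: float,
                            score_w: float, score_l: float,
                            beta: float = 0.1) -> float:
    """Hypothetical score-aware variant (an assumption, not Score-DPO itself).

    Pairs whose evaluation scores differ more contribute more to the loss.
    """
    weight = score_w - score_l  # assumed: score_w >= score_l
    return weight * dpo_loss(policy_logp_w, policy_logp_l,
                             ref_logp_w, ref_logp_l, beta)
```

In this sketch, a pair with nearly equal scores contributes almost nothing to the loss, while a pair with a large score gap dominates it. That is one plausible way to pass graded feedback, rather than binary feedback, into preference optimization.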
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AIME 24 | Accuracy | 28.9 | 318 |
| Mathematical Reasoning | AIME 25 | Pass@1 Accuracy | 16.7 | 178 |
| Mathematical Reasoning | AIME 25 | Accuracy | 20 | 112 |
| Code Generation | LiveCodeBench | Accuracy | 25.9 | 84 |
| Reasoning | DROP | Score | 86.14 | 42 |
| Code Generation | CodeContests | Accuracy | 13.3 | 30 |
| Code Generation | APPS | Accuracy | 26.5 | 29 |
| Mathematical Reasoning | AIME 2024 and 2025 (test) | Overall Performance Rate | 57.14 | 18 |
| Scientific problem solving | SciBench | Pass@20 | 34.2 | 17 |
| Coding | MBPP | Solve Rate | 82.69 | 15 |