Auto-GPT for Online Decision Making: Benchmarks and Additional Opinions

About

Auto-GPT is an autonomous agent that leverages recent advancements in adapting Large Language Models (LLMs) for decision-making tasks. While there has been growing interest in Auto-GPT styled agents, questions remain regarding their effectiveness and flexibility in solving real-world decision-making tasks. Their limited capability for real-world engagement and the absence of benchmarks contribute to these uncertainties. In this paper, we present a comprehensive benchmark study of Auto-GPT styled agents in decision-making tasks that simulate real-world scenarios. Our aim is to gain deeper insights into this problem and understand the adaptability of GPT-based agents. We compare the performance of popular LLMs such as GPT-4, GPT-3.5, Claude, and Vicuna in Auto-GPT styled decision-making tasks. Furthermore, we introduce the Additional Opinions algorithm, an easy and effective method that incorporates supervised/imitation-based learners into the Auto-GPT scheme. This approach enables lightweight supervised learning without requiring fine-tuning of the foundational LLMs. Through careful baseline comparisons and ablation studies, we demonstrate that the Additional Opinions algorithm significantly enhances performance on online decision-making benchmarks, including WebShop and ALFWorld.
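The abstract's core idea can be illustrated with a minimal sketch: a lightweight supervised/imitation learner ranks candidate actions, and its top-k suggestions are injected into the agent's prompt as non-binding "additional opinions" that the LLM may accept or override. All function names, the prompt wording, and the toy scorer below are illustrative assumptions, not the paper's actual API.

```python
# Hedged sketch of the "Additional Opinions" idea: a small expert model
# proposes actions, and its suggestions are appended to the LLM prompt.
# Names and prompt text are hypothetical, not taken from the paper.

def expert_opinions(observation, candidates, score_fn, k=2):
    """Rank candidate actions with a (stand-in) imitation-learned scorer
    and return the top-k as opinions."""
    ranked = sorted(candidates, key=lambda a: score_fn(observation, a), reverse=True)
    return ranked[:k]

def build_prompt(observation, candidates, score_fn, k=2):
    """Compose an Auto-GPT-style prompt that includes the expert's opinions;
    the LLM remains free to disagree with them."""
    opinions = expert_opinions(observation, candidates, score_fn, k)
    lines = [
        f"Observation: {observation}",
        "Available actions: " + ", ".join(candidates),
        "Additional opinions (suggestions from a lightweight expert, "
        "you may disagree): " + ", ".join(opinions),
        "Choose the best action and explain your reasoning.",
    ]
    return "\n".join(lines)

# Toy keyword-overlap scorer standing in for a trained imitation model.
toy_score = lambda obs, action: sum(w in action for w in obs.split())

prompt = build_prompt(
    "buy red shoes under $50",
    ["click[red shoes]", "click[blue hat]", "search[red shoes]"],
    toy_score,
)
print(prompt)
```

Because the opinions ride along in the prompt rather than constraining decoding, the expert can be retrained cheaply on task data without touching the foundational LLM, which is the "lightweight supervised learning" the abstract describes.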

Hui Yang, Sifu Yue, Yunzhong He • 2023

Related benchmarks

| Task | Dataset | Metric | Value | Rank |
| --- | --- | --- | --- | --- |
| Safety Evaluation | Safe Tasks | JDR-R1 | 66.8 | 12 |
| Safety Evaluation | Dangerous Tasks | JDR-R1 | 7.41e+3 | 12 |
| Diplomacy | Diplomacy Press setting | Win Rate | 0.026 | 9 |
| Scientific Reasoning | Scientific Reasoning Subset A | ROUGE-L | 11.6 | 8 |
| Hardware Execution | Subset B Hardware Execution | Scode | 0.54 | 7 |
| Long-Horizon Stability | BioProBench Subset C (test) | Success Rate | 66.7 | 4 |
| Error Correction | BioProBench Subset D (test) | Seq Acc | 0.00e+0 | 4 |
