
ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data

About

Large Language Model (LLM) agents are rapidly improving to handle increasingly complex web-based tasks. Most of these agents rely on general-purpose, proprietary models like GPT-4 and focus on designing better prompts to improve their planning abilities. However, general-purpose LLMs are not specifically trained to understand specialized web contexts such as HTML, and they often struggle with long-horizon planning. We explore an alternative approach that fine-tunes open-source LLMs using production-scale workflow data collected from over 250 domains corresponding to 6 billion tokens. This simple yet effective approach shows substantial gains over prompting-based agents on existing benchmarks -- ScribeAgent achieves state-of-the-art direct generation performance on Mind2Web and improves the task success rate by 7.3% over the previous best text-only web agents on WebArena. We further perform detailed ablation studies on various fine-tuning design choices and provide insights into LLM selection, training recipes, context window optimization, and effect of dataset sizes.

Junhong Shen, Atishay Jain, Zedian Xiao, Ishan Amlekar, Mouad Hadji, Aaron Podolny, Ameet Talwalkar • 2024

Related benchmarks

Task            Dataset                              Metric                     Result  Rank
GUI Navigation  Multimodal-Mind2Web Cross-Website    Step Success Rate          32.5    32
GUI Navigation  Multimodal-Mind2Web Cross-Domain     Step Success Rate          37.3    27
GUI Navigation  Multimodal-Mind2Web Cross-Task       Step Success Rate          35.6    27
Web navigation  WebArena                             Overall Avg Success Rate   53      23
Web navigation  Multimodal-Mind2Web Average          Avg. Step Success Rate     35.1    14
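
The Multimodal-Mind2Web Average figure appears consistent with an unweighted mean of the three split scores (Cross-Website, Cross-Domain, Cross-Task). The listing does not state how the average is computed, so the simple mean used here is an assumption, and the helper name unweighted_mean is hypothetical. A minimal sketch of the check:

```python
# Step success rates (%) copied from the benchmark rows above.
split_scores = {
    "Cross-Website": 32.5,
    "Cross-Domain": 37.3,
    "Cross-Task": 35.6,
}


def unweighted_mean(scores):
    """Return the simple mean of the scores, rounded to one decimal place.

    Assumption: every split is weighted equally; the listing does not
    say whether the reported average is weighted by split size.
    """
    return round(sum(scores.values()) / len(scores), 1)


average = unweighted_mean(split_scores)
print(average)
```

If the reported average were weighted by the number of steps in each split, the result could differ slightly, so agreement here only shows that the equal-weight reading is plausible.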
