ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data

About

Large Language Model (LLM) agents are rapidly improving to handle increasingly complex web-based tasks. Most of these agents rely on general-purpose, proprietary models like GPT-4 and focus on designing better prompts to improve their planning abilities. However, general-purpose LLMs are not specifically trained to understand specialized web contexts such as HTML, and they often struggle with long-horizon planning. We explore an alternative approach that fine-tunes open-source LLMs using production-scale workflow data collected from over 250 domains corresponding to 6 billion tokens. This simple yet effective approach shows substantial gains over prompting-based agents on existing benchmarks -- ScribeAgent achieves state-of-the-art direct generation performance on Mind2Web and improves the task success rate by 7.3% over the previous best text-only web agents on WebArena. We further perform detailed ablation studies on various fine-tuning design choices and provide insights into LLM selection, training recipes, context window optimization, and effect of dataset sizes.

Junhong Shen, Atishay Jain, Zedian Xiao, Ishan Amlekar, Mouad Hadji, Aaron Podolny, Ameet Talwalkar • 2024

Related benchmarks

Task	Dataset	Result
Web navigation	WebArena	--	55
GUI Navigation	Multimodal-Mind2Web Cross-Website	Step Success Rate: 32.5	37
GUI Navigation	Multimodal-Mind2Web Cross-Domain	Step Success Rate: 37.3	32
GUI Navigation	Multimodal-Mind2Web Cross-Task	Step Success Rate: 35.6	32
Web navigation	Multimodal-Mind2Web Cross-Website	Element Accuracy: 34.1	15
Web navigation	Multimodal-Mind2Web Cross-Domain	Element Accuracy: 39.4	15
Web navigation	Multimodal-Mind2Web Cross-Task	Element Accuracy: 38	15
Web navigation	Multimodal-Mind2Web Average	Avg. Step Success Rate: 35.1	14
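
The average step success rate in the last row is consistent with an unweighted mean of the three Multimodal-Mind2Web splits listed in the table. The page does not say how the average is computed, so the sketch below is only a sanity check under that assumption; the names `step_success` and `macro_average` are illustrative and do not come from the paper or the site.

```python
# Step success rates (%) copied from the three split rows of the table.
step_success = {
    "Cross-Website": 32.5,
    "Cross-Domain": 37.3,
    "Cross-Task": 35.6,
}


def macro_average(scores):
    """Unweighted mean over splits, rounded to one decimal place.

    Assumption: every split counts equally, regardless of how many
    tasks or steps it contains.
    """
    return round(sum(scores.values()) / len(scores), 1)


average = macro_average(step_success)
print(average)
```

If the splits were instead weighted by their number of steps, the result could differ slightly from the reported value.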

