AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation

About

Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts. Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website, while language agents, empowered by large language models (LLMs), exhibit poor reusability in diverse web environments. In this work, we introduce the paradigm of generating web scrapers with LLMs and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently. AutoScraper leverages the hierarchical structure of HTML and the similarity across different web pages to generate web scrapers. In addition, we propose a new executability metric to better measure performance on web scraper generation tasks. We conduct comprehensive experiments with multiple LLMs and demonstrate the effectiveness of our framework. Resources for this paper can be found at https://github.com/EZ-hwh/AutoScraper

Wenhao Huang, Zhouhong Gu, Chenghao Peng, Zhixu Li, Jiaqing Liang, Yanghua Xiao, Liqian Wen, Zulong Chen • 2024
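
To make the idea of a reusable scraper concrete, here is a minimal sketch of one extraction rule applied across several pages that share a template. It is an illustration only and not AutoScraper's implementation: the sample pages, the XPath rule, and the helper names `run_scraper` and `success_rate` are all hypothetical, and the success rate computed here is a simple stand-in, not the paper's executability metric.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages built from the same template. They are kept well-formed
# so the standard-library XML parser can read them; real HTML would need a
# more tolerant parser.
PAGES = [
    "<html><body><div class='item'><h1>Widget A</h1><span class='price'>$10</span></div></body></html>",
    "<html><body><div class='item'><h1>Widget B</h1><span class='price'>$12</span></div></body></html>",
    "<html><body><div class='item'><h1>Widget C</h1></div></body></html>",
]

# In this sketch a "scraper" is just one reusable XPath rule (an assumption).
PRICE_RULE = ".//div[@class='item']/span[@class='price']"


def run_scraper(rule, page):
    """Apply the rule to one page; return the matched text or None."""
    root = ET.fromstring(page)
    node = root.find(rule)
    return node.text if node is not None else None


def success_rate(rule, pages):
    """Run the same rule on every page and report the share that matched."""
    results = [run_scraper(rule, page) for page in pages]
    hits = sum(result is not None for result in results)
    return results, hits / len(pages)


if __name__ == "__main__":
    results, rate = success_rate(PRICE_RULE, PAGES)
    print(results, rate)
```

The third page lacks a price element, so the rule returns `None` there. A rule generated once and reused across similar pages trades per-page LLM calls for this kind of brittleness, which is why measuring how often a generated scraper actually runs successfully matters.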

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Web Information Extraction | LIVEWEB-IE Type I 1.0 (test) | Precision | 53.23 | 33 |
| Web Information Extraction | LIVEWEB-IE Type II 1.0 (test) | Precision | 0.4193 | 33 |
| Web Information Extraction | LIVEWEB-IE Type IV 1.0 (test) | Precision | 14.36 | 33 |
| Web Information Extraction | LIVEWEB-IE Overall 1.0 (test) | Precision | 28.94 | 33 |
| Web Information Extraction | LIVEWEB-IE Type III 1.0 (test) | Precision | 13.95 | 33 |
| Web Information Extraction | SWDE expanded (test) | Precision | 94.87 | 32 |
| Web Information Extraction | SWDE original (test) | Precision | 94.06 | 32 |