Efficient Crawling for Scalable Web Data Acquisition (Extended Version)

About

Journalistic fact-checking, as well as social or economic research, requires analyzing high-quality statistics datasets (SDs, in short). However, retrieving SD corpora at scale may be hard, inefficient, or impossible, depending on how they are published online. To improve open statistics data accessibility, we present a focused Web crawling algorithm that retrieves as many targets, i.e., resources of certain types, as possible, from a given website, in an efficient and scalable way, by crawling (much) less than the full website. We show that optimally solving this problem is intractable, and propose an approach based on reinforcement learning, namely using sleeping bandits. We introduce SB-CLASSIFIER, a crawler that efficiently learns which hyperlinks lead to pages that link to many targets, based on the paths leading to the links in their enclosing webpages. Our experiments on websites with millions of webpages show that our crawler is highly efficient, delivering high fractions of a site's targets while crawling only a small part of the site.

Antoine Gauquier, Ioana Manolescu, Pierre Senellart • 2026
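
For readers unfamiliar with the sleeping-bandit setting mentioned in the abstract, the following is a minimal, generic sketch of a UCB-style bandit in which only some arms are available ("awake") at each round. It is not SB-CLASSIFIER and does not reproduce the authors' method. The class name, the exploration constant, and the idea of treating groups of similar links as arms with rewards normalized to [0, 1] are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class SleepingUCB:
    """Hypothetical UCB1-style bandit restricted to the arms awake each round."""

    def __init__(self, exploration=2.0):
        self.exploration = exploration   # assumed constant, not from the paper
        self.pulls = defaultdict(int)    # how many times each arm was chosen
        self.total = defaultdict(float)  # summed reward per arm
        self.rounds = 0                  # number of selections made so far

    def select(self, awake_arms):
        """Return one arm from awake_arms (a list of hashable arm labels)."""
        if not awake_arms:
            raise ValueError("no arm is awake")
        self.rounds += 1
        # Try each never-pulled awake arm once before trusting any estimate.
        for arm in awake_arms:
            if self.pulls[arm] == 0:
                return arm

        def score(arm):
            mean = self.total[arm] / self.pulls[arm]
            bonus = math.sqrt(
                self.exploration * math.log(self.rounds) / self.pulls[arm]
            )
            return mean + bonus

        # Sleeping arms are simply absent from awake_arms, so they cannot win.
        return max(awake_arms, key=score)

    def update(self, arm, reward):
        """Record a reward, assumed to be normalized to [0, 1]."""
        self.pulls[arm] += 1
        self.total[arm] += reward


# Usage: arms stand for hypothetical groups of links; an arm is awake only
# while the crawl frontier still holds an unvisited link from that group.
bandit = SleepingUCB()
arm = bandit.select(["nav-menu", "data-listing"])
bandit.update(arm, 0.0)  # e.g. the fetched page linked to no targets
```

In a crawler, a group of links goes to sleep once all of its known links have been visited, which is why the set of selectable arms changes from round to round.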

Related benchmarks

Task | Dataset | Metric | Result | Rank
Targeted Web Crawling | Website | Target Retrieval Success (90%) | 74.4 | 7
Target Retrieval | Websites (18 distinct) | be | 29.5 | 7
Targeted Web Crawling | Website be | Retrieval Success Rate (90%) | 75.7 | 7
Targeted Web Crawling | Website cn | Retrieval Success Rate (90% Target) | 70.9 | 7
Targeted Web Crawling | Website ed | Retrieval Success Rate (90% Target) | 51.5 | 7
Targeted Web Crawling | Website is | Target Retrieval Success Rate (90%) | 76 | 7
Targeted Web Crawling | Website ju | Retrieval Success Rate (90%) | 35.8 | 7
Targeted Web Crawling | Website nc | Target Retrieval Success Rate (90%) | 51.6 | 7
Targeted Web Crawling | Website oe | Retrieval Success Rate (90% Target) | 59.2 | 7
Targeted Web Crawling | Website ok | Target Retrieval 90% Coverage | 15.5 | 7
Showing 10 of 19 rows
