AXE: Low-Cost Cross-Domain Web Structured Information Extraction

About

Extracting structured data from the web often forces a trade-off between brittle manual heuristics and the prohibitive cost of Large Language Models. We introduce AXE (Adaptive X-Path Extractor), a pipeline that treats the HTML DOM as a tree to be pruned rather than a wall of text to be read. AXE uses a specialized pruning mechanism to strip away boilerplate and irrelevant nodes, leaving a distilled, high-density context from which a tiny 0.6B LLM can generate precise, structured outputs. To keep the model honest, we implement Grounded XPath Resolution (GXR), which ensures that every extraction is traceable to a source node. Despite its small footprint, AXE achieves state-of-the-art zero-shot performance with an F1 score of 88.1% on the SWDE dataset, outperforming several much larger, fully trained alternatives. By releasing our specialized adaptors, we aim to provide a practical, cost-effective path for large-scale web information extraction.
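The two ideas in the abstract, pruning the DOM before the model reads it and grounding each extracted value in a source node, can be illustrated with a small sketch. This is not the authors' implementation: the fixed tag blacklist stands in for AXE's pruning mechanism, the hard-coded candidate strings stand in for the output of the 0.6B LLM, and the names `BOILERPLATE_TAGS`, `prune`, and `xpath_of_text` are hypothetical. It also assumes well-formed markup so that the standard-library XML parser can read it.

```python
import xml.etree.ElementTree as ET

# Assumption: a hand-written blacklist, used here only as a stand-in
# for AXE's pruning mechanism, which the abstract does not specify.
BOILERPLATE_TAGS = {"script", "style", "nav", "footer", "header"}


def prune(node):
    """Remove boilerplate subtrees from `node` in place."""
    for child in list(node):
        if child.tag in BOILERPLATE_TAGS:
            node.remove(child)
        else:
            prune(child)


def xpath_of_text(root, value):
    """Return a positional XPath for the first element whose own text
    equals `value`, or None if no element in the tree contains it."""
    def walk(node, path):
        if (node.text or "").strip() == value:
            return path
        counts = {}  # per-tag sibling index, 1-based as in XPath
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            found = walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
            if found:
                return found
        return None
    return walk(root, f"/{root.tag}")


PAGE = """<html><body>
<nav><a>Home</a></nav>
<div><h1>Acme Laptop</h1><span>$999</span></div>
<footer>Copyright</footer>
</body></html>"""

original = ET.fromstring(PAGE)   # kept intact for grounding
pruned = ET.fromstring(PAGE)
prune(pruned)

# The pruned text is what a small model would be asked to read.
context = "".join(pruned.itertext())

# Hypothetical model outputs: one real value and one hallucinated value.
grounded = xpath_of_text(original, "$999")
rejected = xpath_of_text(original, "$1099")
```

Resolving candidates against the unpruned tree keeps the returned path valid for the page as served, and a value that appears nowhere in the DOM resolves to `None`, so it can be discarded rather than emitted.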

Abdelrahman Mansour, Khaled W. Alshaer, Moataz Elsaban • 2026

Related benchmarks

Task                            Dataset                  Result         Rank
Question Answering              WebSRC (dev)             EM 80.06       26
Question Answering              WebSRC (test)            EM 67.6        17
Structured Web Data Extraction  SWDE all domains (test)  F1 Score 88.1  10
