AXE: Low-Cost Cross-Domain Web Structured Information Extraction

About

Extracting structured data from the web is often a trade-off between the brittle nature of manual heuristics and the prohibitive cost of Large Language Models. We introduce AXE (Adaptive X-Path Extractor), a pipeline that rethinks this process by treating the HTML DOM as a tree that needs pruning rather than just a wall of text to be read. AXE uses a specialized "pruning" mechanism to strip away boilerplate and irrelevant nodes, leaving behind a distilled, high-density context that allows a tiny 0.6B LLM to generate precise, structured outputs. To keep the model honest, we implement Grounded XPath Resolution (GXR), ensuring every extraction is physically traceable to a source node. Despite its low footprint, AXE achieves state-of-the-art zero-shot performance, outperforming several much larger, fully-trained alternatives with an F1 score of 88.1% on the SWDE dataset. By releasing our specialized adaptors, we aim to provide a practical, cost-effective path for large-scale web information extraction. Our code and adaptors are publicly available at https://github.com/abdo-Mansour/axetract.

Abdelrahman Mansour, Khaled W. Alshaer, Moataz Elsaban • 2026
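
The two ideas in the abstract, pruning the DOM before the model reads it and grounding every extracted value in a source node, can be illustrated with a small sketch. This is not the AXE implementation: the tag blocklist, the function names (`xpath_index`, `prune`, `ground`), and the exact-match grounding rule are all assumptions made for illustration, and the input is assumed to be well-formed markup so that Python's standard `xml.etree.ElementTree` can parse it. In AXE a learned pruner and a 0.6B LLM fill the roles that simple rules and a hard-coded candidate value play here.

```python
import xml.etree.ElementTree as ET

# Hypothetical blocklist; AXE's real pruner is learned, not rule-based.
BOILERPLATE_TAGS = {"script", "style", "nav", "footer", "header", "aside"}

def xpath_index(root):
    """Map every element to a positional XPath in the ORIGINAL tree."""
    paths = {root: "/" + root.tag}
    stack = [root]
    while stack:
        parent = stack.pop()
        counts = {}
        for child in parent:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            paths[child] = f"{paths[parent]}/{child.tag}[{counts[child.tag]}]"
            stack.append(child)
    return paths

def prune(node):
    """Drop boilerplate and text-free subtrees in place.

    Returns True if the node still carries text or children.
    Simplification: tail text of a removed child is discarded with it.
    """
    for child in list(node):  # copy, because we remove while iterating
        if child.tag in BOILERPLATE_TAGS or not prune(child):
            node.remove(child)
    return bool((node.text or "").strip()) or len(node) > 0

def ground(root, paths, value):
    """Return the original XPath of a node whose text equals value, else None."""
    for element in root.iter():
        if (element.text or "").strip() == value.strip():
            return paths[element]
    return None  # ungrounded values are rejected rather than trusted

PAGE = """<html><body>
<nav><a>Home</a></nav>
<div><script>track()</script></div>
<div><h1>Acme Laptop</h1><span>$999</span></div>
<footer>Copyright</footer>
</body></html>"""

root = ET.fromstring(PAGE)
paths = xpath_index(root)   # index before pruning so paths refer to the source DOM
prune(root)

print(ET.tostring(root, encoding="unicode"))   # distilled context for the model
print(ground(root, paths, "$999"))             # /html/body[1]/div[2]/span[1]
print(ground(root, paths, "$1099"))            # None: value not present in the page
```

Indexing XPaths before pruning is a deliberate choice in this sketch: the path returned for a grounded value then points into the unmodified page, so the extraction stays traceable even though the model only saw the pruned tree.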

Related benchmarks

Task	Dataset	Metric	Result	Rank
Question Answering	WebSRC (dev)	EM	80.06	26
Question Answering	WebSRC (test)	EM	67.6	17
Structured Web Data Extraction	SWDE all domains (test)	F1 Score	88.1	10