
ReaderLM-v2: Small Language Model for HTML to Markdown and JSON

About

We present ReaderLM-v2, a compact 1.5-billion-parameter language model designed for efficient web content extraction. The model processes documents of up to 512K tokens, transforming messy HTML into clean Markdown or JSON with high accuracy, making it an ideal tool for grounding large language models. Its effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high-quality, diverse training data by iteratively drafting, refining, and critiquing web content extractions; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Extensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements.
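To make the HTML-to-Markdown task concrete, here is a toy rule-based extractor built on Python's standard-library `html.parser`. It is not ReaderLM-v2 or any part of its pipeline, just a minimal sketch of the conversion the model performs at scale on far messier markup; the class and function names are illustrative.

```python
from html.parser import HTMLParser


class SimpleMarkdownExtractor(HTMLParser):
    """Toy HTML-to-Markdown converter: headings, paragraphs, links, list items.
    Skips <script>/<style>/<nav> content, which real pages are full of."""

    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.out = []        # accumulated Markdown fragments
        self.skip_depth = 0  # >0 while inside a skipped element
        self.href = None     # href of the currently open <a>, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in ("h1", "h2", "h3"):
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a" and self.href is not None:
            self.out.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        # Drop text inside skipped elements and whitespace-only nodes.
        if self.skip_depth == 0 and data.strip():
            self.out.append(data)


def html_to_markdown(html: str) -> str:
    parser = SimpleMarkdownExtractor()
    parser.feed(html)
    return "".join(parser.out).strip()


html = ('<script>var x = 1;</script>'
        '<h1>Title</h1><p>See <a href="https://example.com">docs</a>.</p>')
print(html_to_markdown(html))
# → # Title
#   See [docs](https://example.com).
```

Hand-written rules like these break down quickly on real-world pages (nested layouts, boilerplate navigation, inline styling), which is the gap a learned extractor is meant to close.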

Feng Wang, Zesheng Shi, Bo Wang, Nan Wang, Han Xiao • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-hop Question Answering | Multi-hop RAG | F1 | 78.42 | 65 |
| Question Answering | TriviaQA | F1 | 79.74 | 46 |
| RAG Question Answering | NQ (Natural Questions) | F1 Score | 44.35 | 20 |
| RAG Question Answering | MuSiQue | F1 Score | 8.75 | 20 |
| RAG Question Answering | HotpotQA | F1 Score | 30.59 | 20 |
| Main HTML Extraction | MainWebBench | ROUGE-N F1 (All) | 22.64 | 15 |
| Web Content Extraction | WCEB 1.0 (all) | ROUGE-N F1 | 30.77 | 14 |
| Web Content Extraction | WCEB 1.0 (simple) | ROUGE-N F1 | 37.18 | 14 |
| Web Content Extraction | WCEB 1.0 (mid) | ROUGE-N F1 | 29.28 | 14 |
| Web Content Extraction | WCEB hard 1.0 | ROUGE-N F1 | 26.36 | 14 |
Showing 10 of 12 rows
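The extraction rows above are scored with ROUGE-N F1, the harmonic mean of n-gram precision and recall between the model's output and a reference extraction. Below is a minimal sketch of the metric for illustration, with clipped n-gram counts and whitespace tokenization; it is not the benchmarks' official scorer, whose tokenization and settings may differ.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """ROUGE-N F1: harmonic mean of n-gram precision and recall.
    Counts are clipped, so each reference n-gram is credited at most
    as many times as it occurs in the reference."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum((cand & ref).values())  # clipped match count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(rouge_n_f1("the cat sat", "the cat sat"))  # → 1.0
print(rouge_n_f1("the cat", "the cat sat"))      # → 0.8 (P=1.0, R=2/3)
```

An exact match gives F1 = 1.0; a truncated extraction keeps perfect precision but loses recall, dragging the F1 down, which is why partial extractions on hard pages score in the 20-30 range above.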
