
ReaderLM-v2: Small Language Model for HTML to Markdown and JSON

About

We present ReaderLM-v2, a compact 1.5-billion-parameter language model designed for efficient web content extraction. The model processes documents of up to 512K tokens, transforming messy HTML into clean Markdown or JSON with high accuracy, making it an ideal tool for grounding large language models. Its effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high-quality, diverse training data by iteratively drafting, refining, and critiquing web content extractions; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Extensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements.
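To make the HTML-to-Markdown task concrete, here is a toy rule-based extractor built on Python's standard-library `html.parser`. It is not ReaderLM-v2 or any part of its pipeline, just a minimal sketch of the conversion the model performs at scale on far messier markup; the class and function names are illustrative.

```python
from html.parser import HTMLParser


class SimpleMarkdownExtractor(HTMLParser):
    """Toy HTML-to-Markdown converter: headings, paragraphs, links, list items.
    Skips <script>/<style>/<nav> content, which real pages are full of."""

    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.out = []        # accumulated Markdown fragments
        self.skip_depth = 0  # >0 while inside a skipped element
        self.href = None     # href of the currently open <a>, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in ("h1", "h2", "h3"):
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a" and self.href is not None:
            self.out.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        # Drop text inside skipped elements and whitespace-only nodes.
        if self.skip_depth == 0 and data.strip():
            self.out.append(data)


def html_to_markdown(html: str) -> str:
    parser = SimpleMarkdownExtractor()
    parser.feed(html)
    return "".join(parser.out).strip()


html = ('<script>var x = 1;</script>'
        '<h1>Title</h1><p>See <a href="https://example.com">docs</a>.</p>')
print(html_to_markdown(html))
# → # Title
#   See [docs](https://example.com).
```

Hand-written rules like these break down quickly on real-world pages (nested layouts, boilerplate navigation, inline styling), which is the gap a learned extractor is meant to close.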

Feng Wang, Zesheng Shi, Bo Wang, Nan Wang, Han Xiao • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-hop Question Answering | Multi-hop RAG | F1 | 78.42 | 65 |
| Question Answering | TriviaQA | F1 | 79.74 | 46 |
| RAG Question Answering | NQ (Natural Questions) | F1 Score | 44.35 | 20 |
| RAG Question Answering | MuSiQue | F1 Score | 8.75 | 20 |
| RAG Question Answering | HotpotQA | F1 Score | 30.59 | 20 |
| Main HTML Extraction | MainWebBench | ROUGE-N F1 (All) | 22.64 | 15 |
| Web Content Extraction | WCEB 1.0 (all) | ROUGE-N F1 | 30.77 | 14 |
| Web Content Extraction | WCEB 1.0 (simple) | ROUGE-N F1 | 37.18 | 14 |
| Web Content Extraction | WCEB 1.0 (mid) | ROUGE-N F1 | 29.28 | 14 |
| Web Content Extraction | WCEB hard 1.0 | ROUGE-N F1 | 26.36 | 14 |
Showing 10 of 12 rows
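The extraction rows above are scored with ROUGE-N F1, the harmonic mean of n-gram precision and recall between the model's output and a reference extraction. Below is a minimal sketch of the metric for illustration, with clipped n-gram counts and whitespace tokenization; it is not the benchmarks' official scorer, whose tokenization and settings may differ.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """ROUGE-N F1: harmonic mean of n-gram precision and recall.
    Counts are clipped, so each reference n-gram is credited at most
    as many times as it occurs in the reference."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum((cand & ref).values())  # clipped match count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(rouge_n_f1("the cat sat", "the cat sat"))  # → 1.0
print(rouge_n_f1("the cat", "the cat sat"))      # → 0.8 (P=1.0, R=2/3)
```

An exact match gives F1 = 1.0; a truncated extraction keeps perfect precision but loses recall, dragging the F1 down, which is why partial extractions on hard pages score in the 20-30 range above.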
