WritingBench: A Comprehensive Benchmark for Generative Writing

About

Recent advancements in large language models (LLMs) have significantly enhanced text generation capabilities, yet evaluating their performance in generative writing remains a challenge. Existing benchmarks primarily focus on generic text generation or cover only a limited set of writing tasks, failing to capture the diverse requirements of high-quality written content across various domains. To bridge this gap, we present WritingBench, a comprehensive benchmark designed to evaluate LLMs across 6 core writing domains and 100 subdomains. We further propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria. This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluations in style, format and length. The framework's validity is further demonstrated by its data curation capability, which enables a 7B-parameter model to outperform GPT-4o in writing. We open-source the benchmark, along with evaluation tools and modular framework components, to advance the development of LLMs in writing.

Yuning Wu, Jiahao Mei, Ming Yan, Chenliang Li, Shaopeng Lai, Yuran Ren, Zijia Wang, Ji Zhang, Mengyue Wu, Qin Jin, Fei Huang • 2025
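
The query-dependent evaluation described in the abstract (criteria generated per query, then scored by a critic) can be illustrated with a minimal Python sketch. Everything named here is hypothetical: the `Criterion` type, the `evaluate` function, the toy generator and critic, the 1-10 score scale, and the use of a plain average are assumptions for illustration, not the released WritingBench tools. In practice both callables would be backed by an LLM and the fine-tuned critic model.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class Criterion:
    """One instance-specific assessment criterion (hypothetical structure)."""
    name: str
    description: str


# Assumed interfaces: the generator maps a query to criteria; the critic maps
# (query, response, criterion) to a numeric score. Both are illustrative.
CriteriaGenerator = Callable[[str], List[Criterion]]
Critic = Callable[[str, str, Criterion], float]


def evaluate(query: str, response: str,
             generate_criteria: CriteriaGenerator, critic: Critic) -> float:
    """Score a response against criteria generated for this specific query."""
    criteria = generate_criteria(query)
    if not criteria:
        raise ValueError("criteria generator returned no criteria")
    # Averaging the per-criterion scores is an assumption of this sketch.
    return mean(critic(query, response, c) for c in criteria)


def toy_generate_criteria(query: str) -> List[Criterion]:
    """Stand-in for an LLM call: adds a length criterion only when relevant."""
    criteria = [Criterion("relevance", "Addresses the request in the query.")]
    if "words" in query.lower():
        criteria.append(Criterion("length", "Respects the requested length."))
    return criteria


def toy_critic(query: str, response: str, criterion: Criterion) -> float:
    """Stand-in for a critic model: returns a score on an assumed 1-10 scale."""
    if criterion.name == "length":
        return 10.0 if len(response.split()) <= 50 else 1.0
    return 8.0


if __name__ == "__main__":
    score = evaluate("Write a product blurb in under 50 words.",
                     "A short blurb.", toy_generate_criteria, toy_critic)
    print(score)
```

The point of the design is that the set of criteria depends on the query: a request with a length constraint is scored on length, while an unconstrained request is not.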

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Multitask Language Understanding	MMLU	Accuracy	69.66	568
Long-context Reasoning	LongBench v2	Average Score	28.4	113
Multi-turn Conversation Evaluation	MT-Bench	MT-Bench Score	7.34	68
Long-form generation	LongBench Write-en	Sequence Length Success Rate	58.77	21
Long-form writing	LongBench-Write	Score	93.75	18
Long-form writing	Creative-W.B.	Score	79.27	18
Long-form writing	WritingBench	Score	84.95	18
Long-form generation	WritingBench length-constrained	L_R Score	8.35	14
Educational scholarly writing	EduResearchBench	Overall Score	2.44	8
