Open-ended generation

Benchmarks

Dataset Name	SOTA Method	Metric	Value	Count	Updated
Creative Writing Evaluation Prompts	Min-p	Average Judge Score	8.12	108	4mo ago
AlpacaEval 2.0	LLAMA-2-CHAT	Win Rate	648	49	18d ago
TruthfulQA	FineSteer	BLEURT Score	70.13	48	3mo ago
CNN DailyMail	ConfAdapt	ROUGE-L	24.3	40	4mo ago
TruthfulQA Without Rejected Samples open-ended (full)	CoCoASIG	Truthfulness	74.67	39	4mo ago
TruthfulQA With All Samples open-ended (full)	DoLa	Truthfulness	82.75	39	4mo ago
TriviaQA	ActCab	ECE	5.18	37	1mo ago
MLLMU-Bench (Retain Set)	PO	ROUGE-L	53.1	30	1mo ago
MLLMU-Bench (test)		ROUGE-L	34.5	30	1mo ago
WildBench		WildBench	0.479	26	4mo ago
Wikitext-103 (test)	DITTO	MAUVE	0.96	26	4mo ago
AlpacaEval 1.0	LLAMA-2-CHAT	Win Rate	7,904	23	4mo ago
LLaVA-Bench	LACING	GPT-4 Score	84.3	21	1mo ago
CARE-pro	GRPO	Score (Seen)	19.75	21	2mo ago
SciQ	GrACE	ECE	5.21	21	3mo ago
Vicuna		Skywork Reward V2 Score	99.1	18	3mo ago
Dolly	Distillable	Skywork Reward V2 Score	0.961	18	3mo ago
HelloBench (HB)		HB-A Score	84	17	3mo ago
WildBench (test)	Qwen3	WildBench Score	64.4	17	4mo ago
TruthfulQA Open-ended	ITI	True Score	99.6	16	4mo ago
MM-Vet		MM-Vet Score	45.55	14	2mo ago
LLaVA-Bench In-the-Wild		Score	109.3	14	2mo ago
Arena-Hard	AR-MAP	Score	84.6	14	4mo ago
NQ entity-swapped (test)	HICD	Exact Match	73.73	12	4mo ago
XSum 1,000 samples (test)	DoLA	ROUGE-L	23.11	12	4mo ago

Showing 25 of 56 rows