Instruction Following Evaluation

Benchmarks

Dataset Name	SOTA Method	Metric	Value		Updated
IFEval	DS-V3.1-Terminus (no_think)	IFEval Score	86.69	32	2mo ago
PPE-IFEval	Rubric-ARROW-voting@5	Score	76	24	1mo ago
Average (Vicuna, Self-instruct, Dolly, BPO) (test)	BPO-aligned gpt-3.5-turbo	Delta Win Rate (ΔWR)	22	24	4mo ago
IFBench	Rubric-ARROW-voting@5	Score	73.2	23	1mo ago
InfoBench	RM-R1-32B (Qwen-2.5-Inst)	Score	86.1	23	1mo ago
IFEval Inverse	Qwen3-30B	Accuracy	83.7	18	2mo ago
Vicuna Out-of-Distribution	SODA	GPT-4o Score	51.9	17	3mo ago
SelfInst Out-of-Distribution	SODA	GPT-4o Score	51.6	17	3mo ago
Dolly Out-of-Distribution	SODA	GPT-4o Score	49.9	17	3mo ago
LMSYS In-Dist.	SODA	GPT-4o Score	51.8	17	3mo ago
AlpacaEval 2	VRM-PPO	Win Rate	48.14	16	4mo ago
FollowBench OOD		HSR	78.06	14	1mo ago
ArenaHard v1	+RL (Skywork-Reward-V2-Llama-3.1-8B)	ArenaHardv1 Score	38	14	4mo ago
AlpacaEval 2.0 (test)	DAR	LC% over π0	54.17	10	4mo ago
HelpSteer2 (val)	SEE	Quality	71.8	9	1mo ago
Alpaca-Eval	FiMi-RM	Length-Controlled Win Rate	62.17	8	1mo ago
BelleEval	C-DPO	Score	87	6	2mo ago
Ours hard seed data		Score	56.73	5	4mo ago
SELF-INSTRUCT Ours		Score	74.29	5	4mo ago
SELF-INSTRUCT		Score	69.48	5	4mo ago
SELF-INSTRUCT seed data		Score	72.01	5	4mo ago
Instruction Tuning with GPT-4	Claude3	Score	71.29	5	4mo ago
WizardLM		Score	72.06	5	4mo ago
BPO Eval (test)	BPO	A Win Rate	58.5	5	4mo ago
Dolly Eval	BPO	A Win Rate	62	5	4mo ago

Showing 25 of 29 rows