SOTA Instruction Following Evaluation benchmarks and papers with code | Wizwand
Instruction Following Evaluation
Benchmarks
| Dataset Name | SOTA Method | Metric | Value | Results | Last Updated |
|---|---|---|---|---|---|
| Average (Vicuna, Self-instruct, Dolly, BPO) (test) | BPO-aligned gpt-3.5-turbo | Delta Win Rate (ΔWR) | 22 | 24 | 1mo ago |
| IFEval | DS-V3.1-Terminus (no_think) | IFEval Score | 86.69 | 20 | 1mo ago |
| Vicuna Out-of-Distribution | SODA | GPT-4o Score | 51.9 | 17 | 11d ago |
| SelfInst Out-of-Distribution | SODA | GPT-4o Score | 51.6 | 17 | 11d ago |
| Dolly Out-of-Distribution | SODA | GPT-4o Score | 49.9 | 17 | 11d ago |
| LMSYS In-Dist. | SODA | GPT-4o Score | 51.8 | 17 | 11d ago |
| AlpacaEval 2 | VRM-PPO | Win Rate | 48.14 | 16 | 1mo ago |
| ArenaHard v1 | +RL (Skywork-Reward-V2-Llama-3.1-8B) | ArenaHardv1 Score | 38 | 14 | 1mo ago |
| AlpacaEval 2.0 (test) | DAR | LC% over π0 | 54.17 | 10 | 1mo ago |
| Ours hard seed data | GPT-4 Turbo | Score | 56.73 | 5 | 1mo ago |
| SELF-INSTRUCT Ours | GPT-4 Turbo | Score | 74.29 | 5 | 1mo ago |
| SELF-INSTRUCT | GPT-4 Turbo | Score | 69.48 | 5 | 1mo ago |
| SELF-INSTRUCT seed data | GPT-4 Turbo | Score | 72.01 | 5 | 1mo ago |
| Instruction Tuning with GPT-4 | Claude3 | Score | 71.29 | 5 | 1mo ago |
| WizardLM | GPT-4 Turbo | Score | 72.06 | 5 | 1mo ago |
| BPO Eval (test) | BPO | A Win Rate | 58.5 | 5 | 1mo ago |
| Dolly Eval | BPO | A Win Rate | 62 | 5 | 1mo ago |
| Self-instruct Eval | BPO | Win Rate (A) | 56.7 | 5 | 1mo ago |
| Vicuna Eval | BPO | Win Rate (A) | 63.8 | 5 | 1mo ago |
| IFEval (random subset of 50 prompts) | DIRECTER | Task Fidelity | 85.9 | 3 | 1mo ago |
| IFEval (dev) | GPT-4 | Accuracy | 92 | 3 | 1mo ago |