Binary Fact-checking on MediaSum
[Chart: Macro-F1. Top score: 85.4 (Claude-3.7-Sonnet), Jan 10, 2026]
Evaluation Results
| Method | Details | Date | Macro-F1 |
|---|---|---|---|
| Claude-3.7-Sonnet | Model Category=The Sta... | 2026.01 | 85.4 |
| o3 | Model Category=The Sta... | 2026.01 | 82.9 |
| InFi-Checker-Qwen | Model Category=Special... | 2026.01 | 80.4 |
| GPT-5 | Model Category=The Sta... | 2026.01 | 80.2 |
| FactCG | Model Category=Special... | 2026.01 | 79.1 |
| Qwen3-8B | Model Category=The Ope... | 2026.01 | 77.7 |
| GPT-4.1 | Model Category=The Sta... | 2026.01 | 75.9 |
| AlignScore-large | Model Category=Special... | 2026.01 | 75.8 |
| MiniCheck | Model Category=Special... | 2026.01 | 74.3 |
| InFi-Checker-Llama | Model Category=Special... | 2026.01 | 73.5 |
| GPT-4o | Model Category=The Sta... | 2026.01 | 71.5 |
| ClearCheck (COT) | Model Category=Special... | 2026.01 | 67.8 |
| DeepSeek-V3.2-NoThink | Model Category=The Sta... | 2026.01 | 65.5 |
| Llama-3.1-8B-Instruct | Model Category=The Ope... | 2026.01 | 50.8 |
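
For readers unfamiliar with the metric, the following is a minimal sketch of how Macro-F1 is commonly computed for a binary task: F1 is calculated separately for each class and the two values are averaged with equal weight. The label encoding (1 and 0), the function names, and the toy data are assumptions made for illustration; the page does not describe the benchmark's actual evaluation script. The scores listed above appear to be on a 0-100 scale, whereas this sketch returns a value between 0 and 1.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0.0 when the class never appears
    # in either the labels or the predictions.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             classes: Sequence[int] = (0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical toy labels, not taken from the benchmark.
gold = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 1]
score = macro_f1(gold, pred)
print(f"Macro-F1: {100 * score:.1f}")
```

Because both classes carry equal weight, a checker that labels nearly every claim with the majority class scores poorly on Macro-F1 even when its accuracy is high.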