Complex reasoning on FEVER (val)
Top result: EvoPool, 84.67 Macro-F1 (Jun 1, 2026)
Evaluation Results

Method          Details               Date     Macro-F1
EvoPool         Backbone=gpt-4o-mini  2026.06  84.67
LLM annotation  Backbone=gpt-4o-mini  2026.06  79.32
Alchemist       Backbone=gpt-4o-mini  2026.06  16.65
DataSculpt      Backbone=gpt-4o-mini  2026.06  4.64
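
For readers unfamiliar with the metric, here is a minimal sketch of how a Macro-F1 score is commonly computed: per-class F1, averaged with equal weight per class. The function name `macro_f1`, the 0-100 scaling, and the three FEVER-style labels in the example are assumptions for illustration; the leaderboard does not state its exact evaluation script or label set.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, on a 0-100 scale (scale is an assumption).

    Classes are taken from the union of gold and predicted labels.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0

    # Count true positives, false positives, and false negatives per class.
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    # F1 = 2*TP / (2*TP + FP + FN); a class with no counts scores 0.
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)

    return 100.0 * sum(f1_scores) / len(f1_scores)


# Hypothetical toy example with FEVER-style labels.
gold = ["SUPPORTS", "SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]
pred = ["SUPPORTS", "REFUTES", "REFUTES", "NOT ENOUGH INFO"]
score = macro_f1(gold, pred)
```

Because every class counts equally, a method that collapses onto one label can score very low on Macro-F1 even when its raw accuracy looks acceptable.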