Multi-agent Reasoning on Ultrafeedback
Accuracy over time: OW-L, 73.66 (Oct 1, 2025)
Updated 14d ago
Evaluation Results
| Method | Setting | Date | Accuracy |
|---|---|---|---|
| OW-L | Ensemble=GPT-4o-2024-1... | 2025.10 | 73.66 |
| OW-I | Ensemble=GPT-4o-2024-1... | 2025.10 | 73.66 |
| ISP | Ensemble=GPT-4o-2024-1... | 2025.10 | 73.26 |
| Single Best | Ensemble=GPT-4o-2024-1... | 2025.10 | 73.14 |
| OW-I | Ensemble Size=All Eigh... | 2025.10 | 72.44 |
| OW-L | Ensemble Size=All Eigh... | 2025.10 | 72.44 |
| MV | Ensemble=GPT-4o-2024-1... | 2025.10 | 72.21 |
| ISP | Ensemble Size=All Eigh... | 2025.10 | 71.18 |
| MV | Ensemble Size=All Eigh... | 2025.10 | 70.23 |