Reasoning Quality Evaluation on HelpSteer1 sampled (train)
[Chart: Usefulness Score by date. MA-SAPO: 3.89 (Oct 18, 2025). Metrics available: Usefulness Score, Accuracy Score, Consistency Score, Mean Score.]
Updated 17d ago
Evaluation Results
Method        Configuration              Date     Usefulness Score  Accuracy Score  Consistency Score  Mean Score
MA-SAPO       Configuration=Multi-ag...  2025.10  3.89              3.87            4.02               3.93
Single-Agent  Configuration=Single-a...  2025.10  3.64              3.63            3.81               3.69
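
The Mean Score values listed above are consistent with an unweighted arithmetic mean of the Usefulness, Accuracy, and Consistency scores, rounded to two decimals. The page does not state this; it is an inference from the numbers. A minimal Python sketch of that check, where the dictionary layout and the `mean_score` helper are illustrative names and not part of the benchmark:

```python
# Scores copied from the Evaluation Results listing.
# Assumption (not stated on the page): Mean Score is the unweighted
# mean of the three per-criterion scores, rounded to two decimals.
scores = {
    "MA-SAPO": {
        "usefulness": 3.89,
        "accuracy": 3.87,
        "consistency": 4.02,
        "reported_mean": 3.93,
    },
    "Single-Agent": {
        "usefulness": 3.64,
        "accuracy": 3.63,
        "consistency": 3.81,
        "reported_mean": 3.69,
    },
}


def mean_score(row):
    """Hypothetical helper: average the three criterion scores."""
    parts = (row["usefulness"], row["accuracy"], row["consistency"])
    return round(sum(parts) / len(parts), 2)


for method, row in scores.items():
    # Print the recomputed mean next to the value reported on the page.
    print(method, mean_score(row), row["reported_mean"])
```

Under this assumption, the recomputed means match the reported Mean Score for both methods.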