
Polaris

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Mathematical Reasoning | Polaris (test) | Cumulative Rollouts (M): 4.92 | 16 |
| Human Correlation for Image Captioning Evaluation | Polaris | Kendall's Tau-c: 63.2 | 16 |
| Correlation with human judgments | Polaris (test) | Kendall's Tau-c: 0.578 | 16 |
| Multimodal Preference Evaluation | Polaris | tau_c: 57.8 | 10 |
| Mathematical Reasoning and Coding | Polaris | Peak Accuracy@8: 48.79 | 6 |
| Jailbreak Attack Evaluation | POLARIS | Attack Success Count: 520 | 2 |