Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

AssistantBench

Benchmarks

Task NameDataset NameSOTA ResultTrend
Reward ModelingAssistantBench
Pairwise Accuracy89.17
13
Web ReasoningAssistantBench
Accuracy31.8
8
Agentic Task PerformanceAssistantBench (test)
Easy Accuracy65.8
6
Showing 3 of 3 rows