
MultiChallenge

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Instruction Following | MultiChallenge (Out-of-Domain) | Overall Score: 38.5 | 23 |
| Reverse Chain-of-Thought Generation | MultiChallenge | Score: 45 | 20 |
| Instruction Following | MultiChallenge | Score: 65.98 | 10 |
| General-purpose Behavior | MultiChallenge | Score: 58.6 | 7 |
| Multi-turn Dialogue Reasoning | MultiChallenge | Accuracy: 32.97 | 4 |
| Medical Instruction Following | MultiChallenge | Pass@1: 66.8 | 4 |
| Relational Understanding | MultiChallenge | IM Score: 20.35 | 2 |