
Internal Benchmark

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Human Preference Prediction | internal benchmark | Accuracy: 82.2 | 14
Mathematical Reasoning | Internal Benchmark | Average Score: 65.5 | 5
Agent Action Safety Verification | internal benchmark 300-scenario | Verdict Accuracy: 95 | 5
General Multimodal Intelligence Evaluation | Internal Benchmark (test) | Overall Score: 61.6 | 5