
BEHEMOTH

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Tool Use | BEHEMOTH ToolBench (out-of-distribution) | Success Rate: 26.82 | 6 |
| Graduate-level Reasoning | BEHEMOTH GPQA Diamond (out-of-distribution) | Accuracy: 50 | 6 |
| Long-context Memory Evaluation | BEHEMOTH LongMemEval (out-of-distribution) | Accuracy: 63.07 | 6 |
| Memory Extraction | BEHEMOTH in-distribution (test) | Personalization (MA): 65.72 | 6 |