
SEAL

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Agent Capability Evaluation | SEAL 0 | Average Score (@8): 61.3 | 19 |
| Reasoning over conflicting evidence | SEAL-0 | Accuracy: 45.9 | 14 |
| Complex information-seeking | Seal-0 | Accuracy: 56.2 | 11 |
| Deep Search | SEAL 0 | Score: 41.44 | 11 |
| Complex Reasoning | Seal-0 | Accuracy (Seal-0): 53.4 | 8 |
| Agent Tool-use and Reasoning | SEAL (test) | Pass@3: 51.97 | 8 |
| Fact-seeking Question Answering | SEAL-0 | Accuracy: 10.8 | 4 |
| Agent | SEAL | Score: 57.4 | 2 |