SEAL

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Reasoning over conflicting evidence | SEAL-0 | Accuracy: 45.9 | 14 |
| Deep Search | SEAL 0 | Score: 41.44 | 11 |
| Agent Capability Evaluation | SEAL 0 | Average Score (@8): 61.3 | 10 |
| Agent Tool-use and Reasoning | SEAL (test) | Pass@3: 51.97 | 8 |
| Fact-seeking Question Answering | SEAL-0 | Accuracy: 10.8 | 4 |