FRAMES

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Error Detection | FRAMES (test) | Precision 97 | 36 |
| Error Detection | FRAMES | F1 Score 95 | 36 |
| Multi-hop Question Answering | Frames | ACCE 41.38 | 24 |
| Long-context Question Answering | FRAMES | Avg@4 Score 73.54 | 22 |
| Deep Research | FRAMES | Accuracy 56 | 14 |
| Question Answering | FRAMES | Accuracy 82.5 | 14 |
| Question Answering | FRAMES out-domain (test) | LasJ 31.31 | 11 |
| Multi-hop Factual Reasoning | FRAMES | Accuracy 82.3 | 10 |
| Fact Retrieval and Analysis | FRAMES | Accuracy 90.6 | 9 |
| Multi-hop Question Answering | FRAMES | Accuracy 50 | 8 |
| Agentic Reasoning | FRAMES n=50 (full) | Accuracy 77.31 | 8 |
| Search | Frames | Score 70.5 | 7 |
| Evidence Retrieval | FRAMES | Evidence Coverage Rate 55.8 | 6 |
| Out-of-Distribution Evaluation | Frames (OOD) | Avg@4 57.1 | 3 |
| Multi-hop Question Answering | FRAMES Small-scale (evaluation) | Search Count 3.2 | 1 |