
FRAMES

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Error Detection | FRAMES (test) | Precision: 97 | 36 |
| Error Detection | FRAMES | F1 Score: 95 | 36 |
| Multi-hop Question Answering | FRAMES | Accuracy: 86 | 34 |
| Multi-hop Question Answering | Frames | ACCE: 41.38 | 24 |
| Long-context Question Answering | FRAMES | Avg@4 Score: 73.54 | 22 |
| Long-context reasoning | FRAMES | Score: 83.5 | 18 |
| Agentic Search | Frames | String-F1: 36.6 | 14 |
| Deep Research | FRAMES | Accuracy: 56 | 14 |
| Question Answering | FRAMES | Accuracy: 82.5 | 14 |
| Document-level retrieval | FRAMES (test) | Recall: 73.3 | 13 |
| Document Question Answering | FRAMES | EM: 10.5 | 13 |
| Multi-hop Reasoning and Fact-checking | FRAMES | Average @3: 90.6 | 13 |
| Complex Reasoning | Frames | Accuracy: 90.6 | 13 |
| Information Retrieval | FRAMES | Recall: 81.5 | 11 |
| Question Answering | FRAMES out-domain (test) | LasJ: 31.31 | 11 |
| Multi-hop Factual Reasoning | FRAMES | Accuracy: 82.3 | 10 |
| Task-oriented Dialogue | Frames | Success Rate (SR): 50.57 | 9 |
| Fact Retrieval and Analysis | FRAMES | Accuracy: 90.6 | 9 |
| Agentic Reasoning | FRAMES n=50 (full) | Accuracy: 77.31 | 8 |
| Multi-step Reasoning and Factuality | FRAMES | Pass@1: 90.6 | 7 |
| Search | Frames | Score: 70.5 | 7 |
| Deep search QA | Frames | Accuracy: 46.42 | 6 |
| Evidence Retrieval | FRAMES | Evidence Coverage Rate: 55.8 | 6 |
| Multi-hop QA Retrieval | FRAMES | NDCG: 0.834 | 5 |
| Agentic tasks | Frames | Accuracy: 70.45 | 5 |
Showing 25 of 40 rows