
FRAMES

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Error Detection | FRAMES (test) | Precision: 97 | 36 |
| Error Detection | FRAMES | F1 Score: 95 | 36 |
| Multi-hop Question Answering | Frames | ACCE: 41.38 | 24 |
| Multi-hop Question Answering | FRAMES | Accuracy: 86 | 22 |
| Long-context Question Answering | FRAMES | Avg@4 Score: 73.54 | 22 |
| Agentic Search | Frames | String-F1: 36.6 | 14 |
| Deep Research | FRAMES | Accuracy: 56 | 14 |
| Question Answering | FRAMES | Accuracy: 82.5 | 14 |
| Question Answering | FRAMES out-domain (test) | LasJ: 31.31 | 11 |
| Multi-hop Factual Reasoning | FRAMES | Accuracy: 82.3 | 10 |
| Fact Retrieval and Analysis | FRAMES | Accuracy: 90.6 | 9 |
| Agentic Reasoning | FRAMES n=50 (full) | Accuracy: 77.31 | 8 |
| Search | Frames | Score: 70.5 | 7 |
| Deep search QA | Frames | Accuracy: 46.42 | 6 |
| Evidence Retrieval | FRAMES | Evidence Coverage Rate: 55.8 | 6 |
| Multi-hop QA Retrieval | FRAMES | NDCG: 0.834 | 5 |
| Agentic tasks | Frames | Accuracy: 70.45 | 5 |
| Multi-hop Question Answering | Frames out-of-domain | F1 Score: 0.413 | 4 |
| Query Routing | FRAMES In-Distribution (test) | CPT (90%): 77.9 | 4 |
| Query Routing | FRAMES OOD | CPT 85%: 68.74 | 4 |
| Query Routing | FRAMES | CPT (95%): 88.84 | 4 |
| Query Routing | FRAMES | CPT (90%): 78.61 | 4 |
| Model Routing | FRAMES (ID) | CPT (80%): 60.92 | 4 |
| Model Routing | FRAMES (ID queries) | CPT (85%) Score: 69.41 | 4 |
| Query Routing | FRAMES | Hypervolume: 0.8865 | 4 |
Showing 25 of 32 rows