
SAGE

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Safety evaluation | SAGE-Eval | Safety | 90 | 18 |
| LLM-as-a-Judge Robustness | Sage (Hard) | Factuality (IPI) | 55.9 | 13 |
| LLM-as-a-Judge Robustness | Sage Easy | Factuality Error (IPI) | 0.059 | 13 |
| Open-Ended Question Answering | SAGE Web Search | Weighted Recall (Com. Sci.) | 35.1 | 12 |
| Short-Form Question Answering | SAGE Web Search | Accuracy (Com. Sci.) | 63.3 | 12 |
| Multi-hop Question Answering | SAGE Small-scale (evaluation) | # Search | 4.9 | 1 |