
SAGE

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Safety evaluation | SAGE-Eval | Safety: 90 | 18 |
| Emotional Support Conversation | SAGE (test) | Sentience: 85.07 | 14 |
| LLM-as-a-Judge Robustness | Sage (Hard) | Factuality (IPI): 55.9 | 13 |
| LLM-as-a-Judge Robustness | Sage Easy | Factuality Error (IPI): 0.059 | 13 |
| Open-Ended Question Answering | SAGE Web Search | Weighted Recall (Com. Sci.): 35.1 | 12 |
| Short-Form Question Answering | SAGE Web Search | Accuracy (Com. Sci.): 63.3 | 12 |
| Multi-hop Question Answering | SAGE Small-scale (evaluation) | # Search: 4.9 | 1 |