RAG Question Answering on SCORE domain-specific hazard response dataset (test)
[Chart: Specificity. GPT-4o: 61 (Feb 10, 2026). Other metric views: Robustness (Para.), Robustness (Pert.), Answer Relevance.]
Updated 3mo ago
Evaluation Results
| Method | Evaluator | Date | Specificity | Robustness (Para.) | Robustness (Pert.) | Answer Relevance |
|---|---|---|---|---|---|---|
| GPT-4o | GPT-4o | 2026.02 | 61 | 89 | 60 | 69 |
| GPT-4o | Qwen3-8B | 2026.02 | 58 | 89 | 66 | 73 |
| Gemini 2.5 | GPT-4o | 2026.02 | 55 | 87 | 87 | 68 |
| Gemini 2.5 | Qwen3-8B | 2026.02 | 45 | 86 | 66 | 83 |