Faithfulness Evaluation on ArXiv (test)
[Chart: SummaC scores; top entry o3 at 53.58, Dec 3, 2025. Metrics shown: SummaC, AlignScore]
Updated 4d ago
Evaluation Results
| Method | Configuration | Links | SummaC | AlignScore |
| o3 | Model Category=LRM, Pr... | 2025.12 | 53.58 | 85.22 |
| GPT-5 | Model Category=LRM, Pr... | 2025.12 | 45.9 | 85.55 |
| o1 | Model Category=LRM, Pr... | 2025.12 | 44.24 | 27.61 |
| Vanilla | Base Model=GPT-4.1, Pr... | 2025.12 | 43.36 | 26.07 |
| Extract-to-Abstract (E2A) | Base Model=GPT-4.1, Pr... | 2025.12 | 42.96 | 8.68 |
| Cited Summarization (Cite) | Base Model=GPT-4.1, Pr... | 2025.12 | 41.79 | 5.99 |
| Self-Consistency (SC) | Base Model=GPT-4.1, Pr... | 2025.12 | 40.68 | 8 |
| Chain-of-Thought (COT) | Base Model=GPT-4.1, Pr... | 2025.12 | 40.54 | 7.46 |
| Decomposition (Deco) | Base Model=GPT-4.1, Pr... | 2025.12 | 40.37 | 17.88 |
| Question-Answer Guided (QAG) | Base Model=GPT-4.1, Pr... | 2025.12 | 40.37 | 26.94 |
| Iterative Refine (IR) | Base Model=GPT-4.1, Pr... | 2025.12 | 39.87 | 5.65 |
| Plan-then-Write (Plan) | Base Model=GPT-4.1, Pr... | 2025.12 | 39.09 | 9.52 |