Text Similarity on Insurance tasks HQ subset N = 1334
Chart: Mean Score by method. Best result: DeepSeek-R1 + Fine-tune, Mean Score 89.9 (Feb 18, 2026).
Evaluation Results
| Method | Count | Date | Mean Score | Standard Deviation | Median Score | Maximum Score | P(SimBERT >= Kappa) |
|---|---|---|---|---|---|---|---|
| DeepSeek-R1 + Fine-tune | 1334 | 2026.02 | 89.9 | 0.132 | 94.6 | 100 | 86.3 |
| Gemini-2.5-Flash | 1334 | 2026.02 | 81.7 | 0.146 | 86.7 | 97.7 | 73.3 |
| GPT-4o-mini | 1334 | 2026.02 | 80.6 | 0.149 | 85.6 | 97.4 | 70.5 |
| Claude-Haiku-4.5 | 1334 | 2026.02 | 77.6 | 0.159 | 81.9 | 97 | 62.4 |
| GPT-4.1 | 1334 | 2026.02 | 76.7 | 0.152 | 80.9 | 97.7 | 60.5 |
| GPT-5.2 | 1334 | 2026.02 | 73.6 | 0.155 | 77.3 | 100 | 50.7 |