Semantic Similarity Evaluation on Insurance Tasks N=1500 (test)
[Chart: Mean BERT Cosine Similarity. Best result: 0.869, DeepSeek-R1 + Fine-tune, Feb 18, 2026.]
Updated 4d ago
Evaluation Results

| Method | Method details | Date | Mean BERT Cosine Similarity | Std Dev BERT Cosine Similarity | Median BERT Cosine Similarity | Max BERT Cosine Similarity | P(Similarity >= Kappa) (%) |
|---|---|---|---|---|---|---|---|
| DeepSeek-R1 + Fine-tune | Mode=Fine-tuned, Numbe... | 2026.02 | 0.869 | 0.159 | 0.929 | 1 | 79.1 |
| Gemini-2.5-Flash | | 2026.02 | 0.799 | 0.16 | 0.857 | 0.977 | 68.5 |
| GPT-4o-mini | | 2026.02 | 0.787 | 0.163 | 0.844 | 0.974 | 65.6 |
| Claude-Haiku-4.5 | | 2026.02 | 0.757 | 0.17 | 0.803 | 0.974 | 57.5 |
| GPT-4.1 | | 2026.02 | 0.749 | 0.163 | 0.793 | 0.977 | 56.1 |
| GPT-5.2 | | 2026.02 | 0.719 | 0.167 | 0.762 | 1 | 47.2 |
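
The five reported metrics are summary statistics over per-example similarity scores. The sketch below shows one plausible way to compute them from a list of cosine similarities. It is an illustration under stated assumptions, not the benchmark's actual evaluation code. The function names, the use of population standard deviation, and the toy value of `kappa` are all hypothetical, because the page does not state the embedding model, the threshold, or the exact estimator.

```python
import math
import statistics


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero vector has similarity 0.
        return 0.0
    return dot / (norm_u * norm_v)


def summarize(similarities, kappa):
    """Summary statistics matching the leaderboard's column names.

    Assumptions (not confirmed by the source):
    - standard deviation is the population form (statistics.pstdev);
    - P(Similarity >= Kappa) is reported as a percentage of examples.
    """
    n = len(similarities)
    if n == 0:
        raise ValueError("need at least one similarity score")
    hits = sum(1 for s in similarities if s >= kappa)
    return {
        "mean": statistics.fmean(similarities),
        "std_dev": statistics.pstdev(similarities),
        "median": statistics.median(similarities),
        "max": max(similarities),
        "pct_at_or_above_kappa": 100.0 * hits / n,
    }


if __name__ == "__main__":
    # Toy scores and a hypothetical threshold, for illustration only.
    toy_scores = [0.5, 0.75, 1.0, 0.75]
    print(summarize(toy_scores, kappa=0.75))
```

In practice the inputs to `cosine_similarity` would be embeddings of the model answer and the reference answer for each of the N=1500 test examples, and `summarize` would run once per method.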