Spoken Dialogue System (SDS) Semantic Quality Evaluation on Eval2000 (test)
[Chart: ROUGE-L by method. Top result: 12.1 (Multi turn CoT E2E), Jan 27, 2026. Selectable metrics: ROUGE-L, Perplexity, AutoBLEU, LLM Judge Score, Low Score Response Rate (<5), Win Rate.]
Updated 4d ago
Evaluation Results

| Method | Details | Date | ROUGE-L | Perplexity | AutoBLEU | LLM Judge Score | Low Score Response Rate (<5) | Win Rate |
|---|---|---|---|---|---|---|---|---|
| Multi turn CoT E2E | type=E2E baseline | 2026.01 | 12.1 | 21.2 | 68.3 | 6.18 | 10.2 | - |
| Multi turn CoT E2E + RLAIF (Single-Reward) | Backbone=Multi turn Co... | 2026.01 | 11.9 | 19.9 | 56.5 | 6.33 | 7.1 | 55.4 |
| Multi turn CoT E2E + RLAIF (Joint-Reward-v2) | Backbone=Multi turn Co... | 2026.01 | 11.9 | 19.9 | 59.9 | 6.33 | 7.5 | 54.4 |
| Multi turn CoT E2E + RLAIF (Joint-Reward-v1) | Backbone=Multi turn Co... | 2026.01 | 11.8 | 19.6 | 61.3 | 6.29 | 8.5 | 52.6 |
| Direct E2E | type=E2E baseline | 2026.01 | 8.4 | 302.2 | 51.5 | 5.5 | 24.2 | - |
| Moshi | type=Duplex SDS baseline | 2026.01 | 8.1 | 136.5 | 57.8 | 5.71 | 21 | - |
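
For readers unfamiliar with the ROUGE-L column: ROUGE-L is conventionally an F-measure over the longest common subsequence (LCS) of reference and hypothesis tokens. The following is a minimal sketch of that idea, not the benchmark's scoring code. It assumes lowercased whitespace tokenization and equal weighting of precision and recall; the function names `lcs_length` and `rouge_l_f1` are hypothetical, and the leaderboard's actual tokenization, stemming, and scaling (its values appear to be on a 0-100 scale) may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # Dynamic programming, keeping only the previous row of the LCS table.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, hypothesis):
    """ROUGE-L F1 in [0, 1] (assumption: whitespace tokens, beta = 1)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref or not hyp:
        return 0.0
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)  # share of hypothesis tokens in the LCS
    recall = lcs / len(ref)     # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
    print(f"ROUGE-L F1 (0-100 scale): {100 * score:.1f}")
```

Because the LCS only requires tokens to appear in the same order, not contiguously, this score rewards responses that preserve the reference's word order even when extra words are interleaved.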