Scene-aware visually driven speech synthesis on Vivid-210K (test)
[Chart: WER. VividVoice: 7.15 (Feb 1, 2026). Metrics available: WER, FAD, KL Divergence, CLAPcap, MOS (Content), MOS (Timbre), MOS (Scene Consistency), MOS (Naturalness).]
Updated 4d ago
Evaluation Results
| Method | Date | WER | FAD | KL Divergence | CLAPcap | MOS (Content) | MOS (Timbre) | MOS (Scene Consistency) | MOS (Naturalness) |
|---|---|---|---|---|---|---|---|---|---|
| VividVoice | 2026.02 | 7.15 | 3.98 | 1.53 | 0.25 | 3.95 | 3.08 | 4.3 | 3.88 |
| VoiceLDM | 2026.02 | 9.23 | 4.74 | 1.79 | 0.27 | 3.23 | 1.75 | 2.56 | 3.41 |
| GT | 2026.02 | 10.62 | - | - | 0.39 | 4.36 | 4.03 | 4.11 | 4.25 |