Optimal Transport Audio Distance with Learned Riemannian Ground Metrics
About
In audio generation evaluation, Fréchet Audio Distance (FAD) is a 2-Wasserstein distance that places structural constraints on both of its primitives: the cost is a pullback through a frozen embedding whose invariance set hides severe artifacts, and the coupling is a Gaussian fit that dilutes rank-1 contamination relative to discrete OT. We propose Optimal Transport Audio Distance (OTAD), which corrects each primitive with one dedicated mechanism: a residual Riemannian ground-metric adapter for the cost and entropic Sinkhorn optimal transport for the coupling. Across eight encoders under a four-axis protocol, coupling-only comparisons at $\epsilon = 0.05$ show that Sinkhorn's rank-1 sensitivity exceeds FAD's by a factor of 1.9 to 3.6. OTAD also achieves a higher mean Spearman correlation with audio-quality MOS (DCASE 2023 Task 7) than baseline metrics. Because the discrete transport plan assigns mass to individual samples, OTAD yields per-sample diagnostics with AUROC $\ge 0.86$, a capability that scalar- or kernel-aggregated metrics structurally lack.
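To make the contrast between the two couplings concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a plain squared Euclidean cost on precomputed embeddings in place of the learned Riemannian ground-metric adapter, and it rescales the cost matrix to [0, 1] so that $\epsilon = 0.05$ has a fixed meaning; the rescaling is an assumption of this sketch. The function names `frechet_distance`, `sinkhorn_plan`, and `per_sample_cost` are hypothetical.

```python
import numpy as np


def frechet_distance(x, y):
    """Squared 2-Wasserstein distance between Gaussian fits of x and y (FAD-style)."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # tr((cov_x cov_y)^{1/2}) is the sum of square roots of the eigenvalues
    # of cov_x @ cov_y, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_x - mu_y) ** 2)
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)


def _logsumexp(a, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(a - m).sum(axis=axis))


def sinkhorn_plan(x, y, epsilon=0.05, n_iters=200):
    """Entropic OT plan between uniform empirical measures on x and y.

    Returns (plan, cost). The cost is squared Euclidean, rescaled to [0, 1]
    (an assumption of this sketch, not something stated in the source).
    """
    n, m = len(x), len(y)
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    cost = cost / cost.max()
    log_a = np.full(n, -np.log(n))
    log_b = np.full(m, -np.log(m))
    f = np.zeros(n)
    g = np.zeros(m)
    # Log-domain Sinkhorn updates of the dual potentials f and g.
    for _ in range(n_iters):
        f = epsilon * (log_a - _logsumexp((g[None, :] - cost) / epsilon, axis=1))
        g = epsilon * (log_b - _logsumexp((f[:, None] - cost) / epsilon, axis=0))
    plan = np.exp((f[:, None] + g[None, :] - cost) / epsilon)
    return plan, cost


def per_sample_cost(plan, cost):
    """Expected transport cost of each row sample under the plan."""
    return (plan * cost).sum(axis=1) / plan.sum(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(64, 8))   # stand-in for reference embeddings
    generated = rng.normal(size=(64, 8))   # stand-in for generated embeddings
    generated[:3] += 10.0                  # a few grossly corrupted samples
    plan, cost = sinkhorn_plan(generated, reference)
    print("Frechet distance:", frechet_distance(generated, reference))
    print("Sinkhorn cost:", float((plan * cost).sum()))
    print("Highest per-sample costs:", np.argsort(per_sample_cost(plan, cost))[-3:])
```

The Gaussian fit reduces each set to a mean and covariance before comparing, whereas the Sinkhorn plan keeps one row per generated sample, which is what makes a per-sample score such as `per_sample_cost` available at all.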
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Audio Quality Assessment | DCASE Task 7 System-level n=9 2023 | -- | 8 |
| Audio Quality Assessment | DCASE Task 7 Per-category granularity 2023 | -- | 6 |
| Artifact Detection | Per-sample diagnostics Gaussian noise | -- | 5 |
| Artifact Detection | Per-sample diagnostics Cross-class noise | -- | 5 |
| Artifact Detection | Per-sample diagnostics Silence noise | -- | 5 |
| Audio category-fit MOS correlation assessment | DCASE Task 7 2023 (test) | -- | 2 |
| Audio Quality Assessment | DCASE Task 7 Per-category n=63 2023 | -- | 2 |