
Optimal Transport Audio Distance with Learned Riemannian Ground Metrics

About

In audio generation evaluation, Fréchet Audio Distance (FAD) is a 2-Wasserstein distance with structural constraints on both of its primitives: the cost is a frozen embedding pullback whose invariance set hides severe artifacts, and the coupling is a Gaussian fit that dilutes rank-1 contamination relative to discrete OT. We propose Optimal Transport Audio Distance (OTAD), which corrects each primitive with one dedicated mechanism: a residual Riemannian ground-metric adapter for the cost and entropic Sinkhorn optimal transport for the coupling. Across eight encoders under a four-axis protocol, coupling-only comparisons at $\epsilon = 0.05$ show that Sinkhorn's rank-1 sensitivity exceeds FAD's by a factor of 1.9 to 3.6. Furthermore, OTAD achieves a higher mean Spearman correlation with audio-quality MOS (DCASE 2023 Task 7) than baseline metrics. As an intrinsic benefit of the discrete transport plan, OTAD yields per-sample diagnostics with AUROC $\ge 0.86$, a capability that scalar- or kernel-aggregated metrics structurally lack.

Wonwoo Jeong • 2026
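
The coupling side of the abstract (entropic Sinkhorn transport whose discrete plan yields per-sample diagnostics) can be illustrated with a minimal NumPy sketch. This is not the author's implementation: the learned Riemannian ground-metric adapter is omitted and replaced by a plain squared Euclidean cost, the cost is rescaled by its maximum so that $\epsilon = 0.05$ is scale-free, and the names `sinkhorn_plan` and `per_sample_cost` as well as the per-sample score itself (average cost of the mass arriving at each generated sample) are assumptions made for illustration.

```python
import numpy as np

def _logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis.
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def sinkhorn_plan(x, y, eps=0.05, n_iters=500):
    """Entropic OT plan between two embedding sets with uniform weights.

    Assumption: squared Euclidean cost, rescaled to [0, 1]. The paper's
    learned ground metric is NOT reproduced here.
    """
    n, m = len(x), len(y)
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    cost = cost / cost.max()
    log_a = np.full(n, -np.log(n))
    log_b = np.full(m, -np.log(m))
    f = np.zeros(n)  # dual potential for the reference set
    g = np.zeros(m)  # dual potential for the generated set
    for _ in range(n_iters):
        # Log-domain Sinkhorn updates: match row, then column marginals.
        f = eps * (log_a - _logsumexp((g[None, :] - cost) / eps, axis=1))
        g = eps * (log_b - _logsumexp((f[:, None] - cost) / eps, axis=0))
    plan = np.exp((f[:, None] + g[None, :] - cost) / eps)
    return plan, cost

def per_sample_cost(plan, cost):
    # Hypothetical diagnostic: mean transport cost of the mass received
    # by each generated sample (one score per column of the plan).
    return (plan * cost).sum(axis=0) / plan.sum(axis=0)

# Toy data: 64 reference embeddings and 64 generated embeddings,
# the first 4 of which are shifted far away to mimic severe artifacts.
rng = np.random.default_rng(0)
reference = rng.normal(size=(64, 8))
generated = rng.normal(size=(64, 8))
generated[:4] += 10.0

plan, cost = sinkhorn_plan(reference, generated, eps=0.05)
scores = per_sample_cost(plan, cost)
print("mean score, contaminated:", scores[:4].mean())
print("mean score, clean:       ", scores[4:].mean())
```

Because each generated sample must receive its share of mass, a contaminated sample cannot be averaged away: its column of the plan carries a high cost, which is what makes a per-sample score readable from the plan. A Gaussian fit, by contrast, summarises each set by a mean and covariance only and exposes no per-sample quantity.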

Related benchmarks

Task | Dataset | Result | Rank
Audio Quality Assessment | DCASE Task 7 System-level n=9 2023 | -- | 8
Audio Quality Assessment | DCASE Task 7 Per-category granularity 2023 | -- | 6
Artifact Detection | Per-sample diagnostics Gaussian noise | -- | 5
Artifact Detection | Per-sample diagnostics Cross-class noise | -- | 5
Artifact Detection | Per-sample diagnostics Silence noise | -- | 5
Audio category-fit MOS correlation assessment | DCASE Task 7 2023 (test) | -- | 2
Audio Quality Assessment | DCASE Task 7 Per-category n=63 2023 | -- | 2
