Enhancing Acoustic-to-Articulatory Speech Inversion by Incorporating Nasality

About

Speech is produced through the coordination of the vocal tract's constricting organs: lips, tongue, velum, and glottis. Previous work developed Speech Inversion (SI) systems to recover acoustic-to-articulatory mappings for lip and tongue constrictions, called oral tract variables (TVs), which were later enhanced by including source information (periodic and aperiodic energies, and fundamental frequency, F0) as proxies for glottal control. Comparison of nasometric measures with high-speed nasopharyngoscopy showed that nasalance can serve as ground truth, and that an SI system trained with it reliably recovers velum movement patterns for American English speakers. Here, two SI training approaches are compared: baseline models that estimate oral TVs and nasalance independently, and a synergistic model that combines oral TVs and source features with nasalance. The synergistic model shows relative improvements of 5% in oral TV estimation and 9% in nasalance estimation compared to the baseline models.
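The gains above are stated as relative rather than absolute improvements. As a minimal sketch of that arithmetic, assuming a positive, higher-is-better score (the abstract does not name the metric), the relative improvement of one model over another could be computed as below. The function name `relative_improvement` and the two scores are hypothetical placeholders chosen for illustration; they are not values from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the change of `improved` over `baseline` as a fraction of `baseline`.

    Assumes a positive, higher-is-better score.
    """
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (improved - baseline) / baseline


# Hypothetical scores for illustration only (not taken from the paper).
baseline_score = 0.80
synergistic_score = 0.84

gain = relative_improvement(baseline_score, synergistic_score)
print(f"relative improvement: {gain:.1%}")
```

Note that a relative gain depends on the baseline value: the same absolute difference yields a larger relative improvement when the baseline score is lower.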

Saba Tabatabaee, Suzanne Boyce, Liran Oren, Mark Tiede, Carol Espy-Wilson • 2025

Related benchmarks

Task              Dataset                              Result    Rank
Speech Inversion  XRMB with non-babble noise (test)    LA 0.9    12
Speech Inversion  XRMB with babble noise (test)        LA 0.89   12
Speech Inversion  XRMB clean (test)                    LA 0.91   3
