Enhancing Acoustic-to-Articulatory Speech Inversion by Incorporating Nasality

About

Speech is produced through the coordination of vocal tract constricting organs: lips, tongue, velum, and glottis. Previous work developed Speech Inversion (SI) systems to recover acoustic-to-articulatory mappings for lip and tongue constrictions, called oral tract variables (TVs), which were later enhanced by including source information (periodic and aperiodic energies, and fundamental frequency, F0) as proxies for glottal control. Comparison of nasometric measures with high-speed nasopharyngoscopy showed that nasalance can serve as ground truth, and that an SI system trained with it reliably recovers velum movement patterns for American English speakers. Here, two SI training approaches are compared: baseline models that estimate oral TVs and nasalance independently, and a synergistic model that combines oral TVs and source features with nasalance. Compared to the baseline models, the synergistic model shows relative improvements of 5% in oral TV estimation and 9% in nasalance estimation.
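The relative improvements quoted above can be read as a percentage change over the baseline score. The following is a minimal sketch of that arithmetic, assuming a higher-is-better evaluation score (for example, a correlation between estimated and reference trajectories). The abstract does not state the metric or the underlying scores, so the function name and the numbers below are hypothetical and chosen only to illustrate the calculation.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent.

    Assumes a higher-is-better score; this is an illustrative helper,
    not a function taken from the paper.
    """
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (improved - baseline) / baseline


# Hypothetical scores, NOT results reported in the paper.
baseline_score = 0.80      # assumed score of an independent baseline model
synergistic_score = 0.84   # assumed score of the combined model

gain = relative_improvement(baseline_score, synergistic_score)
print(f"relative improvement: {gain:.1f}%")
```

With these assumed scores, an absolute gain of 0.04 corresponds to a relative gain of about 5%, which shows why a relative figure should not be confused with an absolute difference in the score.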

Saba Tabatabaee, Suzanne Boyce, Liran Oren, Mark Tiede, Carol Espy-Wilson • 2025

Related benchmarks

Task	Dataset	Result
Speech Inversion	XRMB with non-babble noise (test)	LA 0.9	12
Speech Inversion	XRMB with babble noise (test)	LA 0.89	12
Speech Inversion	XRMB clean (test)	LA 0.91	3
Articulatory Parameter Estimation	XRMB (test)	LA 0.91	2
