Lost in Speech: Benchmarking, Evaluation, and Parsing of Spoken Bilingual Conversational Language Beyond Standard UD Assumptions

About

Spoken bilingual conversations pose substantial challenges for syntactic parsing because they often include disfluencies and discourse-driven structures that complicate dependency parsing under standard Universal Dependencies (UD) assumptions and evaluation practices. To study these challenges systematically, we first introduce a linguistically grounded taxonomy of conversational bilingual phenomena, together with SpokeBench, an expert-annotated English-Spanish benchmark for structurally complex speech. To address the limitations of existing evaluation practices, we propose Flex-UD, an ambiguity-aware evaluation metric that distinguishes catastrophic structural failures from linguistically acceptable variations. Finally, we introduce DECAP, a decoupled agentic parsing framework that separates spoken-phenomena handling from core syntactic analysis, enabling robust and interpretable dependency parsing without retraining. Experiments across both proprietary and open-weight LLMs show that DECAP substantially improves performance on complex conversational phenomena and achieves an improvement of over 60% in UPOS F1 over baselines, while Flex-UD evaluations reveal gains that remain partially hidden under standard attachment-based metrics.

Nemika Tyagi, Olga Kellert, Holly Hendrix, Nelvin Licona-Guevara, Justin Mackie, Phanos Kareen, Megan Michelle Smith, Tatiana Gallego Hernande, Samhitha Harish, Chitta Baral • 2026

Related benchmarks

Task	Dataset	Result
Syntactic Parsing	SpokeBench 1.0 (test)	LAS 0.39	33
Universal Dependency Parsing	SpokeBench Contr. (EN) v1 (test)	ID Score 72.5	3
Universal Dependency Parsing	SpokeBench Contr. (ES) v1 (test)	ID Score 80	3
Universal Dependency Parsing	SpokeBench Repetition v1 (test)	ID Score 72	3
Universal Dependency Parsing	SpokeBench Repetition+ v1 (test)	ID Score 70	3
Universal Dependency Parsing	SpokeBench Ellipses v1 (test)	ID Score 51.8	3
Universal Dependency Parsing	SpokeBench Ellipses+ v1 (test)	ID Score 60	3
Universal Dependency Parsing	SpokeBench Discourse v1 (test)	ID Accuracy 63.5	3
Universal Dependency Parsing	SpokeBench Discourse+ v1 (test)	ID Accuracy 60.4	3
Universal Dependency Parsing	SpokeBench Complex v1 (test)	ID Accuracy 55.8	3

Showing 10 of 12 rows
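
For readers unfamiliar with the LAS figure in the table, the following is a minimal sketch of how the standard attachment scores (UAS and LAS) are usually computed for a single sentence. It assumes the gold and predicted parses share the same tokenization, which official evaluation scripts do not require, and it does not attempt to reproduce Flex-UD or the ID Score, neither of which is defined on this page. The function name `attachment_scores` and the example sentence are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Each token is (head_index, dependency_label); head 0 denotes the root.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) for one sentence, assuming identical tokenization."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted parses must have the same length")
    if not gold:
        return 0.0, 0.0
    n = len(gold)
    # UAS: fraction of tokens whose predicted head matches the gold head.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: fraction of tokens whose head AND label both match.
    full_hits = sum(g == p for g, p in zip(gold, pred))
    return head_hits / n, full_hits / n


# Hypothetical disfluent utterance: "um I I want coffee" (tokens 1..5).
gold = [(4, "discourse"), (3, "reparandum"), (4, "nsubj"), (0, "root"), (4, "obj")]
pred = [(4, "discourse"), (4, "nsubj"), (4, "nsubj"), (0, "root"), (4, "nmod")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

In this toy case the repeated "I" is attached to the wrong head and "coffee" receives the wrong label, so both errors count equally against LAS. That equal weighting of all mismatches is the behaviour the abstract contrasts with an ambiguity-aware metric.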
