
Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation

About

In this work, we introduce Speech-Copilot, a modular framework for instruction-oriented speech-processing tasks that minimizes human effort in toolset construction. Unlike end-to-end methods using large audio-language models, Speech-Copilot builds speech processing-specific toolsets by analyzing pre-collected task instructions and breaking tasks into manageable sub-tasks. It features a flexible agent based on large language models that performs tasks through program generation. Our approach achieves state-of-the-art performance on the Dynamic-SUPERB benchmark, demonstrating its effectiveness across diverse speech-processing tasks. Key contributions include: 1) developing an innovative framework for speech processing-specific toolset construction, 2) establishing a high-performing agent based on large language models, and 3) offering a new perspective on addressing challenging instruction-oriented speech-processing tasks. Without additional training processes required by end-to-end approaches, our method provides a flexible and extendable solution for a wide range of speech-processing applications.

Chun-Yi Kuan, Chih-Kai Yang, Wei-Ping Huang, Ke-Han Lu, Hung-yi Lee • 2024
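
To make the "agent that performs tasks through program generation" idea in the abstract more concrete, here is a minimal sketch of the general pattern: a registry of callable modules, plus a program that composes them to answer an instruction. It is not the authors' implementation. Every name in it (`TOOLSET`, `speech_recognition`, `count_words`, `fake_llm_generate_program`, `run_agent`) is hypothetical, the "LLM" is a stub that returns a fixed program, and the "speech recognition" module simply reads a transcript that is already attached to the input.

```python
from typing import Any, Callable, Dict

# Hypothetical toolset registry. The module names and behaviors below are
# placeholders for illustration, not the modules used in the paper.
TOOLSET: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function as a module the generated program may call."""
    TOOLSET[fn.__name__] = fn
    return fn


@tool
def speech_recognition(audio: dict) -> str:
    """Stand-in for an ASR module: returns a transcript stored in the input."""
    return audio["transcript"]


@tool
def count_words(text: str) -> int:
    """Count whitespace-separated words in a transcript."""
    return len(text.split())


def fake_llm_generate_program(instruction: str) -> str:
    """Stub for the LLM call (assumption): always returns the same program.

    A real agent would prompt a language model with the instruction and the
    toolset description, and receive program text in response.
    """
    return (
        "def solve(audio):\n"
        "    text = speech_recognition(audio)\n"
        "    return count_words(text)\n"
    )


def run_agent(instruction: str, audio: dict) -> Any:
    """Generate a program for the instruction, then run it over the toolset."""
    program = fake_llm_generate_program(instruction)
    namespace: Dict[str, Any] = dict(TOOLSET)  # expose the registered tools
    exec(program, namespace)  # defines solve() inside the namespace
    return namespace["solve"](audio)


result = run_agent(
    "How many words are spoken in the audio?",
    {"transcript": "hello there world"},
)
print(result)
```

In this pattern, adding a capability means registering another module rather than retraining a model, which is the kind of extensibility the abstract claims. Running model-generated code with `exec` is unsafe outside a sandbox, so a real system would need isolation that this sketch omits.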

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Assessment Correlation | RELATE | LCC | 0.405 | 25 |
| Audio Assessment Correlation | PAM | LCC | 0.474 | 23 |
| Audio-Text Alignment Evaluation | BRACE Clotho-Main 1.0 (test) | HH | 76.2 | 20 |
| Text-Audio Alignment | Baton two-sound-event | AUC | 0.59 | 20 |
| Hallucination Detection | BRACE Hallucination 1.0 (test) | AudioCaps Score | 95.4 | 20 |
| Text-Audio Alignment | Baton three-sound-event | AUC | 0.55 | 20 |
| Audio-Text Alignment Evaluation | BRACE AudioCaps-Main 1.0 (test) | HH | 57.6 | 20 |
| Text-Audio Alignment | RELATE-Pair | Pair Accuracy | 68.6 | 20 |
| Text-Audio Alignment | Baton-Pair two-sound-event | Pair Accuracy | 46.1 | 20 |
| Text-Audio Alignment | Baton-Pair three-sound-event | Pair Accuracy | 42 | 20 |