Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation
About
In this work, we introduce Speech-Copilot, a modular framework for instruction-oriented speech-processing tasks that minimizes human effort in toolset construction. Unlike end-to-end methods using large audio-language models, Speech-Copilot builds speech processing-specific toolsets by analyzing pre-collected task instructions and breaking tasks into manageable sub-tasks. It features a flexible agent based on large language models that performs tasks through program generation. Our approach achieves state-of-the-art performance on the Dynamic-SUPERB benchmark, demonstrating its effectiveness across diverse speech-processing tasks. Key contributions include: 1) developing an innovative framework for speech processing-specific toolset construction, 2) establishing a high-performing agent based on large language models, and 3) offering a new perspective on addressing challenging instruction-oriented speech-processing tasks. Without additional training processes required by end-to-end approaches, our method provides a flexible and extendable solution for a wide range of speech-processing applications.
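As a rough illustration of the program-generation idea described above, the sketch below registers a small toolset of modules and lets an "agent" return a short program that composes them. Everything here is an assumption made for illustration: the names `TOOLSET`, `tool`, `speech_recognition`, `count_words`, `generate_program`, and `run` are hypothetical, the modules are stubs rather than real speech models, and the LLM call is replaced by a hard-coded string. It is not the authors' implementation.

```python
from typing import Callable, Dict

# Hypothetical toolset: each entry stands in for one speech-processing module.
# Real modules would wrap actual models; these are stubs.
TOOLSET: Dict[str, Callable] = {}


def tool(fn: Callable) -> Callable:
    """Register a function in the toolset under its own name."""
    TOOLSET[fn.__name__] = fn
    return fn


@tool
def speech_recognition(audio: dict) -> str:
    # Stub: assume the transcript is already attached to the audio record.
    return audio["transcript"]


@tool
def count_words(text: str) -> int:
    return len(text.split())


def generate_program(instruction: str) -> str:
    """Stand-in for the LLM agent: returns Python source that composes tools.

    A real agent would prompt an LLM with the instruction and the tool
    signatures; here the program is hard-coded (an assumption).
    """
    return (
        "def solve(audio):\n"
        "    text = speech_recognition(audio)\n"
        "    return count_words(text)\n"
    )


def run(instruction: str, audio: dict):
    source = generate_program(instruction)
    namespace = dict(TOOLSET)  # the generated code sees the registered tools
    exec(source, namespace)    # defines solve() inside the namespace
    return namespace["solve"](audio)


result = run("How many words are spoken?", {"transcript": "hello there world"})
print(result)
```

In this arrangement, adding a capability means registering another tool; the agent itself is not retrained, which mirrors the training-free, extendable design the abstract describes.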
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Assessment Correlation | RELATE | LCC | 0.405 | 25 |
| Audio Assessment Correlation | PAM | LCC | 0.474 | 23 |
| Audio-Text Alignment Evaluation | BRACE Clotho-Main 1.0 (test) | HH | 76.2 | 20 |
| Text-Audio Alignment | Baton two-sound-event | AUC | 0.59 | 20 |
| Hallucination Detection | BRACE Hallucination 1.0 (test) | AudioCaps Score | 95.4 | 20 |
| Text-Audio Alignment | Baton three-sound-event | AUC | 0.55 | 20 |
| Audio-Text Alignment Evaluation | BRACE AudioCaps-Main 1.0 (test) | HH | 57.6 | 20 |
| Text-Audio Alignment | RELATE-Pair | Pair Accuracy | 68.6 | 20 |
| Text-Audio Alignment | Baton-Pair two-sound-event | Pair Accuracy | 46.1 | 20 |
| Text-Audio Alignment | Baton-Pair three-sound-event | Pair Accuracy | 42 | 20 |