
UniWhisper: Efficient Continual Multi-task Training for Robust Universal Audio Representation

About

A universal audio representation should capture both fine-grained speech cues and the high-level semantics of environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in the others. We propose UniWhisper, an efficient continual multi-task training framework that casts heterogeneous audio tasks into a unified instruction-and-answer format, enabling standard next-token training without task-specific heads or losses. We train it on 38k hours of public audio and assess the encoder using shallow MLP probes and k-nearest-neighbors (kNN) classification on 20 tasks spanning speech, environmental sound, and music. UniWhisper reaches normalized weighted averages of 0.81 with MLP probes and 0.61 with kNN, compared to 0.64 and 0.46 for Whisper, while retaining strong speech performance.
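The kNN evaluation mentioned above amounts to classifying each clip by majority vote among its nearest neighbors in the frozen encoder's embedding space. A minimal sketch, assuming hypothetical 2-D embeddings in place of real UniWhisper features (the `knn_predict` helper, vectors, and labels are all illustrative, not from the paper):

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training embeddings."""
    # Sort (distance, label) pairs by Euclidean distance to the query.
    dists = sorted((math.dist(vec, query), lab) for vec, lab in zip(train, labels))
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]

# Toy stand-ins for frozen-encoder embeddings of labeled audio clips.
train = [(0.1, 0.2), (0.0, 0.3), (0.9, 0.8), (1.0, 0.7)]
labels = ["speech", "speech", "music", "music"]

print(knn_predict(train, labels, query=(0.95, 0.75)))  # → music
```

Because the encoder stays frozen and only distances in embedding space are used, this probe measures the representation itself rather than any task-specific fine-tuning.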

Yuxuan Chen, Peize He, Haoyuan Yu, Junzi Zhang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Musical Instrument Classification | NSynth | Accuracy | 70.7 | 106 |
| Speech Emotion Recognition | RAVDESS | -- | -- | 43 |
| Music Genre Classification | GTZAN | Accuracy | 94.5 | 39 |
| Speaker Identification | LibriSpeech MF | Score | 98.1 | 26 |
| Language Identification | VoxLingua33 | Accuracy | 89.5 | 26 |
| Speaker Counting | Libricount | Score | 64.4 | 26 |
| Acoustic Event Classification | VocalSound | Normalized Score | 93 | 20 |
| Automatic Speaker Verification | ASV 2015 | Normalized Score | 99 | 20 |
| Music Genre Classification | FMA (Free Music Archive) | Normalized Score | 68.9 | 20 |
| Speaker Identification | VoxCeleb 1 | Normalized Score | 45.5 | 20 |

Showing 10 of 17 rows
