PACE: Pretrained Audio Continual Learning
About
Audio is a fundamental modality for analyzing speech, music, and environmental sounds. Although pretrained audio models have significantly advanced audio understanding, they remain fragile in real-world settings where data distributions shift over time. In this work, we present the first systematic benchmark for audio continual learning (CL) with pretrained models (PTMs), together with a comprehensive analysis of its unique challenges. Unlike in vision, where parameter-efficient fine-tuning (PEFT) has proven effective for CL, directly transferring such strategies to audio leads to poor performance. This stems from a fundamental property of audio backbones: they focus on low-level spectral details rather than structured semantics, causing severe upstream-downstream misalignment. Through extensive empirical study, we identify analytic classifiers with first-session adaptation (FSA) as a promising direction, but also reveal two major limitations: representation saturation in coarse-grained scenarios and representation drift in fine-grained scenarios. To address these challenges, we propose PACE, a novel method that enhances FSA via a regularized analytic classifier and enables multi-session adaptation through adaptive subspace-orthogonal PEFT for improved semantic alignment. In addition, we introduce spectrogram-based boundary-aware perturbations to mitigate representation overlap and improve stability. Experiments on six diverse audio CL benchmarks demonstrate that PACE substantially outperforms state-of-the-art baselines, marking an important step toward robust and scalable audio continual learning with PTMs.
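To make the idea of an analytic classifier over frozen features concrete, the sketch below shows a generic ridge-regularized closed-form classifier that is updated session by session by accumulating feature statistics. It is an illustration only and not the PACE implementation: the class name `RidgeAnalyticClassifier`, the `ridge` parameter, and the synthetic features standing in for backbone embeddings are all assumptions made for this example.

```python
import numpy as np


class RidgeAnalyticClassifier:
    """Hypothetical closed-form ridge classifier over frozen features.

    Sufficient statistics (X^T X and X^T Y) are accumulated across sessions,
    so no past samples need to be stored or replayed.
    """

    def __init__(self, feat_dim, num_classes, ridge=1.0):
        self.ridge = ridge
        self.gram = np.zeros((feat_dim, feat_dim))       # running sum of X^T X
        self.cross = np.zeros((feat_dim, num_classes))   # running sum of X^T Y
        self.weights = np.zeros((feat_dim, num_classes))

    def fit_session(self, feats, labels):
        # One-hot targets for the classes seen in this session.
        onehot = np.eye(self.cross.shape[1])[labels]
        self.gram += feats.T @ feats
        self.cross += feats.T @ onehot
        # Solve (X^T X + ridge * I) W = X^T Y over all sessions seen so far.
        reg = self.ridge * np.eye(self.gram.shape[0])
        self.weights = np.linalg.solve(self.gram + reg, self.cross)

    def predict(self, feats):
        return (feats @ self.weights).argmax(axis=1)


# Synthetic stand-in for frozen backbone features: 4 classes in 8 dimensions.
rng = np.random.default_rng(0)
class_means = 5.0 * np.eye(4, 8)


def make_session(classes, per_class=50):
    labels = np.repeat(classes, per_class)
    feats = class_means[labels] + 0.1 * rng.standard_normal((len(labels), 8))
    return feats, labels


feats_a, labels_a = make_session([0, 1])   # session 1: classes 0 and 1
feats_b, labels_b = make_session([2, 3])   # session 2: classes 2 and 3

clf = RidgeAnalyticClassifier(feat_dim=8, num_classes=4, ridge=1.0)
clf.fit_session(feats_a, labels_a)
clf.fit_session(feats_b, labels_b)

# Reference solution: ridge regression fitted jointly on both sessions.
all_feats = np.vstack([feats_a, feats_b])
all_labels = np.concatenate([labels_a, labels_b])
joint_weights = np.linalg.solve(
    all_feats.T @ all_feats + np.eye(8),
    all_feats.T @ np.eye(4)[all_labels],
)
accuracy = (clf.predict(all_feats) == all_labels).mean()
```

Because the accumulated statistics are plain sums, the session-by-session solution matches the jointly fitted ridge solution up to floating-point error, which is what makes this family of classifiers attractive when the backbone is kept fixed after the first session.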
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Classification | ESC-50 (test) | Accuracy | 95.75 | 84 |
| Audio Classification | US8K (test) | R@1 Accuracy | 0.9749 | 41 |
| Audio Classification | Speech Commands V2 (test) | Accuracy | 91.87 | 35 |
| Audio Classification | TIMIT-2 (test) | Top-1 Accuracy | 90.95 | 18 |
| Audio Classification | TIMIT 3 (test) | Average Top-1 Acc | 94.05 | 18 |
| Audio Classification | VocalSet (test) | Top-1 Accuracy | 69.08 | 18 |
| Continual Learning | GTZAN (5-session split) | Accuracy | 78 | 4 |
| Continual Learning | ESC–Speech synthetic (ESC-50 + SpeechCommands V2) (10-session split) | Accuracy | 72.17 | 4 |