Ensemble-Guided Distillation for Compact and Robust Acoustic Scene Classification on Edge Devices

About

We present a compact, quantization-ready acoustic scene classification (ASC) framework that couples an efficient student network with a learned teacher ensemble and knowledge distillation. The student backbone uses stacked depthwise-separable "expand-depthwise-project" blocks with global response normalization to stabilize training and improve robustness to device and noise variability, while a global pooling head yields class logits for efficient edge inference. To inject richer inductive bias, we assemble a diverse set of teacher models and learn two complementary fusion heads: z1, which predicts per-teacher mixture weights using a student-style backbone, and z2, a lightweight MLP that performs per-class logit fusion. The student is distilled from the ensemble via temperature-scaled soft targets combined with hard labels, enabling it to approximate the ensemble's decision geometry with a single compact model. Evaluated on the TAU Urban Acoustic Scenes 2022 Mobile benchmark, our approach achieves state-of-the-art (SOTA) results under matched edge-deployment constraints, demonstrating strong performance and practicality for mobile ASC.
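
The abstract does not give the exact loss or fusion equations, so the following is only a minimal sketch of the two ideas it names: fusing teacher logits with per-teacher mixture weights (in the spirit of the z1 head) and distilling with temperature-scaled soft targets combined with hard labels. It assumes the common Hinton-style formulation, alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy. The function names, the softmax over mixture scores, the T^2 scaling, and the default values of `temperature` and `alpha` are all illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, with optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_teachers(teacher_logits, mixture_scores):
    """Fuse per-teacher logits with mixture weights (assumed: softmax of raw scores).

    teacher_logits: list of per-teacher logit lists, all of equal length.
    mixture_scores: one raw score per teacher, e.g. the output of a fusion head.
    """
    weights = softmax(mixture_scores)
    num_classes = len(teacher_logits[0])
    return [
        sum(w * logits[c] for w, logits in zip(weights, teacher_logits))
        for c in range(num_classes)
    ]


def distillation_loss(student_logits, ensemble_logits, label,
                      temperature=2.0, alpha=0.5):
    """Soft-target KL term plus hard-label cross-entropy (assumed Hinton-style mix)."""
    p_teacher = softmax(ensemble_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on temperature-softened distributions.
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)
    # Standard cross-entropy against the ground-truth class at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


# Example: two hypothetical teachers, three scene classes, true class 0.
ensemble = fuse_teachers([[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]], [0.3, -0.3])
loss = distillation_loss([1.0, 0.2, -0.4], ensemble, label=0)
```

A per-class fusion head like z2 would replace the single weight per teacher with a learned mapping over the concatenated teacher logits; that variant is not sketched here because the abstract gives no detail beyond "lightweight MLP".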

Hossein Sharify, Behnam Raoufi, Mahdy Ramezani, Khosrow Hajsadeghi, Saeed Bagheri Shouraki • 2025

Related benchmarks

Task	Dataset	Result	Rank
Acoustic Scene Classification	TAU Urban Acoustic Scenes Mobile 2022 (dev)	Accuracy 60.6	5
Acoustic Scene Classification	TAU-UAS Mobile 2022 (25% split)	Accuracy 59.9	5
