
LightAVSeg: Lightweight Audio-Visual Segmentation

About

Audio-Visual Segmentation (AVS) targets pixel-level localization of sound-emitting objects in videos. However, existing models rely on dense cross-modal attention with quadratic computational cost, which limits their suitability for resource-efficient deployment. Most efficiency-oriented methods focus on backbone reduction and overlook the interaction module as the primary bottleneck. This paper proposes LightAVSeg, a lightweight framework that replaces heavy attention with a decoupled design for semantic filtering and spatial grounding, resulting in interaction costs that scale linearly with spatial resolution. Furthermore, we introduce an auxiliary alignment loss that enforces semantic consistency during training with zero inference overhead. Extensive experiments demonstrate that LightAVSeg achieves a new state of the art among lightweight methods: with 20.5M parameters (~1/7 of AVSegFormer), it reaches 50.4 mIoU on the MS3 benchmark and enables efficient inference on a mobile processor.
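To make the quadratic-versus-linear claim concrete, here is a rough cost model in Python. It is only a sketch under stated assumptions: the function names, the 256-channel width, and the multiply-accumulate (MAC) formulas are illustrative and are not taken from the paper, whose actual modules are not described in this abstract. The dense case assumes attention over N = H x W tokens; the linear case assumes each pixel interacts once with a single audio vector.

```python
def dense_attention_macs(height, width, channels):
    """Approximate MACs for one dense attention layer over N = H*W tokens.

    Counts the N x N score matrix (Q K^T) and the N x N weighted sum,
    each costing N * N * channels. Projections are ignored (assumption).
    """
    n = height * width
    return 2 * n * n * channels


def linear_interaction_macs(height, width, channels):
    """Approximate MACs for a hypothetical per-pixel interaction.

    Assumes one channel-gating multiply and one dot product with a single
    audio vector per pixel, so the cost is proportional to N * channels.
    """
    n = height * width
    return 2 * n * channels


if __name__ == "__main__":
    channels = 256  # hypothetical feature width
    for side in (14, 28, 56):
        dense = dense_attention_macs(side, side, channels)
        linear = linear_interaction_macs(side, side, channels)
        print(f"{side}x{side}: dense={dense:,} linear={linear:,} "
              f"ratio={dense // linear}")
```

Under this model, doubling the feature-map side length multiplies the dense cost by 16 but the linear cost by only 4, which is the gap the abstract attributes to the interaction module.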

Qing Zhong, Guodong Ding, Lingqiao Liu, Zaiwen Feng, Lin Yuanbo Wu, Angela Yao • 2026

Related benchmarks

Task                                  Dataset         Result           Rank
Audio-Visual Segmentation             AVSBench MS3    MJ Score 50.4    7
Audio-Visual Semantic Segmentation    AVSS            MJ Score 30.6    7
Audio-Visual Segmentation             AVSBench S4     MJ Score 75.6    7
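The MJ Score reported above corresponds to the mean Jaccard index (the mIoU quoted in the abstract). As a minimal sketch of how such a score can be computed, the Python below averages the intersection-over-union of binary masks across frames. The function names, the nested-list mask format, and the convention of scoring two empty masks as 1.0 are assumptions for illustration; the benchmark's official evaluation code may differ in details.

```python
def jaccard(pred, target):
    """Intersection-over-union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_jaccard(preds, targets):
    """Mean IoU over a sequence of (prediction, ground truth) mask pairs, in percent."""
    scores = [jaccard(p, t) for p, t in zip(preds, targets)]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
    targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
    print(mean_jaccard(preds, targets))  # 75.0 (frame scores 0.5 and 1.0)
```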
