TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning

About

Audio-visual understanding requires effective alignment between heterogeneous modalities, yet cross-modal correspondence remains challenging when temporally aligned audio and visual signals lack clear semantic correspondence. We propose using text as a semantic anchor for audio-visual representation learning. To this end, we introduce a parameter-efficient adaptation framework built on frozen audio and visual encoders and centered on the Text-Bridged Audio-Visual Adapter (TB-AVA), which enables text-mediated interaction between the audio and visual streams. At the core of TB-AVA, Gated Semantic Modulation (GSM) selectively modulates feature channels based on text-inferred semantic relevance. We evaluate the approach on multiple benchmarks, including AVE, AVS, and AVVP, where it achieves state-of-the-art performance, demonstrating that text is an effective semantic anchor for parameter-efficient fine-tuning (PEFT) in audio-visual learning.
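The abstract does not spell out how GSM computes its gates, so the following is only a minimal sketch of the general idea of text-conditioned channel gating, not the authors' implementation. It assumes a sigmoid gate per channel, computed from a linear projection of a text embedding and multiplied into every audio or visual token. The names `channel_gates`, `modulate`, `weight`, and `bias` are hypothetical and chosen purely for illustration.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def channel_gates(text_embedding, weight, bias):
    """Return one gate in (0, 1) per feature channel.

    Assumption: the gate is a sigmoid over a linear projection of the
    text embedding. `weight` has one row per channel, `bias` one entry
    per channel. The paper may use a different gating function.
    """
    return [
        sigmoid(sum(w * t for w, t in zip(row, text_embedding)) + b)
        for row, b in zip(weight, bias)
    ]


def modulate(features, gates):
    """Scale each channel of every token by its gate."""
    return [[value * g for value, g in zip(token, gates)] for token in features]


# Toy example: a 2-dimensional text embedding gating 3 feature channels.
text_embedding = [1.0, 0.0]
weight = [
    [0.0, 0.0],    # channel 0: neutral with respect to the text
    [10.0, 0.0],   # channel 1: strongly relevant to the text
    [-10.0, 0.0],  # channel 2: strongly irrelevant to the text
]
bias = [0.0, 0.0, 0.0]

gates = channel_gates(text_embedding, weight, bias)
tokens = [[2.0, 2.0, 2.0]]  # one audio or visual token with 3 channels
modulated = modulate(tokens, gates)
```

In this toy setup the neutral channel is halved, the relevant channel passes through almost unchanged, and the irrelevant channel is suppressed toward zero. In a parameter-efficient setting of the kind the abstract describes, only small modules like this projection would be trained while the audio and visual encoders stay frozen.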

Seongah Kim, Dinh Phu Tran, Hyeontaek Hwang, Saad Wazir, Duc Do Minh, Daeyoung Kim · 2026

Related benchmarks

Task | Dataset | Result | Rank
Audio-Visual Video Parsing | LLP (test) | Audio Segment Score 56.4 | 89
Audio-Visual Event Localization | AVE | Accuracy 85 | 52
Audio-Visual Segmentation | AVSBench-object S4 | mIoU 81.2 | 7
Audio-Visual Segmentation | AVSBench-object MS3 | mIoU 53.4 | 7
