HAVT-IVD: Heterogeneity-Aware Cross-Modal Network for Audio-Visual Surveillance: Idling Vehicles Detection With Multichannel Audio and Multiscale Visual Cues

About

Idling vehicle detection (IVD) uses surveillance video and multichannel audio to localize and classify vehicles in the last frame as moving, idling, or engine-off in pick-up zones. IVD faces three challenges: (i) modality heterogeneity between visual cues and audio patterns; (ii) large box scale variation requiring multi-resolution detection; and (iii) training instability due to coupled detection heads. The previous end-to-end (E2E) model with simple CBAM-based bi-modal attention fails to handle these issues and often misses vehicles. We propose HAVT-IVD, a heterogeneity-aware network with a visual feature pyramid and decoupled heads. Experiments show HAVT-IVD improves mAP by 7.66 over the disjoint baseline and 9.42 over the E2E baseline.

Xiwen Li, Xiaoya Tang, Tolga Tasdizen • 2025

Related benchmarks

Task | Dataset | Result | Rank
Snow Removal | CityScape + Snow100K | PSNR (dB): 19.8179 | 10
