
Active Speakers in Context

About

Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker. Although this strategy can be enough for addressing single-speaker scenarios, it prevents accurate detection when the task is to identify which of many candidate speakers is talking. This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons. Our Active Speaker Context is designed to learn pairwise and temporal relations from a structured ensemble of audio-visual observations. Our experiments show that a structured feature ensemble already benefits active speaker detection performance. Moreover, we find that the proposed Active Speaker Context improves the state of the art on the AVA-ActiveSpeaker dataset, achieving a mAP of 87.1%. We present ablation studies that verify that this result is a direct consequence of our long-term multi-speaker analysis.
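The abstract's core idea can be sketched in a few lines: pair short-term features of a reference speaker with those of every other candidate at each timestep (pairwise relations), then pool over a long horizon (temporal relations). The function below is a minimal, illustrative sketch of that structure only; the names, the dot-product relation, and the mean pooling are assumptions for clarity, not the paper's actual architecture.

```python
def asc_score(features, ref, horizon):
    """Toy sketch of an Active Speaker Context-style score.

    features[s][t] is the short-term audio-visual feature vector
    (a list of floats) for candidate speaker s at timestep t.
    Pairwise relation: dot product between the reference speaker and
    each other candidate at every timestep (illustrative choice).
    Temporal relation: average over the long horizon of timesteps.
    """
    total, count = 0.0, 0
    for t in range(horizon):
        ref_feat = features[ref][t]
        for s, track in enumerate(features):
            if s == ref:
                continue  # only pair the reference with other candidates
            total += sum(a * b for a, b in zip(ref_feat, track[t]))
            count += 1
    return total / count
```

In the paper this role is played by learned pairwise and temporal modules over a structured feature ensemble; the sketch only shows where the two kinds of relations enter the computation.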

Juan Leon Alcazar, Fabian Caba Heilbron, Long Mai, Federico Perazzi, Joon-Young Lee, Pablo Arbelaez, Bernard Ghanem • 2020

Related benchmarks

Task | Dataset | Result | Rank
Active Speaker Detection | AVA-ActiveSpeaker (val) | mAP 87.1 | 107
Active Speaker Detection | AVA-ActiveSpeaker v1.0 (val) | mAP 87.1 | 27
Active Speaker Detection | AVA-ActiveSpeaker (test) | mAP 86.7 | 22
Active Speaker Detection | AVA-ActiveSpeaker v1.0 (test) | mAP 86.7 | 13
Active Speaker Detection | UniTalk (test) | Overall mAP 61.4 | 10
Active Speaker Detection | AVA-ActiveSpeaker ActivityNet Challenge 2019 (test) | mAP 86.7 | 9
Active Speaker Detection | WASD (test) | mAP (OC) 91.2 | 9
Active Speaker Detection | AVA-ActiveSpeaker Internal In-Domain (test) | mAP 83.6 | 7
Active Speaker Detection | WASD External/Out-of-Domain (test) | mAP 74.6 | 7
Active Speaker Detection | Talkies 1.0 (test) | mAP 77.4 | 4

Other info

Code
