Inconsistency-aware Multimodal Schrödinger Bridge for Deepfake Localization

About

Audio-visual deepfake localization demands interval-level outputs that serve as temporal evidence. Despite recent progress, symmetric fusion under single-sided or asynchronous forgeries propagates cross-modal noise, degrading high-precision localization. We present IaMSB, an inconsistency-aware multimodal Schrödinger Bridge (SB) that jointly estimates cross-modal consistency and performs interval-level localization. Unlike diffusion models, SB minimizes path-distribution discrepancy and yields consistency scores without explicit noise injection or denoising. Built on SB, IaMSB unifies consistency estimation, cross-modal information selection, and bridge-step scheduling in one framework. Specifically, a lightweight coarse bridge first proposes candidate intervals and estimates cross-modal consistency; these statistics select cross-modal witness signals and allocate bridge steps asymmetrically across modalities. A refinement bridge then performs step-tuned fusion and outputs refined, time-aligned intervals. IaMSB anticipates single-sided and asynchronous forgeries and, using bottlenecked cross-modal interaction with step allocation, suppresses noise transfer and avoids unnecessary iterations. Across benchmarks, IaMSB stabilizes strict-IoU boundary precision, raising AP@0.95 by 3% to 10%, and improves high-precision localization, particularly for single-sided forgeries.
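
To illustrate why a strict threshold such as AP@0.95 stresses boundary precision, the sketch below computes the standard temporal IoU between a predicted and a ground-truth interval. It is not taken from IaMSB; the function name `temporal_iou` and the example intervals are hypothetical and chosen only for illustration.

```python
def temporal_iou(pred, truth):
    """Temporal IoU of two intervals given as (start, end) in seconds."""
    # Overlap length, clamped at zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    # Union length = sum of both lengths minus the overlap.
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical 1-second forged segment and two candidate predictions.
truth = (2.00, 3.00)
loose = (2.10, 3.10)  # both boundaries shifted by 0.10 s
tight = (2.01, 3.01)  # both boundaries shifted by 0.01 s

for name, pred in [("loose", loose), ("tight", tight)]:
    iou = temporal_iou(pred, truth)
    # Report whether the prediction counts as a hit at each threshold.
    print(name, round(iou, 3), "hit@0.5:", iou >= 0.5, "hit@0.95:", iou >= 0.95)
```

In this example the 0.10 s shift gives an IoU of about 0.82, which counts as a hit at 0.5 but a miss at 0.95, while the 0.01 s shift gives about 0.98 and passes both. At 0.95, a one-second segment tolerates only a few hundredths of a second of boundary error.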

Jiayu Xiong, Jing Wang, Qi Zhang, Wanlong Wang, Jun Xue • 2026

Related benchmarks

Task | Dataset | Result | Rank
Temporal Forgery Localization | AV-Deepfake1M (test) | mAP@0.5: 90.31 | 22
Temporal Deepfake Localization | LAV-DF | AP@0.5: 99.33 | 10
Temporal Deepfake Localization | TVIL | AP@0.5: 96.89 | 5