Unified Multimodal Visual Tracking with Dual Mixture-of-Experts

About

Multimodal visual object tracking can be divided into several kinds of tasks (e.g., RGB and RGB+X tracking), based on the input modality. Existing methods often train separate models for each modality or rely on pretrained models to adapt to new modalities, which limits efficiency, scalability, and usability. Thus, we introduce OneTrackerV2, a unified multi-modal tracking framework that enables end-to-end training for any modality. We propose Meta Merger to embed multi-modal information into a unified space, allowing flexible modality fusion and robustness. We further introduce Dual Mixture-of-Experts (DMoE): T-MoE models spatio-temporal relations for tracking, while M-MoE embeds multi-modal knowledge, disentangling cross-modal dependencies and reducing feature conflicts. With a shared architecture, unified parameters, and a single end-to-end training stage, OneTrackerV2 achieves state-of-the-art performance across five RGB and RGB+X tracking tasks and 12 benchmarks, while maintaining high inference efficiency. Notably, even after model compression, OneTrackerV2 retains strong performance. Moreover, OneTrackerV2 demonstrates remarkable robustness under modality-missing scenarios.
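The abstract does not say how the experts in DMoE are selected or combined. As a rough illustration of the general mixture-of-experts idea only, the sketch below routes one token vector to its top-k experts and mixes their outputs by renormalized gate weights. Everything in it (the `ToyMoE` class, the linear experts, the softmax top-k router, the sizes) is a hypothetical stand-in and should not be read as the T-MoE or M-MoE design.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToyMoE:
    """Generic top-k mixture-of-experts for one token (assumed design, not the paper's)."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: produces one score per expert for an input token.
        self.router = random_matrix(num_experts, dim, rng)
        # Each expert is a single linear map here; real experts are usually MLPs.
        self.experts = [random_matrix(dim, dim, rng) for _ in range(num_experts)]

    def route(self, token):
        """Return (expert_index, weight) pairs for the top-k experts."""
        probs = softmax(matvec(self.router, token))
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[: self.top_k]
        norm = sum(probs[i] for i in chosen)
        # Renormalize so the selected gate weights sum to 1.
        return [(i, probs[i] / norm) for i in chosen]

    def forward(self, token):
        """Weighted sum of the selected experts' outputs."""
        out = [0.0] * len(token)
        for idx, weight in self.route(token):
            expert_out = matvec(self.experts[idx], token)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out


moe = ToyMoE(dim=4, num_experts=3, top_k=2)
token = [0.2, -0.1, 0.4, 0.3]
routing = moe.route(token)
output = moe.forward(token)
```

Only `top_k` of the experts run per token, which is the usual reason such layers add capacity without a proportional increase in inference cost.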

Lingyi Hong, Jinglun Li, Xinyu Zhou, Kaixun Jiang, Pinxue Guo, Zhaoyu Chen, Runze Li, Xingdong Sheng, Wenqiang Zhang • 2026

Related benchmarks

Task | Dataset | Result | Rank
Object Tracking | LaSoT | AUC 76.1 | 498
Object Tracking | TrackingNet | Precision (P) 89 | 327
Visual Object Tracking | GOT-10k | AO 81.3 | 306
Visual Object Tracking | TNL2K | AUC 69.5 | 169
RGBT Tracking | RGBT234 | -- | 112
Visual Object Tracking | DepthTrack | Recall 0.684 | 106
Object Tracking | VisEvent | AUC 65.9 | 61
Visual Tracking | UAV123 | AUC 71.1 | 56
RGB-E Tracking | VisEvent | -- | 46
Tracking | OTB99 | AUC 0.732 | 45
Showing 10 of 20 rows
