The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation
About
The Source-Free Video Unsupervised Domain Adaptation (SFVUDA) task consists of adapting an action recognition model, trained on a labelled source dataset, to an unlabelled target dataset, without accessing the actual source data. Previous approaches have attempted to address SFVUDA by leveraging self-supervision (e.g., enforcing temporal consistency) derived from the target data itself. In this work, we take an orthogonal approach by exploiting "web-supervision" from Large Language-Vision Models (LLVMs), driven by the rationale that LLVMs contain a rich world prior that is surprisingly robust to domain shift. We showcase the unreasonable effectiveness of integrating LLVMs for SFVUDA by devising an intuitive and parameter-efficient method, which we name Domain Adaptation with Large Language-Vision models (DALL-V), that distills the world prior and complementary source model information into a student network tailored for the target. Despite its simplicity, DALL-V achieves significant improvements over state-of-the-art SFVUDA methods.
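To make the distillation idea concrete, below is a minimal sketch, not the authors' implementation, of ensembling zero-shot LLVM predictions with source model predictions into soft pseudo-labels for a student network. It assumes CLIP as the LLVM; `source_model`, `student_logits`, the prompt template, and the mixing weight `alpha` are illustrative placeholders rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder action labels; the real benchmarks use the target dataset's classes.
class_names = ["walking", "running", "jumping"]
text_tokens = clip.tokenize(
    [f"a video of a person {c}" for c in class_names]
).to(device)

@torch.no_grad()
def clip_video_logits(frames):
    """Zero-shot CLIP scores for one clip.

    frames: (T, 3, 224, 224) tensor of CLIP-preprocessed frames.
    Frame-level image/text similarities are averaged over time as a
    simple video-level score.
    """
    image_feat = F.normalize(clip_model.encode_image(frames.to(device)).float(), dim=-1)
    text_feat = F.normalize(clip_model.encode_text(text_tokens).float(), dim=-1)
    return (image_feat @ text_feat.t()).mean(dim=0)  # (num_classes,)

def distillation_loss(student_logits, frames, source_model, alpha=0.5, tau=1.0):
    """Distill an ensemble of CLIP and source-model predictions into the student.

    The soft pseudo-label is a convex combination (weight `alpha`, an
    assumed hyperparameter) of the two teachers' softened predictions.
    """
    with torch.no_grad():
        p_clip = F.softmax(clip_video_logits(frames) / tau, dim=-1)
        p_src = F.softmax(source_model(frames.unsqueeze(0)).squeeze(0) / tau, dim=-1)
        pseudo = alpha * p_clip + (1.0 - alpha) * p_src
    # Cross-entropy against soft targets (supported in PyTorch >= 1.10).
    return F.cross_entropy(student_logits.unsqueeze(0), pseudo.unsqueeze(0))
```

Keeping both teachers in the ensemble reflects the paper's rationale: the LLVM prior is robust to domain shift, while the source model carries complementary task-specific information.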
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | UCF-HMDB | Accuracy (UCF -> HMDB) | 91.4 | 46 |
| Action Recognition | Sports-DA (test) | Accuracy (U -> S) | 0.883 | 13 |
| Action Recognition | Daily-DA | H->A Transition Rate | 17.5 | 13 |