ViLL-E: Video LLM Embeddings for Retrieval

About

Video Large Language Models (VideoLLMs) excel at video understanding tasks where outputs are textual, such as Video Question Answering and Video Captioning. However, they underperform specialized embedding-based models in retrieval tasks, such as Text-to-Video Retrieval and Moment Retrieval. We introduce ViLL-E (Video-LLM-Embed), a unified VideoLLM architecture endowed with a novel embedding generation mechanism that allows the model to "think longer" for complex videos and stop early for easy ones. We train this model with a three-stage methodology combining generative and contrastive learning: initial large-scale pre-training on video-caption pairs, followed by continual training on a smaller, detailed-caption dataset, and concluding with task-specific fine-tuning on a novel multi-task dataset covering Video QA, Temporal Localization, Video Retrieval, and Video-Text Matching. Our model significantly improves temporal localization (on average 7% over other VideoLLMs) and video retrieval (up to 4% over dual-encoder models), achieving performance comparable to state-of-the-art specialized embedding models while remaining competitive on VideoQA tasks. Furthermore, our joint contrastive-generative training unlocks new zero-shot capabilities, significantly outperforming state-of-the-art methods in composed video retrieval (+5% over SotA) and retrieval from long text (+2% over SotA).
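To make the idea of joint contrastive-generative training concrete, here is a minimal sketch of one way such an objective could be combined. The abstract does not state the actual loss, so everything here is an assumption for illustration: the symmetric InfoNCE form, the `temperature` and `weight` parameters, and the function names `info_nce` and `joint_loss` are hypothetical and not taken from the paper.

```python
import math
from typing import Sequence


def info_nce(sim: Sequence[Sequence[float]], temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a square video-text similarity matrix.

    Assumption: sim[i][j] scores video i against text j, and matched
    pairs lie on the diagonal.
    """
    n = len(sim)

    def one_direction(rows: Sequence[Sequence[float]]) -> float:
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            # Numerically stable log-sum-exp
            m = max(logits)
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            # Cross-entropy with the matched pair as the target
            total += log_z - logits[i]
        return total / n

    columns = [list(col) for col in zip(*sim)]
    return 0.5 * (one_direction(sim) + one_direction(columns))


def joint_loss(gen_nll: float, sim: Sequence[Sequence[float]],
               weight: float = 1.0, temperature: float = 0.07) -> float:
    """Hypothetical combination: generative NLL plus weighted contrastive loss."""
    return gen_nll + weight * info_nce(sim, temperature)


aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
print(joint_loss(2.0, aligned) < joint_loss(2.0, shuffled))
```

In this sketch the contrastive term is small when matched video-text pairs score highest and large when they do not, while the generative term (passed in as a precomputed negative log-likelihood) is left untouched.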

Rohit Gupta, Jayakrishnan Unnikrishnan, Fan Fei, Sheng Liu, Son Tran, Mubarak Shah • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | MSRVTT-QA | Accuracy | 65.2 | 491 |
| Video Question Answering | MSVD-QA | Accuracy | 75.2 | 360 |
| Video Question Answering | VideoMME | Accuracy | 45 | 210 |
| Video Question Answering | MVBench | Accuracy | 64.7 | 42 |
| Video Question Answering | Video-ChatGPT | -- | -- | 28 |
| Video Retrieval | AuroraCap-VideoDetailCaption Long Caption 1.0 (test) | R@1 | 75.7 | 8 |
| Video Retrieval | AuroraCap-VideoDetailCaption Short Caption 1.0 (test) | R@1 | 65.5 | 8 |
| Composed Video Retrieval | MSR-VTT | Recall@1 | 53.13 | 3 |
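
The retrieval rows report R@1 (Recall@1). As a reminder of what that metric measures, here is a minimal sketch of Recall@K, assuming a similarity matrix in which the correct candidate for query `i` sits at index `i`. That layout, the function name `recall_at_k`, and the toy scores are illustrative assumptions and do not come from the benchmark's evaluation code.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of queries whose ground-truth candidate ranks in the top k.

    Assumption: similarity[i][j] scores query i against candidate j, and
    the ground-truth candidate for query i has index i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by descending score
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores: query 1 ranks its true match second, the others rank it first.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(sim, k=1))
print(recall_at_k(sim, k=2))
```

On the toy matrix, two of the three queries retrieve their match at rank 1 and all three retrieve it within the top 2. Benchmarks usually report this fraction as a percentage.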