ViLL-E: Video LLM Embeddings for Retrieval

About

Video Large Language Models (VideoLLMs) excel at video understanding tasks where outputs are textual, such as Video Question Answering and Video Captioning. However, they underperform specialized embedding-based models in retrieval tasks, such as Text-to-Video Retrieval and Moment Retrieval. We introduce ViLL-E (Video-LLM-Embed), a unified VideoLLM architecture endowed with a novel embedding generation mechanism that allows the model to "think longer" for complex videos and stop early for easy ones. We train this model with a three-stage methodology combining generative and contrastive learning: initial large-scale pre-training with video-caption pairs, followed by continual training on a smaller, detailed-caption dataset, and concluding with task-specific fine-tuning on a novel multi-task dataset covering Video QA, Temporal Localization, Video Retrieval, and Video-Text Matching. Our model significantly improves temporal localization (by 7% on average over other VideoLLMs) and video retrieval (by up to 4% over dual-encoder models), achieving performance comparable to state-of-the-art specialized embedding models while remaining competitive on VideoQA tasks. Furthermore, our joint contrastive-generative training unlocks new zero-shot capabilities, significantly outperforming state-of-the-art methods in composed video retrieval (+5% over SotA) and retrieval from long text (+2% over SotA).
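The abstract does not spell out the training objective, so the following is only a minimal sketch of what "combining generative and contrastive learning" could look like. It assumes a symmetric InfoNCE loss over a batch of video-caption similarities, added to a generative negative log-likelihood with a weighting factor. The function names `info_nce` and `joint_loss`, the temperature of 0.07, and the `weight` parameter are hypothetical and are not taken from the paper.

```python
import math


def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity between video i and caption j.
    Matched pairs are assumed to sit on the diagonal.
    The temperature value is an illustrative assumption.
    """
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # cross-entropy with target index i
        return total / n

    cols = [list(c) for c in zip(*sim)]  # caption-to-video direction
    return 0.5 * (one_direction(sim) + one_direction(cols))


def joint_loss(generative_nll, sim, weight=1.0):
    """Hypothetical joint objective: generative NLL plus weighted contrastive loss."""
    return generative_nll + weight * info_nce(sim)
```

With an all-zero similarity matrix every candidate is equally likely, so the contrastive term reduces to the log of the batch size; a matrix with large diagonal entries drives it toward zero.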

Rohit Gupta, Jayakrishnan Unnikrishnan, Fan Fei, Sheng Liu, Son Tran, Mubarak Shah • 2026

Related benchmarks

Task	Dataset	Result
Video Question Answering	MSRVTT-QA	Accuracy 65.2	513
Video Question Answering	MSVD-QA	Accuracy 75.2	401
Video Question Answering	VideoMME	Accuracy 45	254
Video Question Answering	MVBench	Accuracy 64.7	72
Video Question Answering	Video-ChatGPT	Average Score 3.7	54
Video Retrieval	AuroraCap-VideoDetailCaption Long Caption 1.0 (test)	R@1 75.7	8
Video Retrieval	AuroraCap-VideoDetailCaption Short Caption 1.0 (test)	R@1 65.5	8
Composed Video Retrieval	MSR-VTT	Recall@1 53.13	3
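The retrieval rows above report R@1 (Recall@1). As a rough illustration of how such a number is typically computed, here is a minimal sketch assuming one correct video per text query, with the correct match on the diagonal of a query-by-candidate score matrix. The function name `recall_at_k` and the toy scores are hypothetical; the benchmarks' official evaluation scripts may differ in details such as tie handling.

```python
def recall_at_k(sim, k=1):
    """Percentage of queries whose correct candidate is ranked in the top k.

    sim[i][j] is the score of candidate video j for text query i.
    The correct match for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Candidate indices sorted by descending score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy example: three text queries scored against three videos.
scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(scores, k=1)
r2 = recall_at_k(scores, k=2)
```

In the toy matrix, the second query ranks the wrong video first, so it counts as a miss at k=1 but as a hit at k=2.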

