TrajTok: Learning Trajectory Tokens enables better Video Understanding

About

Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight and efficient, yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.
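The contrast the abstract draws between patch tokens and trajectory tokens can be illustrated with a small sketch. This is not the TrajTok segmenter: the abstract describes learned, implicit clustering, whereas the code below assumes hard trajectory assignments are already given. The function names `patch_token_count` and `pool_by_trajectory`, the patch size of 16, and the toy inputs are all hypothetical choices for illustration.

```python
from collections import defaultdict


def patch_token_count(frames, height, width, patch=16):
    """Tokens from per-frame patchification (assumed patch size 16).

    The count grows linearly with the number of frames.
    """
    return frames * (height // patch) * (width // patch)


def pool_by_trajectory(features, trajectory_ids):
    """Average all feature vectors that share a trajectory id.

    features: list of equal-length float lists, one per space-time location.
    trajectory_ids: list of ints of the same length. These hard assignments
        are a stand-in assumption; the paper's segmenter learns its clustering.
    Returns a dict mapping trajectory id -> pooled token (list of floats).
    """
    sums = {}
    counts = defaultdict(int)
    for vec, tid in zip(features, trajectory_ids):
        if tid not in sums:
            sums[tid] = [0.0] * len(vec)
        for i, value in enumerate(vec):
            sums[tid][i] += value
        counts[tid] += 1
    return {tid: [s / counts[tid] for s in total] for tid, total in sums.items()}


# Doubling the clip length doubles the patch token count.
short_clip = patch_token_count(frames=16, height=224, width=224)
long_clip = patch_token_count(frames=32, height=224, width=224)

# Four space-time locations belonging to two trajectories yield two tokens,
# no matter how many frames those locations are spread across.
tokens = pool_by_trajectory(
    features=[[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]],
    trajectory_ids=[0, 0, 1, 1],
)
```

In this toy setting the number of pooled tokens equals the number of distinct trajectories, so it tracks scene content rather than video duration, which is the property the abstract attributes to trajectory-based tokenizers.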

Chenhao Zheng, Jieyu Zhang, Jianing Zhang, Weikai Huang, Ashutosh Kumar, Quan Kong, Oncel Tuzel, Chun-Liang Li, Ranjay Krishna • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Image Retrieval | Flickr30K | -- | 559 |
| Text-to-Video Retrieval | MSR-VTT | -- | 406 |
| Image Classification | CIFAR-100 | -- | 357 |
| Text-to-Video Retrieval | ActivityNet | -- | 245 |
| Video-to-Text Retrieval | MSR-VTT | -- | 221 |
| Text-to-Image Retrieval | COCO | -- | 156 |
| Image-to-Text Retrieval | COCO | -- | 152 |
| Video Action Classification | Something-Something v2 | Top-1 Acc: 48.7 | 145 |
| Video-to-Text Retrieval | ActivityNet | -- | 136 |
| Text-to-Video Retrieval | VATEX | -- | 134 |
Showing 10 of 15 rows
