PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation

About

Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, while scaling robustly to resolutions of up to 4K/8K.
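To make the idea of a shared binary codebook used at several scales more concrete, here is a minimal sketch. It is not the authors' LaPQ implementation. It assumes sign-based binary quantization (each channel contributes one bit, so a d-channel vector indexes an implicit codebook of 2**d codes), and it uses average pooling over a 1-D sequence as a stand-in for features taken at different encoder depths. The text-guided part of the quantization is not modeled. The names `binary_quantize`, `average_pool`, and `pyramid_tokens` are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def binary_quantize(vec: Vector) -> int:
    """Map a feature vector to a token id from the sign of each channel.

    Channel i sets bit i when its value is positive, so every level of the
    pyramid shares the same implicit codebook of 2**len(vec) binary codes.
    (Assumed scheme for illustration; not taken from the paper.)
    """
    token = 0
    for i, value in enumerate(vec):
        if value > 0:
            token |= 1 << i
    return token


def average_pool(features: List[Vector], factor: int) -> List[Vector]:
    """Downsample a sequence of feature vectors by averaging groups of `factor`."""
    pooled = []
    for start in range(0, len(features), factor):
        group = features[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(v[c] for v in group) / len(group) for c in range(dim)])
    return pooled


def pyramid_tokens(
    features: List[Vector], factors: Sequence[int] = (1, 2, 4)
) -> Dict[int, List[int]]:
    """Quantize the same features at several resolutions with one shared codebook."""
    return {
        f: [binary_quantize(v) for v in average_pool(features, f)]
        for f in factors
    }


# Toy input: four positions with three channels each (made-up values).
features = [
    [0.5, -1.0, 0.2],
    [0.1, -0.3, -0.4],
    [-0.7, 0.9, 0.6],
    [-0.2, 0.8, -0.1],
]
tokens = pyramid_tokens(features)
print(tokens)
```

In this toy setup the coarser levels produce shorter token sequences from the same codebook, which is one plausible reading of "compact yet expressive" token hierarchies. A real tokenizer would operate on spatiotemporal feature maps and learn the features being quantized.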

Onkar Susladkar, Tushar Prakash, Adheesh Juvekar, Kiet A. Nguyen, Dong-Hwan Jang, Inderjit S Dhillon, Ismini Lourentzou • 2026

Related benchmarks

Task	Dataset	Result
Video Classification	Kinetics-400	--	131
Video Generation	UCF-101 (test)	--	105
Video Compression	MCL-JCV	--	92
Video Classification	Kinetics-600	Top-1 Accuracy: 77.11	90
Video Classification	Kinetics 700	Top-1 Accuracy: 74.08	46
Video Reconstruction	WebVid 10M	PSNR: 35.72	45
General Video Understanding	MVBench Overall	Accuracy: 86.03	39
Temporal Action Localization	THUMOS14 v1.0 (50%-50%)	mAP (Avg): 33.17	34
Temporal Action Localization	ActivityNet 1.3 (50%-50%)	Avg mAP: 29.11	31
Frame Reconstruction	COCO (val)	PSNR: 36.05	12

Showing 10 of 12 rows
