PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
About
Discrete video VAEs underpin modern text-to-video generation and video understanding systems, yet existing tokenizers typically learn visual codebooks at a single scale with limited vocabularies and shallow language supervision, leading to poor cross-modal alignment and zero-shot transfer. We introduce PyraTok, a language-aligned pyramidal tokenizer that learns semantically structured discrete latents across multiple spatiotemporal resolutions. PyraTok builds on a pretrained video VAE and a novel Language-aligned Pyramidal Quantization (LaPQ) module that discretizes encoder features at several depths using a shared large binary codebook, yielding compact yet expressive video token sequences. To tightly couple visual tokens with language, PyraTok jointly optimizes multi-scale text-guided quantization and a global autoregressive objective over the token hierarchy. Across ten benchmarks, PyraTok delivers state-of-the-art (SOTA) video reconstruction, consistently improves text-to-video quality, and sets new SOTA zero-shot performance on video segmentation, temporal action localization, and video understanding, while scaling robustly up to 4K/8K resolutions.
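To make the idea of quantizing several feature depths against one shared binary code space more concrete, here is a minimal sketch. It assumes a sign-based binary quantizer, where each channel becomes one bit and the bit pattern indexes an implicit codebook of size 2^d. That choice is an assumption for illustration and may not match how LaPQ works; the text guidance and the autoregressive objective are omitted entirely. The names `binary_quantize`, `bits_to_token`, and `quantize_pyramid` are hypothetical.

```python
from typing import List, Sequence


def binary_quantize(vec: Sequence[float]) -> List[int]:
    """Map each channel to a bit: 1 if the value is positive, else 0."""
    return [1 if v > 0 else 0 for v in vec]


def bits_to_token(bits: Sequence[int]) -> int:
    """Read the bit pattern (most significant bit first) as an integer index
    into an implicit codebook of size 2**len(bits)."""
    token = 0
    for b in bits:
        token = (token << 1) | b
    return token


def quantize_pyramid(levels: List[List[Sequence[float]]]) -> List[List[int]]:
    """Quantize every feature vector at every pyramid level.

    All levels use the same binary code space (assumed here), so a token id
    means the same thing regardless of the level that produced it.
    """
    return [[bits_to_token(binary_quantize(v)) for v in level] for level in levels]


# Toy pyramid: a coarse level with one vector and a finer level with two,
# each vector having 4 channels (so the implicit codebook has 16 entries).
levels = [
    [[0.3, -1.2, 0.8, -0.1]],
    [[-0.5, 0.4, 0.9, 0.2], [1.0, 1.0, -1.0, -1.0]],
]
tokens = quantize_pyramid(levels)
print(tokens)  # [[10], [7, 12]]
```

In this toy setup the vocabulary grows exponentially with the channel count without storing an explicit embedding table, which is one plausible reading of how a "large binary codebook" can stay compact.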
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Classification | Kinetics-400 | -- | 131 |
| Video Generation | UCF-101 (test) | -- | 105 |
| Video Classification | Kinetics-600 | Top-1 Accuracy: 77.11 | 84 |
| Video Compression | MCL-JCV | -- | 60 |
| Video Classification | Kinetics 700 | Top-1 Accuracy: 74.08 | 46 |
| Video Reconstruction | WebVid 10M | PSNR: 35.72 | 34 |
| Temporal Action Localization | THUMOS14 v1.0 (50%-50%) | mAP (Avg): 33.17 | 17 |
| Temporal Action Localization | ActivityNet 1.3 (50%-50%) | Avg mAP: 29.11 | 17 |
| Frame Reconstruction | COCO (val) | PSNR: 36.05 | 12 |
| General Video Understanding | MVBench Overall | Accuracy: 86.03 | 9 |