
SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization

About

This paper presents the Semantic-aWarE spatial-tEmporal Tokenizer (SweetTok), a novel video tokenizer that overcomes the limitations of current video tokenization methods in achieving compact yet effective discretization. Unlike previous approaches that process flattened local visual patches via direct discretization or adaptive query tokenization, SweetTok proposes a decoupling framework, compressing visual inputs through distinct spatial and temporal queries via a Decoupled Query AutoEncoder (DQAE). This design allows SweetTok to efficiently compress the video token count while achieving superior fidelity by capturing essential information across the spatial and temporal dimensions. Furthermore, we design a Motion-enhanced Language Codebook (MLC) tailored for spatial and temporal compression to address the differences in semantic representation between appearance and motion information. SweetTok significantly improves video reconstruction results by 42.8% in rFVD on the UCF-101 dataset. With a better token compression strategy, it also boosts downstream video generation results by 15.1% in gFVD. Additionally, the compressed decoupled tokens are imbued with semantic information, enabling few-shot recognition powered by LLMs in downstream applications.
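The decoupled spatial/temporal query compression described above can be sketched with plain cross-attention. The snippet below is a hypothetical minimal illustration in NumPy, not the paper's implementation: a small set of spatial queries summarizes each frame's patches (appearance), and a set of temporal queries then summarizes the per-frame summaries (motion), so the token count drops from T×P patch tokens to n_spatial + n_temporal decoupled tokens. The query counts, dimensions, and the single-head attention are all illustrative assumptions.

```python
import numpy as np

def cross_attend(queries, keys_values):
    # Scaled dot-product cross-attention: each query row produces one
    # output token as a softmax-weighted mixture of keys_values rows.
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

# Toy video features: T frames, each with P patch embeddings of dim D.
T, P, D = 4, 16, 32
rng = np.random.default_rng(0)
patches = rng.standard_normal((T, P, D))

# Decoupled queries (hypothetical sizes; learned parameters in a real model).
n_spatial, n_temporal = 4, 2
spatial_q = rng.standard_normal((n_spatial, D))
temporal_q = rng.standard_normal((n_temporal, D))

# Spatial queries compress each frame's P patches into n_spatial tokens.
frame_tokens = np.stack([cross_attend(spatial_q, patches[t]) for t in range(T)])

# Appearance tokens: spatial summary of a reference frame.
appearance_tokens = frame_tokens[0]                       # (n_spatial, D)
# Motion tokens: temporal queries attend over all per-frame summaries.
motion_tokens = cross_attend(temporal_q, frame_tokens.reshape(-1, D))

print(appearance_tokens.shape, motion_tokens.shape)       # (4, 32) (2, 32)
```

The point of the decoupling is visible in the arithmetic: the T×P = 64 patch tokens are reduced to 4 appearance tokens plus 2 motion tokens, with appearance and motion captured by separate query sets rather than one flattened sequence.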

Zhentao Tan, Ben Xue, Jian Jia, Junhao Wang, Wencai Ye, Shaoyun Shi, Mingjie Sun, Wenjin Wu, Quan Chen, Peng Jiang • 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Video Classification | Kinetics-400 | - | - | 131
Video Generation | UCF-101 (test) | - | - | 105
Video Classification | Kinetics-600 | Top-1 Accuracy | 65.01 | 84
Video Classification | Kinetics-700 | Top-1 Accuracy | 61.45 | 46
Video Reconstruction | WebVid-10M | PSNR | 32.32 | 34
Temporal Action Localization | THUMOS14 v1.0 (50%-50%) | mAP (Avg) | 25.32 | 17
Temporal Action Localization | ActivityNet 1.3 (50%-50%) | Avg mAP | 24.53 | 17
Video Reconstruction | UCF-101 (test) | rFVD | 18 | 17
Frame Reconstruction | COCO (val) | PSNR | 32.78 | 12
