
TokenDial: Continuous Attribute Control in Text-to-Video via Spatiotemporal Token Offsets

About

We present TokenDial, a framework for continuous, slider-style attribute control in pretrained text-to-video generation models. While modern generators produce strong holistic videos, they offer limited control over how much an attribute changes (e.g., effect intensity or motion magnitude) without drifting identity, background, or temporal coherence. TokenDial is built on the observation that additive offsets in the intermediate spatiotemporal visual patch-token space form a semantic control direction, so that adjusting the offset magnitude yields coherent, predictable edits for both appearance and motion dynamics. We learn attribute-specific token offsets without retraining the backbone, using pretrained understanding signals: semantic direction matching for appearance and motion-magnitude scaling for motion. We demonstrate TokenDial's effectiveness on diverse attributes and prompts, achieving stronger controllability and higher-quality edits than state-of-the-art baselines, supported by extensive quantitative evaluation and human studies.
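The core idea of an additive offset whose magnitude acts as a slider can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `apply_token_offset`, the nested-list layout `[frame][patch][channel]`, and the toy values are all assumptions made for illustration. The abstract does not say how the offsets are learned or where in the backbone they are injected, so the sketch only shows the scaling step.

```python
def apply_token_offset(tokens, offset, alpha):
    """Return tokens + alpha * offset, element-wise.

    tokens, offset: nested lists indexed as [frame][patch][channel]
                    (a hypothetical layout for spatiotemporal patch tokens).
    alpha: slider value; 0.0 leaves the tokens unchanged, larger values
           move them further along the offset direction.
    """
    return [
        [
            [x + alpha * d for x, d in zip(tok, off)]
            for tok, off in zip(frame_tokens, frame_offset)
        ]
        for frame_tokens, frame_offset in zip(tokens, offset)
    ]


# Toy example: 2 frames, 2 patches per frame, 2 channels per token.
tokens = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
# Stand-in for a learned, attribute-specific offset (values are made up).
offset = [[[0.5, -0.5], [1.0, 0.0]], [[0.0, 2.0], [-1.0, 1.0]]]

unchanged = apply_token_offset(tokens, offset, 0.0)
half = apply_token_offset(tokens, offset, 0.5)
full = apply_token_offset(tokens, offset, 1.0)
```

In this sketch the edit is linear in `alpha`, which is the property that makes a single scalar usable as a slider; whether a real backbone responds as smoothly is the empirical claim the paper evaluates.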

Zhixuan Liu, Peter Schaldenbrand, Yijun Li, Long Mai, Aniruddha Mahapatra, Cusuh Ham, Jean Oh, Jui-Hsien Wang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Slider Controllability | FreeSliders (evaluation set) | Range (CR): 46 | 7 |
| Slider-based Video Editing | User Study Appearance and Motion Sliders | Editing Quality Score: 3.91 | 7 |
| Video Editing | EditVerse | Edit Quality: 4.165 | 7 |

Other info

GitHub
