HyperTokens: Controlling Token Dynamics for Continual Video-Language Understanding

About

Continual VideoQA with multimodal LLMs is hindered by interference between tasks and the prohibitive cost of storing task-specific prompts. We introduce HyperTokens, a transformer-based token generator that produces fine-tuning tokens on demand, giving explicit control over prompt updates while keeping memory fixed. To suppress forgetting, we propose meta-inspired regularisers that look ahead to avoid task-specific sharp directions and anchor the evolving generator to prior tasks. We further connect our objective to sharpness-aware optimisation, providing insight into why it encourages flatter cross-task minima and improves retention. Beyond regularisation, HyperTokens exploits lightweight auxiliary multimodal supervision through shared generation weights; guided by a causal perspective, we design feasible objectives and surrogate mutual-information losses to regularise anti-causal cross-modal directions. Across two standard continual VideoQA benchmarks, HyperTokens achieves higher average accuracy with substantially lower forgetting. Finally, we introduce a challenging cross-modal ImageQA → VideoQA protocol and show that HyperTokens enables robust continual transfer in this setting.
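
The abstract describes two mechanisms concretely enough to illustrate in code: a transformer that generates per-task prompt tokens on demand, and a look-ahead objective connected to sharpness-aware optimisation (SAM). Below is a minimal PyTorch sketch of both, assuming standard SAM rather than the paper's exact regularisers; every name, shape, and hyperparameter here (TokenGenerator, n_tokens, rho, ...) is a hypothetical illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TokenGenerator(nn.Module):
    """Hypothetical transformer that maps a task embedding to a fixed-size
    set of soft prompt tokens, so prompts are generated on demand rather
    than stored per task (constant memory in the number of tasks)."""

    def __init__(self, n_tasks: int, n_tokens: int = 8, dim: int = 256):
        super().__init__()
        self.task_embed = nn.Embedding(n_tasks, dim)
        # One learned query per generated prompt token.
        self.queries = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, task_id: torch.Tensor) -> torch.Tensor:
        cond = self.task_embed(task_id).unsqueeze(1)   # (B, 1, dim)
        x = self.queries.unsqueeze(0) + cond           # (B, n_tokens, dim)
        return self.encoder(x)                         # generated prompt tokens


def sharpness_aware_loss(model, compute_loss, rho: float = 0.05):
    """One SAM-style look-ahead step: ascend to the locally sharpest point
    within an L2 ball of radius rho, take the loss there as the training
    signal, then restore the weights. Minimising this perturbed loss
    discourages task-specific sharp minima."""
    loss = compute_loss()
    loss.backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
    eps = []
    with torch.no_grad():
        for p in params:
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)              # look-ahead perturbation
            eps.append(e)
    model.zero_grad()
    perturbed = compute_loss()     # loss at the perturbed weights
    perturbed.backward()           # gradient the optimiser will apply
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)              # restore the original weights
    return perturbed.detach()


# Toy usage: fit generated tokens toward a dummy target with the SAM loss.
gen = TokenGenerator(n_tasks=3)
opt = torch.optim.SGD(gen.parameters(), lr=0.1)
task = torch.zeros(4, dtype=torch.long)
target = torch.zeros(4, 8, 256)
sharpness_aware_loss(gen, lambda: (gen(task) - target).pow(2).mean())
opt.step()
```

Under the same assumptions, the anchoring regulariser the abstract mentions could be an extra L2 term tying the generator's output for earlier task ids to a frozen snapshot of those tokens, which keeps the evolving generator close to its prior-task behaviour.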

Toan Nguyen, Yang Liu, Celso De Melo, Flora D. Salim • 2026

Related benchmarks

Task                                 Dataset          Metric    Result  Rank
Continual Video Question Answering   NExT-QA (test)   Accuracy  64.75   9
Continual Video Question Answering   DramaQA (test)   Accuracy  71.62   9
Image Question Answering             Visual7w         Accuracy  45.59   2
