HyperTokens: Controlling Token Dynamics for Continual Video-Language Understanding

About

Continual VideoQA with multimodal LLMs is hindered by interference between tasks and the prohibitive cost of storing task-specific prompts. We introduce HyperTokens, a transformer-based token generator that produces fine-tuning tokens on demand, giving explicit control over prompt updates while keeping memory fixed. To suppress forgetting, we propose meta-inspired regularisers that look ahead to avoid task-specific sharp directions and anchor the evolving generator to prior tasks. We further connect our objective to sharpness-aware optimisation, providing insight into why it encourages flatter cross-task minima and improves retention. Beyond regularisation, HyperTokens exploits lightweight auxiliary multimodal supervision through shared generation weights; guided by a causal perspective, we design feasible objectives and surrogate mutual-information losses to regularise anti-causal cross-modal directions. Across two standard continual VideoQA benchmarks, HyperTokens achieves higher average accuracy with substantially lower forgetting. Finally, we introduce a challenging cross-modal ImageQA → VideoQA protocol and show that HyperTokens enables robust continual transfer in this setting.
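
The abstract describes two mechanisms concretely enough to illustrate in code: a transformer that generates per-task prompt tokens on demand, and a look-ahead objective connected to sharpness-aware optimisation (SAM). Below is a minimal PyTorch sketch of both, assuming standard SAM rather than the paper's exact regularisers; every name, shape, and hyperparameter here (TokenGenerator, n_tokens, rho, ...) is a hypothetical illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TokenGenerator(nn.Module):
    """Hypothetical transformer that maps a task embedding to a fixed-size
    set of soft prompt tokens, so prompts are generated on demand rather
    than stored per task (constant memory in the number of tasks)."""

    def __init__(self, n_tasks: int, n_tokens: int = 8, dim: int = 256):
        super().__init__()
        self.task_embed = nn.Embedding(n_tasks, dim)
        # One learned query per generated prompt token.
        self.queries = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, task_id: torch.Tensor) -> torch.Tensor:
        cond = self.task_embed(task_id).unsqueeze(1)   # (B, 1, dim)
        x = self.queries.unsqueeze(0) + cond           # (B, n_tokens, dim)
        return self.encoder(x)                         # generated prompt tokens


def sharpness_aware_loss(model, compute_loss, rho: float = 0.05):
    """One SAM-style look-ahead step: ascend to the locally sharpest point
    within an L2 ball of radius rho, take the loss there as the training
    signal, then restore the weights. Minimising this perturbed loss
    discourages task-specific sharp minima."""
    loss = compute_loss()
    loss.backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
    eps = []
    with torch.no_grad():
        for p in params:
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)              # look-ahead perturbation
            eps.append(e)
    model.zero_grad()
    perturbed = compute_loss()     # loss at the perturbed weights
    perturbed.backward()           # gradient the optimiser will apply
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)              # restore the original weights
    return perturbed.detach()


# Toy usage: fit generated tokens toward a dummy target with the SAM loss.
gen = TokenGenerator(n_tasks=3)
opt = torch.optim.SGD(gen.parameters(), lr=0.1)
task = torch.zeros(4, dtype=torch.long)
target = torch.zeros(4, 8, 256)
sharpness_aware_loss(gen, lambda: (gen(task) - target).pow(2).mean())
opt.step()
```

Under the same assumptions, the anchoring regulariser the abstract mentions could be an extra L2 term tying the generator's output for earlier task ids to a frozen snapshot of those tokens, which keeps the evolving generator close to its prior-task behaviour.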

Toan Nguyen, Yang Liu, Celso De Melo, Flora D. Salim • 2026

Related benchmarks

Task                                 Dataset          Metric    Result  Rank
Continual Video Question Answering   NExT-QA (test)   Accuracy  64.75   9
Continual Video Question Answering   DramaQA (test)   Accuracy  71.62   9
Image Question Answering             Visual7w         Accuracy  45.59   2
