Convolutions Need Registers Too: HVS-Inspired Dynamic Attention for Video Quality Assessment

About

No-reference video quality assessment (NR-VQA) estimates perceptual quality without access to a pristine reference video, which makes the task challenging. Recent techniques leverage saliency or transformer attention, but they handle the global context of the video signal only through static maps supplied as auxiliary inputs, rather than embedding that context in the feature extraction itself. We present Dynamic Attention with Global Registers for Video Quality Assessment (DAGR-VQA), the first framework to integrate register tokens directly into a convolutional backbone for spatio-temporal, dynamic saliency prediction. By embedding learnable register tokens as global context carriers, our model enables dynamic, HVS-inspired attention and produces temporally adaptive saliency maps that track salient regions over time without explicit motion estimation. The dynamic saliency maps are combined with the RGB inputs to extract spatial features, which a temporal transformer then aggregates into a perceptually consistent video quality score. Experiments on the LSVQ, KoNViD-1k, LIVE-VQC, and YouTube-UGC datasets show highly competitive performance, surpassing the majority of top baselines. Ablation studies show that integrating register tokens yields stable and temporally consistent attention. Running at 387.7 FPS at 1080p, DAGR-VQA is fast enough for real-time applications such as multimedia streaming systems.

Mayesha Maliha R. Mithila, Mylene C.Q. Farias • 2026
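
To make the idea of register tokens as global context carriers more concrete, here is a minimal NumPy sketch. It is not the authors' architecture: the abstract does not specify how registers interact with the convolutional features, so the read/write attention steps, the normalization, the carrying of registers from frame to frame, and the names `register_attention` and `saliency_over_time` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def register_attention(feat, registers):
    """Hypothetical register read/write step (an assumption, not the paper's layer).

    feat:      (C, H, W) feature map from a convolutional backbone
    registers: (R, C) register tokens (learned in a real model, random here)
    """
    C, H, W = feat.shape
    tokens = feat.reshape(C, H * W).T                 # (HW, C) spatial tokens
    scale = 1.0 / np.sqrt(C)
    # Read: each register pools global context from every spatial position.
    read = softmax(registers @ tokens.T * scale, axis=-1)    # (R, HW)
    updated = registers + read @ tokens                       # (R, C)
    updated = updated / (np.linalg.norm(updated, axis=-1, keepdims=True) + 1e-8)
    # Write: score each position against the registers to get a saliency map.
    scores = tokens @ updated.T * scale                       # (HW, R)
    saliency = softmax(scores.max(axis=1), axis=0).reshape(H, W)
    return updated, saliency

def saliency_over_time(frames, registers):
    # Assumption: registers are carried across frames, so each map depends on
    # earlier frames without any explicit motion estimation.
    maps = []
    for feat in frames:
        registers, sal = register_attention(feat, registers)
        maps.append(sal)
    return np.stack(maps)                              # (T, H, W)
```

Calling `saliency_over_time` on a `(T, C, H, W)` array of per-frame features and an `(R, C)` array of registers returns one saliency map per frame, each normalized to sum to 1.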

Related benchmarks

Task	Dataset	Result
No-Reference Video Quality Assessment	LIVE-VQC	SRCC 0.886	50
No-Reference Video Quality Assessment	YouTube-UGC	SRCC 0.91	47
No-Reference Video Quality Assessment	KoNViD-1k	SRCC 0.896	42
Video Quality Assessment	LIVE-VQC, KoNViD-1k, YouTube-UGC (Weighted Average)	SROCC 0.9	23
No-Reference Video Quality Assessment	LSVQ	PLCC 0.892	13
Video Quality Assessment	10 videos (240 frames each) 1080p (test)	Inference Time (s) 6.19	8
Saliency Prediction	DHF1K (val)	NSS 3.683	7
Saliency Prediction	DIEM (val)	NSS 2.856	7
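
For readers unfamiliar with the metrics in the table, the following plain-Python sketch shows how SRCC (Spearman rank correlation, also written SROCC) and PLCC (Pearson linear correlation) between predicted scores and subjective ratings can be computed, and how a throughput figure follows from a total inference time. The function names and the sample scores are hypothetical. VQA papers often fit a nonlinear mapping to the predictions before computing PLCC; that step is omitted here. The throughput check assumes the 6.19 s in the table is the total time for all 10 videos, an interpretation that reproduces the 387.7 FPS quoted in the abstract.

```python
from math import sqrt

def pearson(x, y):
    # PLCC: linear correlation between two equal-length score lists.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def average_ranks(values):
    # 1-based ranks; tied values share the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # SRCC: Pearson correlation of the ranks.
    return pearson(average_ranks(x), average_ranks(y))

def throughput_fps(num_videos, frames_per_video, total_seconds):
    # Frames processed per second over the whole test set.
    return num_videos * frames_per_video / total_seconds

# Hypothetical subjective scores and model predictions for five videos.
mos = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.2, 1.9, 3.5, 3.4, 4.8]
print(round(spearman(mos, pred), 3))             # 0.9 (one swapped pair of ranks)
print(round(throughput_fps(10, 240, 6.19), 1))   # 387.7
```

SRCC depends only on the ordering of the predictions, so it rewards a model that ranks videos correctly even when its scores are on a different scale from the subjective ratings; PLCC additionally measures how linear the relationship is.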
