Convolutions Need Registers Too: HVS-Inspired Dynamic Attention for Video Quality Assessment

About

No-reference video quality assessment (NR-VQA) estimates perceptual quality without access to a reference video, which makes the task challenging. Recent techniques leverage saliency or transformer attention, but they address the global context of the video signal only by using static maps as auxiliary inputs, rather than embedding context within the feature extraction itself. We present Dynamic Attention with Global Registers for Video Quality Assessment (DAGR-VQA), the first framework to integrate register tokens directly into a convolutional backbone for spatio-temporal, dynamic saliency prediction. By embedding learnable register tokens as global context carriers, the model enables dynamic attention inspired by the human visual system (HVS), producing temporally adaptive saliency maps that track salient regions over time without explicit motion estimation. The model combines these dynamic saliency maps with RGB inputs to capture spatial features, then analyzes them with a temporal transformer to deliver a perceptually consistent quality estimate. Experiments on the LSVQ, KoNViD-1k, LIVE-VQC, and YouTube-UGC datasets show highly competitive performance, surpassing most top baselines. Ablation studies show that integrating register tokens yields stable and temporally consistent attention. Running at 387.7 FPS at 1080p, DAGR-VQA is efficient enough for real-time applications such as multimedia streaming systems.
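The 387.7 FPS figure can be cross-checked against the inference-time entry in the benchmark listing for this paper (10 videos of 240 frames each at 1080p, 6.19 s). The short sketch below does that arithmetic. It assumes that 6.19 s is the total time for all 2,400 frames, which the listing does not state explicitly; the function and constant names are illustrative only.

```python
# Figures from the benchmark listing: 10 test videos, 240 frames each, 1080p.
NUM_VIDEOS = 10
FRAMES_PER_VIDEO = 240
# Assumption: the reported 6.19 s covers all frames of all 10 videos.
TOTAL_INFERENCE_SECONDS = 6.19


def frames_per_second(num_videos: int, frames_per_video: int, total_seconds: float) -> float:
    """Return throughput as total frames processed divided by elapsed seconds."""
    total_frames = num_videos * frames_per_video
    return total_frames / total_seconds


fps = frames_per_second(NUM_VIDEOS, FRAMES_PER_VIDEO, TOTAL_INFERENCE_SECONDS)
print(f"{fps:.1f} FPS")
```

Under that assumption, 2,400 frames in 6.19 s works out to about 387.7 FPS, which agrees with the figure quoted in the abstract.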

Mayesha Maliha R. Mithila, Mylene C.Q. Farias • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| No-Reference Video Quality Assessment | LIVE-VQC | SRCC | 0.886 | 50 |
| No-Reference Video Quality Assessment | YouTube-UGC | SRCC | 0.91 | 47 |
| No-Reference Video Quality Assessment | KoNViD-1k | SRCC | 0.896 | 42 |
| Video Quality Assessment | LIVE-VQC, KoNViD-1k, YouTube-UGC (Weighted Average) | SROCC | 0.9 | 23 |
| No-Reference Video Quality Assessment | LSVQ | PLCC | 0.892 | 13 |
| Video Quality Assessment | 10 videos (240 frames each), 1080p (test) | Inference Time (s) | 6.19 | 8 |
| Saliency Prediction | DHF1K (val) | NSS | 3.683 | 7 |
| Saliency Prediction | DIEM (val) | NSS | 2.856 | 7 |