ViTNT-FIQA: Training-Free Face Image Quality Assessment with Vision Transformers
About
Face Image Quality Assessment (FIQA) is essential for reliable face recognition systems. Current approaches primarily exploit final-layer representations, while training-free methods require multiple forward passes or backpropagation. We propose ViTNT-FIQA, a training-free approach that measures the stability of patch embedding evolution across intermediate Vision Transformer (ViT) blocks. We demonstrate that high-quality face images exhibit stable feature refinement trajectories across blocks, while degraded images show erratic transformations. Our method computes Euclidean distances between L2-normalized patch embeddings from consecutive transformer blocks and aggregates them into image-level quality scores. We empirically validate this correlation between trajectory stability and image quality on a quality-labeled synthetic dataset with controlled degradation levels. Unlike existing training-free approaches, ViTNT-FIQA requires only a single forward pass, with no backpropagation or architectural modifications. Through extensive evaluation on eight benchmarks (LFW, AgeDB-30, CFP-FP, CALFW, Adience, CPLFW, XQLFW, IJB-C), we show that ViTNT-FIQA performs competitively with state-of-the-art methods while remaining computationally efficient and immediately applicable to any pre-trained ViT-based face recognition model.
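The scoring step described above can be illustrated with a minimal NumPy sketch. It assumes the per-block patch embeddings have already been collected from a single forward pass into an array of shape `(num_blocks, num_patches, dim)`. The function name `stability_score`, the mean aggregation, and the sign convention (negated so that higher means more stable) are illustrative assumptions; the abstract does not specify how the distances are aggregated.

```python
import numpy as np


def stability_score(block_embeddings):
    """Hypothetical helper: score the stability of patch embeddings across blocks.

    block_embeddings: array-like of shape (num_blocks, num_patches, dim),
    one set of patch embeddings per transformer block.
    """
    x = np.asarray(block_embeddings, dtype=np.float64)

    # L2-normalize every patch embedding (guard against zero vectors).
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    x = x / np.maximum(norms, 1e-12)

    # Euclidean distance between the same patch in consecutive blocks,
    # giving an array of shape (num_blocks - 1, num_patches).
    dists = np.linalg.norm(x[1:] - x[:-1], axis=-1)

    # Assumption: aggregate with a plain mean and negate it, so that
    # smaller block-to-block changes yield a higher score.
    return -float(dists.mean())


# Toy inputs: 3 blocks, 4 patches, 8-dimensional embeddings.
rng = np.random.default_rng(0)
stable = np.repeat(rng.normal(size=(1, 4, 8)), 3, axis=0)  # identical in every block
erratic = rng.normal(size=(3, 4, 8))                       # unrelated in every block

stable_score = stability_score(stable)
erratic_score = stability_score(erratic)
print(stable_score, erratic_score)
```

In this toy setup the embeddings that do not change between blocks receive the maximum possible score, while the unrelated random embeddings score lower. With a real model, the per-block embeddings would typically be captured with forward hooks on the transformer blocks, which is outside the scope of this sketch.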
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Face Image Quality Assessment | Adience (test) | pAUC (FMR=1e-3) | 0.0107 | 19 |
| Face Image Quality Assessment | Adience | Performance Score @ 1e-3 | 0.0218 | 19 |
| Face Image Quality Assessment | XQLFW | Score (1e-3) | 0.2386 | 5 |
| Face Image Quality Assessment | XQLFW (test) | pAUC (FMR=1e-3) | 0.1318 | 5 |
| Face Image Quality Assessment | IJB-C | pAUC (FMR=1e-3) | 0.66 | 5 |
| Face Image Quality Assessment | AgeDB-30 | Config Value (1e-3) | 0.0262 | 4 |