FaceLiVT: Face Recognition using Linear Vision Transformer with Structural Reparameterization For Mobile Device
About
This paper introduces FaceLiVT, a lightweight yet powerful face recognition model that integrates a hybrid Convolutional Neural Network (CNN)-Transformer architecture with an innovative and lightweight Multi-Head Linear Attention (MHLA) mechanism. By combining MHLA with a reparameterized token mixer, FaceLiVT effectively reduces computational complexity while preserving competitive accuracy. Extensive evaluations on challenging benchmarks, including LFW, CFP-FP, AgeDB-30, IJB-B, and IJB-C, highlight its superior performance compared to state-of-the-art lightweight models. MHLA notably improves inference speed, allowing FaceLiVT to deliver high accuracy with lower latency on mobile devices. Specifically, FaceLiVT is 8.6× faster than EdgeFace, a recent hybrid CNN-Transformer model optimized for edge devices, and 21.2× faster than a pure ViT-based model. With its balanced design, FaceLiVT offers an efficient and practical solution for real-time face recognition on resource-constrained platforms.
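The efficiency gain of linear attention comes from reordering the attention computation: instead of materializing the N×N softmax attention matrix, a kernel feature map lets one compute (φ(K)ᵀV) first, dropping the cost from O(N²·d) to O(N·d²). The paper's MHLA design is not reproduced here; the sketch below is a generic single-head linear attention in NumPy illustrating only that associativity trick, with an illustrative ReLU feature map (the function name and all parameters are assumptions, not the authors' code):

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Single-head linear attention sketch (illustrative, not FaceLiVT's MHLA).

    Q, K: (N, d) queries/keys; V: (N, d_v) values.
    """
    # Non-negative kernel feature map; ReLU is one common, simple choice.
    phi = lambda x: np.maximum(x, 0.0)
    Qp, Kp = phi(Q), phi(K)
    # Associativity trick: compute phi(K)^T V first -> (d, d_v),
    # so no N x N attention matrix is ever formed.
    KV = Kp.T @ V                    # O(N * d * d_v)
    Z = Qp @ Kp.sum(axis=0)          # (N,) per-row normalizer
    return (Qp @ KV) / (Z[:, None] + eps)

rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = rng.normal(size=(3, N, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (8, 4)
```

For short sequences the softmax and linear variants cost roughly the same; the advantage grows with token count N, which is why it suits higher-resolution feature maps on mobile hardware.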
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Face Recognition | LFW | Accuracy | 99.7 | 206 |
| Face Verification | CA-LFW | Accuracy | 95.76 | 98 |
| Face Recognition | CFP-FP | Accuracy | 97.2 | 98 |
| Face Recognition | IJB-C | TAR @ FAR=1e-4 | 95.7 | 51 |
| Face Recognition | IJB-B | TAR @ FAR=1e-4 | 93.7 | 51 |
| Face Recognition | AgeDB-30 | Accuracy | 97.6 | 49 |
| Face Recognition | CP-LFW | Accuracy | 90.97 | 26 |
| Face Recognition | LFW, CA-LFW, CP-LFW, CFP-FP, and AgeDB-30 (test) | Mean Accuracy (%) | 96.25 | 16 |
| Face Recognition | IJB-B | TPR @ FPR=1e-4 | 93.7 | 11 |
| Face Recognition | IJB-C | TPR @ FAR=1e-4 | 95.7 | 11 |