Vision-Language Feature Alignment for Road Anomaly Segmentation
About
Safe autonomous systems operating in complex environments require robust road anomaly segmentation to identify unknown obstacles. However, existing approaches often rely on pixel-level statistics to decide whether a region appears anomalous. This reliance leads to high false-positive rates on semantically normal background regions such as sky or vegetation, and to poor recall of true out-of-distribution (OOD) instances, posing safety risks for robotic perception and decision-making. To address these challenges, we propose VL-Anomaly, a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt-learning-driven alignment module that adapts Mask2Former's visual features to CLIP text embeddings of known categories, effectively suppressing spurious anomaly responses in background regions. At inference time, we further introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity, and detector confidence, enabling more reliable anomaly prediction by leveraging complementary information sources. Extensive experiments demonstrate that VL-Anomaly achieves state-of-the-art performance on benchmark datasets including RoadAnomaly, SMIYC, and Fishyscapes. Code is released at https://github.com/NickHezhuolin/VL-aligner-Road-anomaly-segment.
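The multi-source inference strategy described above combines three per-pixel cues into one anomaly map. The released code is the authoritative reference; as a rough illustration only, a weighted fusion of normalized cues might look like the sketch below (the function name `fuse_anomaly_scores`, the weights, and the convention that all three inputs measure "normality" are assumptions, not the paper's implementation):

```python
import numpy as np

def fuse_anomaly_scores(text_sim, clip_sim, det_conf, weights=(0.5, 0.3, 0.2)):
    """Fuse three per-pixel "normality" cues into a single anomaly map.

    text_sim: (H, W) similarity of visual features to known-class text embeddings
    clip_sim: (H, W) CLIP image-text similarity to known-class prompts
    det_conf: (H, W) detector confidence for known classes
    All weights are illustrative and assumed to sum to 1.
    """
    def norm(x):
        # Min-max normalize each cue to [0, 1] so the weighted sum is comparable.
        x = x.astype(np.float64)
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    w1, w2, w3 = weights
    # High similarity / confidence indicates a known class, so invert each
    # normalized cue: low agreement with known categories means high anomaly.
    return (w1 * (1 - norm(text_sim))
            + w2 * (1 - norm(clip_sim))
            + w3 * (1 - norm(det_conf)))
```

With weights summing to 1, the fused map stays in [0, 1], so a single threshold can be swept to produce the AuPRC and FPR95 numbers reported below.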
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Anomaly Segmentation | Fishyscapes Lost & Found (val) | FPR95 | 8.4 | 74 |
| Anomaly Segmentation | Fishyscapes Static (val) | FPR95 | 0.023 | 53 |
| Anomaly Segmentation | SMIYC-RA21 (val) | AuPRC (%) | 95.1 | 22 |
| Anomaly Segmentation | SMIYC-RO21 (val) | AuPRC | 91 | 22 |
| Anomaly Segmentation | SMIYC-RA21 (val) | sIoU | 59.6 | 13 |