TIPS Over Tricks: Simple Prompts for Effective Zero-shot Anomaly Detection
About
Anomaly detection identifies departures from expected behavior in safety-critical settings. When target-domain normal data are unavailable, zero-shot anomaly detection (ZSAD) leverages vision-language models (VLMs). However, CLIP's coarse image-text alignment limits both localization and detection due to (i) spatial misalignment and (ii) weak sensitivity to fine-grained anomalies; prior works compensate with complex auxiliary modules yet largely overlook the choice of backbone. We revisit the backbone and use TIPS, a VLM trained with spatially aware objectives. While TIPS alleviates CLIP's issues, it exposes a distributional gap between global and local features. We address this with decoupled prompts (fixed for image-level detection, learnable for pixel-level localization) and by injecting local evidence into the global score. Without CLIP-specific tricks, our TIPS-based pipeline improves image-level performance by 1.1-3.9% and pixel-level performance by 1.5-6.9% across seven industrial datasets, delivering strong generalization with a lean architecture. Code is available at github.com/AlirezaSalehy/Tipsomaly.
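To make the idea of decoupled prompts and local-evidence injection concrete, here is a minimal NumPy sketch. It is not the released Tipsomaly code. The function names (`anomaly_prob`, `score_image`), the softmax temperature, and the fusion rule (a weighted average of the global score and the maximum patch score, controlled by a hypothetical `alpha`) are all assumptions made for illustration; the abstract does not specify how local evidence is injected. Plain arrays stand in for the image and text embeddings that a VLM such as TIPS would produce.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def anomaly_prob(feats, text_normal, text_anomalous, temperature=0.07):
    """Return P(anomalous) from a softmax over cosine similarity to two prompts.

    feats: (D,) or (N, D) image features. The temperature is an assumed value.
    """
    feats = l2_normalize(feats)
    text = l2_normalize(np.stack([text_normal, text_anomalous]))  # (2, D)
    logits = feats @ text.T / temperature                         # (..., 2)
    logits = logits - logits.max(axis=-1, keepdims=True)          # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs[..., 1]


def score_image(global_feat, patch_feats, fixed_prompts, learned_prompts, alpha=0.5):
    """Decoupled scoring: fixed prompts for the image, learned prompts for patches.

    fixed_prompts / learned_prompts: (normal_embedding, anomalous_embedding) pairs.
    alpha: hypothetical weight on local evidence in the fused image score.
    """
    # Image-level score from the global feature and the fixed prompt pair.
    global_score = float(anomaly_prob(global_feat, *fixed_prompts))

    # Pixel-level (patch-level) scores from the learnable prompt pair.
    patch_scores = anomaly_prob(patch_feats, *learned_prompts)    # (N,)
    side = int(round(np.sqrt(patch_scores.shape[0])))
    assert side * side == patch_scores.shape[0], "expects a square patch grid"
    anomaly_map = patch_scores.reshape(side, side)

    # Assumed fusion rule: blend the global score with the strongest local response.
    image_score = (1.0 - alpha) * global_score + alpha * float(patch_scores.max())
    return image_score, anomaly_map


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, n_patches = 8, 16
    normal_txt, anomalous_txt = rng.normal(size=dim), rng.normal(size=dim)
    patches = np.tile(normal_txt, (n_patches, 1))
    patches[5] = anomalous_txt  # plant one anomalous-looking patch
    score, amap = score_image(normal_txt, patches,
                              (normal_txt, anomalous_txt),
                              (normal_txt, anomalous_txt))
    print(score, amap.shape)
```

In this toy setup the global feature looks normal, so the image-level score rises only because the planted patch contributes through the local term. That is the intuition behind injecting local evidence: a small defect that barely moves the global embedding can still affect the detection score.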
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Anomaly Detection | VisA | AUROC | 87.7 | 199 |
| Anomaly Detection | MVTec | AUROC | 93.4 | 65 |
| Anomaly Detection | KSDD | AUROC | 0.978 | 40 |
| Image-level Anomaly Detection | DAGM | AUROC | 99.7 | 28 |
| Anomaly Detection | DTD | AUROC | 99.4 | 28 |
| Image-level Anomaly Detection | HeadCT | AUROC | 92.7 | 24 |
| Anomaly Localization | VisA | AUROC | 95.9 | 23 |
| Anomaly Detection | BTAD | AUROC | 95 | 20 |
| Anomaly Localization | KSDD | AUROC | 99.5 | 19 |
| Anomaly Localization | DTD | AUROC | 99.3 | 19 |