ZSG-IAD: A Multimodal Framework for Zero-Shot Grounded Industrial Anomaly Detection
About
Deep learning-based industrial anomaly detectors often behave as black boxes, making it hard to justify decisions with physically meaningful defect evidence. We propose ZSG-IAD, a multimodal vision-language framework for zero-shot grounded industrial anomaly detection. Given RGB images, sensor images, and 3D point clouds, ZSG-IAD generates structured anomaly reports and pixel-level anomaly masks. ZSG-IAD introduces a language-guided two-hop grounding module: (1) anomaly-related sentences select evidence-like latent slots distilled from multimodal features, yielding coarse spatial support; (2) selected slots modulate feature maps via channel-spatial gating and a lightweight decoder to produce fine-grained masks. To improve reliability, we further apply Executable-Rule GRPO with verifiable rewards to promote structured outputs, anomaly-region consistency, and reasoning-conclusion coherence. Experiments across multiple industrial anomaly benchmarks show strong zero-shot performance and more transparent, physically grounded explanations than prior methods. We will release code and annotations to support future research on trustworthy industrial anomaly detection systems.
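To make the idea of verifiable rewards concrete, here is a minimal sketch of how rule-based checks for structured outputs, anomaly-region consistency, and reasoning-conclusion coherence could be scored and turned into group-relative advantages. It is not the authors' implementation. The JSON schema (`reasoning`, `regions`, `conclusion`), the verdict labels, the equal weighting of the three rules, and all function names are assumptions made for illustration; the abstract does not specify them. The mask is assumed to be the model's own predicted binary mask, given as a non-empty list of rows.

```python
import json
from statistics import mean, pstdev

REQUIRED_KEYS = ("reasoning", "regions", "conclusion")  # hypothetical report schema
VERDICTS = ("normal", "anomalous")                       # hypothetical verdict labels


def parse_report(text):
    """Return the report dict if it matches the assumed schema, else None."""
    try:
        report = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(report, dict):
        return None
    if any(key not in report for key in REQUIRED_KEYS):
        return None
    regions = report["regions"]
    if report["conclusion"] not in VERDICTS or not isinstance(regions, list):
        return None
    # Each region is assumed to be a box [x0, y0, x1, y1] of integers.
    for box in regions:
        if not (isinstance(box, list) and len(box) == 4
                and all(isinstance(v, int) for v in box)):
            return None
    return report


def box_hits_mask(box, mask):
    """True if the inclusive box [x0, y0, x1, y1] covers a positive mask pixel."""
    x0, y0, x1, y1 = box
    rows = range(max(y0, 0), min(y1, len(mask) - 1) + 1)
    cols = range(max(x0, 0), min(x1, len(mask[0]) - 1) + 1)
    return any(mask[y][x] for y in rows for x in cols)


def rule_reward(text, mask):
    """Average of three executable checks, each scored 0 or 1."""
    report = parse_report(text)
    if report is None:
        return 0.0  # structure rule failed; the other rules cannot be evaluated
    boxes = report["regions"]
    mask_positive = any(any(row) for row in mask)
    structure = 1.0
    # Region rule: every reported box must touch the mask; no boxes needs an empty mask.
    if boxes:
        region = 1.0 if all(box_hits_mask(b, mask) for b in boxes) else 0.0
    else:
        region = 0.0 if mask_positive else 1.0
    # Coherence rule: the verdict is "anomalous" exactly when regions are reported.
    coherence = 1.0 if (report["conclusion"] == "anomalous") == bool(boxes) else 0.0
    return (structure + region + coherence) / 3.0


def group_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: centre and scale rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, a group of sampled reports for the same input would each be scored with `rule_reward`, and `group_advantages` would then rank them relative to one another, so a report that passes all three checks is favoured over one whose verdict contradicts its own listed regions.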
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Anomaly Detection | VisA (test) | -- | -- | 148 |
| Anomaly Detection | MPDD (test) | Image-level AU-ROC | 82.3 | 104 |
| Anomaly Detection | BTAD (test) | -- | -- | 43 |
| Anomaly Detection | AITEX (test) | AUC-ROC | 0.711 | 17 |
| Industrial Anomaly Detection and Grounded Reporting | MM-IAD-ReportBench | Accuracy | 82.4 | 15 |
| Anomaly Detection | ELPV (test) | -- | -- | 9 |