Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset
About
We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. The dataset supports a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. Using less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance. This result highlights the potential of data-efficient foundation model adaptation for industrial inspection and generation, and paves the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence. Additional details and resources are available at https://ninaneon.github.io/projectpage/
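To make the idea of an aligned image-text pair concrete, the following is a minimal sketch of what one record might look like. The class name `DefectRecord`, its field names, the example values, and the `to_caption` helper are all hypothetical illustrations based only on the attributes named above (location, severity, contextual attributes); they are not the dataset's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class DefectRecord:
    """Hypothetical layout of one image-text pair (not the official schema)."""
    image_path: str      # path to the defect image
    material: str        # one of the material categories
    defect_type: str     # one of the defect types
    location: str        # where the defect sits on the part
    severity: str        # free-form severity label
    description: str     # fine-grained textual description
    attributes: dict[str, str] = field(default_factory=dict)  # contextual attributes


def to_caption(record: DefectRecord) -> str:
    """Build a short caption from the structured fields."""
    return f"{record.severity} {record.defect_type} on {record.material}, {record.location}"


# Illustrative values only; they do not come from the dataset.
example = DefectRecord(
    image_path="images/000001.png",
    material="steel",
    defect_type="scratch",
    location="upper-left edge",
    severity="minor",
    description="A thin, shallow scratch near the upper-left edge of the plate.",
    attributes={"lighting": "diffuse"},
)

print(to_caption(example))
```

A structure of this kind would let the same record feed classification (via `defect_type`), retrieval or captioning (via `description`), and conditional generation (via the caption string), which matches the range of tasks listed above.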
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Anomaly Detection | MVTec-AD (test) | P-AUROC 96.1 | 132 |
| Anomaly Detection | VisA (test) | -- | 91 |
| Defect Classification | MVTec AD | Accuracy 98.3 | 19 |
| Defect Classification | VisA | Accuracy 97.7 | 19 |
| Defect Classification | Magnetic Tile | Accuracy 96.2 | 1 |
| Defect Classification | Steel Surface | Accuracy 94.5 | 1 |
| Industrial Anomaly Detection | MVTec-AD (test) | Acc 91 | 1 |
| Pixel-level Segmentation | MVTec AD bottle | Accuracy 92.25 | 1 |
| Pixel-level Segmentation | MVTec AD cable | Accuracy 89.7 | 1 |
| Pixel-level Segmentation | VisA candle | Accuracy 90.3 | 1 |