SPROUT: A Scalable Diffusion Foundation Model for Agricultural Vision
About
Vision Foundation Models (VFMs) pre-trained on large-scale unlabeled data have achieved remarkable success on general computer vision tasks, yet they typically suffer from significant domain gaps when applied to agriculture. In this context, we introduce **SPROUT** (**S**calable **P**lant **R**epresentation model via **O**pen-field **U**nsupervised **T**raining), a multi-crop, multi-task agricultural foundation model trained via diffusion denoising. SPROUT leverages a VAE-free pixel-space Diffusion Transformer to learn rich, structure-aware representations through denoising while enabling efficient end-to-end training. We pre-train SPROUT on a curated dataset of 2.6 million high-quality agricultural images spanning diverse crops, growth stages, and environments. Extensive experiments demonstrate that SPROUT consistently outperforms state-of-the-art web-pretrained and agricultural foundation models across a wide range of downstream tasks, while requiring substantially lower pre-training cost. The code and model are available at https://github.com/UTokyo-FieldPhenomics-Lab/SPROUT.
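To make the pre-training objective concrete, here is a minimal NumPy sketch of the standard diffusion-denoising setup the description refers to: noise an image in pixel space at a random timestep, then score a network's noise prediction with an MSE loss. This is an illustrative sketch only, not SPROUT's actual implementation; the linear noise schedule, timestep count, and toy image size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear noise schedule (assumption, not taken from SPROUT).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention coefficients

def noisy_sample(x0, t):
    """Forward diffusion step: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

def denoising_loss(pred_eps, eps):
    """MSE between the predicted and true noise (the denoising objective)."""
    return float(np.mean((pred_eps - eps) ** 2))

# A toy 3x32x32 "image" in pixel space stands in for an agricultural photo.
x0 = rng.standard_normal((3, 32, 32))
xt, eps = noisy_sample(x0, t=500)

# With an untrained predictor (all zeros), the loss is roughly E[eps^2] ~ 1;
# training a Diffusion Transformer to predict eps drives this loss down.
loss = denoising_loss(np.zeros_like(eps), eps)
```

In a pixel-space model like the one described, the noise predictor operates directly on image pixels, with no VAE encoder/decoder in the loop, which is what allows end-to-end training.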
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | Cassava Leaf Disease | Accuracy | 91.61 | 24 |
| Organ-level semantic segmentation | Apple Flower | mIoU | 65.92 | 18 |
| Organ-level semantic segmentation | Apple Fruit | IoU | 74.28 | 18 |
| Organ-level semantic segmentation | Peach Flower | mIoU | 62.09 | 18 |
| Organ-level semantic segmentation | Pear Flower | mIoU | 73.03 | 18 |
| Organ-level semantic segmentation | Grape Fruit | IoU | 90.63 | 18 |
| Organ-level semantic segmentation | Wheat | Spike IoU | 85.54 | 18 |
| Organ-level semantic segmentation | Rice | Green Veg IoU | 87.11 | 18 |
| Classification | Banana | Average Accuracy | 90.77 | 16 |
| Object Counting | Wheat Spikes (val) | MAE | 3.62 | 11 |