TaxaBind: A Unified Embedding Space for Ecological Applications
About
We present TaxaBind, a unified embedding space for characterizing any species of interest. TaxaBind is a multimodal embedding space spanning six modalities useful for solving ecological problems: ground-level images of species, geographic location, satellite images, text, audio, and environmental features. To learn this joint embedding space, we use ground-level images of species as the binding modality. We propose multimodal patching, a technique for effectively distilling the knowledge from the other modalities into the binding modality. We construct two large datasets for pretraining: iSatNat, which pairs species images with satellite images, and iSoundNat, which pairs species images with audio. Additionally, we introduce TaxaBench-8k, a diverse multimodal dataset with six paired modalities for evaluating deep learning models on ecological tasks. Experiments with TaxaBind demonstrate its strong zero-shot and emergent capabilities on a range of tasks including species classification, cross-modal retrieval, and audio classification. The datasets and models are made available at https://github.com/mvrl/TaxaBind.
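To make the idea of a shared embedding space concrete, the sketch below shows how zero-shot classification and cross-modal retrieval are commonly done once every modality maps into the same vector space: embed the query, embed the candidates, and rank by cosine similarity. This is a minimal illustration under stated assumptions, not TaxaBind's implementation. The function names (`l2_normalize`, `cosine_similarity`, `rank_candidates`) are hypothetical, and the three-dimensional vectors are hand-made toy values standing in for real encoder outputs.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_candidates(query, candidates):
    """Return candidate labels sorted by descending similarity to the query.

    `query` is an embedding from one modality (for example an image), and
    `candidates` maps labels to embeddings from another modality (for example
    text prompts, audio clips, or satellite tiles).
    """
    scored = [(cosine_similarity(query, emb), label)
              for label, emb in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored]


# Toy embeddings (assumed values, not produced by any real model).
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = {
    "Cardinalis cardinalis": [1.0, 0.0, 0.1],
    "Quercus alba": [0.0, 1.0, 0.0],
    "Danaus plexippus": [0.1, 0.2, 1.0],
}

ranking = rank_candidates(image_embedding, text_embeddings)
predicted_species = ranking[0]  # zero-shot prediction is the top-ranked label
```

The same ranking routine covers both use cases: with species-name text embeddings as candidates it acts as a zero-shot classifier, and with embeddings from a gallery of audio clips or satellite images it acts as a retrieval step.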
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Classification | CUB-200 | Accuracy | 75 | 106 |
| Regression | California Housing | -- | -- | 71 |
| Geolocation | AVG (test) | City Acc (25km) | 0.1 | 10 |
| Species Distribution Modeling | SDM single-layer probing | Accuracy (%) | 3.1 | 9 |
| Population Density Regression | US Population Density | R-squared (%) | 41 | 9 |
| Species Distribution Modeling | SDM | Accuracy | 3.04 | 9 |
| Plant Traits Regression | Plant traits single-layer probing | R² (%) | 56.9 | 9 |
| Biomes Classification | Biomes single-layer probing | F1 Score | 59.3 | 9 |
| Image-to-Audio Retrieval | GeoSound (Sentinel Imagery, scale=1) | R@10% | 23.5 | 9 |
| Median Income Regression | US County-level Median Household Income USDA 2021 | R² (%) | 15 | 9 |