Data Augmentation for Compositional Data: Advancing Predictive Models of the Microbiome

About

Data augmentation plays a key role in modern machine learning pipelines. While numerous augmentation strategies have been studied in the context of computer vision and natural language processing, less is known for other data modalities. Our work extends the success of data augmentation to compositional data, i.e., simplex-valued data, which is of particular interest in the context of the human microbiome. Drawing on key principles from compositional data analysis, such as the Aitchison geometry of the simplex and subcompositions, we define novel augmentation strategies for this data modality. Incorporating our data augmentations into standard supervised learning pipelines results in consistent performance gains across a wide range of standard benchmark datasets. In particular, we set a new state-of-the-art for key disease prediction tasks including colorectal cancer, type 2 diabetes, and Crohn's disease. In addition, our data augmentations enable us to define a novel contrastive learning model, which improves on previous representation learning approaches for microbiome compositional data. Our code is available at https://github.com/cunningham-lab/AugCoDa.

Elliott Gordon-Rodriguez, Thomas P. Quinn, John P. Cunningham • 2022
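
The abstract names two ingredients from compositional data analysis: the Aitchison geometry of the simplex and subcompositions. As a rough illustration of how such ingredients can produce new training samples, the sketch below combines two compositions with a weighted geometric mean followed by closure (the Aitchison analogue of a convex combination) and draws a random subcomposition. This is not the authors' implementation. The function names `closure`, `aitchison_combination`, and `random_subcomposition` are hypothetical, and keeping the vector length fixed by zeroing the dropped parts is an assumption made here for convenience. The linked repository holds the actual method.

```python
import random


def closure(v):
    """Rescale a non-negative vector so that its parts sum to 1."""
    total = sum(v)
    if total <= 0:
        raise ValueError("closure needs at least one positive part")
    return [p / total for p in v]


def aitchison_combination(x, y, lam):
    """Combine two compositions in the Aitchison geometry.

    Computes a component-wise weighted geometric mean and re-closes it,
    i.e. (lam powering x) perturbed by ((1 - lam) powering y).
    Assumption: x and y share at least one strictly positive part.
    """
    if len(x) != len(y):
        raise ValueError("compositions must have the same number of parts")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return closure([a ** lam * b ** (1.0 - lam) for a, b in zip(x, y)])


def random_subcomposition(x, keep_prob, rng):
    """Keep each part with probability keep_prob, then re-close.

    Assumption: dropped parts are set to zero so the vector length is
    unchanged. If no positive part survives, the input is returned as-is.
    """
    mask = [rng.random() < keep_prob for _ in x]
    kept = [p if m else 0.0 for p, m in zip(x, mask)]
    if sum(kept) <= 0:
        return list(x)
    return closure(kept)


# Hypothetical usage on two toy 3-part compositions.
rng = random.Random(0)
x = [0.2, 0.3, 0.5]
y = [0.5, 0.3, 0.2]
mixed = aitchison_combination(x, y, lam=0.5)
sub = random_subcomposition(x, keep_prob=0.7, rng=rng)
print(mixed, sub)
```

The sketch only transforms features. How labels are assigned to augmented samples is left out, since the abstract does not describe it.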

Related benchmarks

Task                Dataset                           Result       Rank
Disease prediction  Task 1 Colorectal Cancer (test)   AUROC 0.79   10
Disease prediction  Task 2 Type 2 Diabetes (test)     AUROC 0.83   10
Disease prediction  Task 3 Crohn's Disease (test)     AUROC 100    10
Disease prediction  Task 4                            AUROC 65     10
Disease prediction  Task 6                            AUROC 0.84   6
Disease prediction  Task 7                            AUROC 0.76   6
Disease prediction  Task 5                            AUROC 1      4
Disease prediction  Task 8                            AUROC 72     4
Disease prediction  Task 9                            AUROC 0.91   4
Disease prediction  Task 12                           AUROC 0.66   4
Showing 10 of 12 rows
