MS2MetGAN: Latent-space adversarial training for metabolite-spectrum matching in MS/MS database search
About
Database search is a widely used approach for identifying metabolites from tandem mass spectra (MS/MS). In this strategy, an experimental spectrum is matched against a user-specified database of candidate metabolites, and candidates are ranked so that true metabolite-spectrum matches receive the highest scores. Machine-learning methods have been widely incorporated into database-search-based identification tools and have substantially improved performance. To further improve identification accuracy, we propose a new framework for generating negative training samples. The framework first uses autoencoders to learn latent representations of metabolite structures and MS/MS spectra, thereby recasting metabolite-spectrum matching as matching between latent vectors. It then uses a generative adversarial network (GAN) to generate latent vectors of decoy metabolites and pairs them with spectra to construct decoy metabolite-spectrum matches, which serve as negative samples for training. Experimental results show that our tool, MS2MetGAN, achieves better overall performance than existing metabolite identification methods.
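The latent-space formulation above can be illustrated with a minimal sketch. It assumes the autoencoders have already produced latent vectors, and it is not the MS2MetGAN implementation: the cosine scoring function, the function names, and the toy vectors are all assumptions made for illustration, and `stub_decoy_generator` is a plain Gaussian perturbation standing in for a trained GAN generator, which the abstract does not specify in detail.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(spectrum_z, candidate_zs, score=cosine):
    """Sort candidate names from best to worst match for one spectrum.

    ASSUMPTION: cosine similarity stands in for the learned matching score.
    """
    return sorted(
        candidate_zs,
        key=lambda name: score(spectrum_z, candidate_zs[name]),
        reverse=True,
    )


def stub_decoy_generator(true_z, rng, scale=0.5):
    """HYPOTHETICAL placeholder for a trained GAN generator.

    It only adds Gaussian noise to the true metabolite latent vector;
    a real generator would be trained adversarially.
    """
    return [x + rng.gauss(0.0, scale) for x in true_z]


def build_training_pairs(matches, n_decoys, rng, generator=stub_decoy_generator):
    """Turn true matches into labeled (spectrum_z, metabolite_z, label) triples.

    Each true match yields one positive (label 1) and n_decoys negatives
    (label 0) that pair the same spectrum with generated decoy vectors.
    """
    pairs = []
    for spectrum_z, metabolite_z in matches:
        pairs.append((spectrum_z, metabolite_z, 1))
        for _ in range(n_decoys):
            pairs.append((spectrum_z, generator(metabolite_z, rng), 0))
    return pairs


# Toy 3-dimensional latent vectors (made up for the example).
rng = random.Random(0)
spectrum_z = [0.9, 0.1, 0.0]
candidates = {
    "cand_A": [1.0, 0.0, 0.0],
    "cand_B": [0.0, 1.0, 0.0],
    "cand_C": [0.0, 0.0, 1.0],
}

ranking = rank_candidates(spectrum_z, candidates)
pairs = build_training_pairs([(spectrum_z, candidates["cand_A"])], n_decoys=3, rng=rng)
```

In this sketch `ranking` orders the candidates by similarity to the spectrum's latent vector, and `pairs` holds one positive followed by three decoy negatives that could be fed to a binary match classifier.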
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Metabolite Identification | CASMI FP 2017 | Accuracy | 86.3 | 18 |
| Metabolite Identification | GNPS S | Accuracy | 75.65 | 18 |
| Metabolite Identification | EMBL-MCF | Accuracy | 93.25 | 18 |
| Metabolite Identification | MONA | Accuracy | 77.19 | 18 |
| Metabolite Identification | CASMI SP 2017 | Accuracy | 90.48 | 18 |
| Metabolite Identification | CASMI FP 2016 | Accuracy | 87.4 | 18 |
| Metabolite Identification | GNPS-M | Accuracy | 79.9 | 18 |
| Metabolite Identification | CASMI SP 2016 | Accuracy | 86.07 | 18 |
| Metabolite Identification | CASMI 2022P | Accuracy | 37.89 | 18 |
| Database Searching | MetaCyc Database (test) | MIDAS | 66.67 | 1 |