Fast Projection onto the Capped Simplex with Applications to Sparse Regression in Bioinformatics
About
We consider the problem of projecting a vector onto the so-called k-capped simplex, which is a hyper-cube cut by a hyperplane. For an n-dimensional input vector with bounded elements, we found that a simple algorithm based on Newton's method is able to solve the projection problem to high precision with a complexity roughly about O(n), which has a much lower computational cost compared with the existing sorting-based methods proposed in the literature. We provide a theory for partial explanation and justification of the method. We demonstrate that the proposed algorithm can produce a solution of the projection problem with high precision on large scale datasets, and the algorithm is able to significantly outperform the state-of-the-art methods in terms of runtime (about 6-8 times faster than a commercial software with respect to CPU time for input vector with 1 million variables or more). We further illustrate the effectiveness of the proposed algorithm on solving sparse regression in a bioinformatics problem. Empirical results on the GWAS dataset (with 1,500,000 single-nucleotide polymorphisms) show that, when using the proposed method to accelerate the Projected Quasi-Newton (PQN) method, the accelerated PQN algorithm is able to handle huge-scale regression problem and it is more efficient (about 3-6 times faster) than the current state-of-the-art methods.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Feature Selection | Simulated GWAS 50,000 SNPs 25 Significant SNPs High LD Sample size 150 | True Positives25 | 3 | |
| Feature Selection | Simulated GWAS 50,000 SNPs 25 Significant SNPs Mixed LD Sample size 150 | Correctly Identified SNPs24 | 3 | |
| Feature Selection | Simulated GWAS 50,000 SNPs 25 Significant SNPs Low LD Sample size 150 | Correct Identifications25 | 3 | |
| Feature Selection | Simulated GWAS 50,000 SNPs 25 Significant SNPs High LD (Sample size 50) | Correct Identifications Count19 | 3 | |
| Feature Selection | Simulated GWAS (50,000 SNPs, 25 Significant SNPs, Mixed LD, Sample size 50) | TP Count16 | 3 | |
| Feature Selection | Simulated GWAS 50,000 SNPs 25 Significant SNPs Low LD Sample size 50 | True Positives15 | 3 | |
| Sparse Regression | Simulated GWAS Mixed LD 50,000 SNPs, 25 significant SNPs (Sample size 1000) | Correct Identification Rate2.50e+4 | 3 | |
| Sparse Regression | Simulated GWAS High LD 50,000 SNPs, 25 significant SNPs (Sample size 100) | True Positives251.5 | 3 | |
| Sparse Regression | Simulated GWAS Mixed LD 50,000 SNPs, 25 significant SNPs (Sample size 100) | Correct Identifications242 | 3 | |
| Sparse Regression | Simulated GWAS Low LD 50,000 SNPs, 25 significant SNPs (Sample size 100) | Correct Identifications Count233.1 | 3 |