Fast Projection onto the Capped Simplex with Applications to Sparse Regression in Bioinformatics

About

We consider the problem of projecting a vector onto the so-called k-capped simplex, which is a hyper-cube cut by a hyperplane. For an n-dimensional input vector with bounded elements, we found that a simple algorithm based on Newton's method is able to solve the projection problem to high precision with a complexity roughly about O(n), which has a much lower computational cost compared with the existing sorting-based methods proposed in the literature. We provide a theory for partial explanation and justification of the method. We demonstrate that the proposed algorithm can produce a solution of the projection problem with high precision on large scale datasets, and the algorithm is able to significantly outperform the state-of-the-art methods in terms of runtime (about 6-8 times faster than a commercial software with respect to CPU time for input vector with 1 million variables or more). We further illustrate the effectiveness of the proposed algorithm on solving sparse regression in a bioinformatics problem. Empirical results on the GWAS dataset (with 1,500,000 single-nucleotide polymorphisms) show that, when using the proposed method to accelerate the Projected Quasi-Newton (PQN) method, the accelerated PQN algorithm is able to handle huge-scale regression problem and it is more efficient (about 3-6 times faster) than the current state-of-the-art methods.

Andersen Ang, Jianzhu Ma, Nianjun Liu, Kun Huang, Yijie Wang• 2021

Related benchmarks

Task	Dataset	Result
Feature Selection	Simulated GWAS 50,000 SNPs 25 Significant SNPs High LD Sample size 150	True Positives25	3
Feature Selection	Simulated GWAS 50,000 SNPs 25 Significant SNPs Mixed LD Sample size 150	Correctly Identified SNPs24	3
Feature Selection	Simulated GWAS 50,000 SNPs 25 Significant SNPs Low LD Sample size 150	Correct Identifications25	3
Feature Selection	Simulated GWAS 50,000 SNPs 25 Significant SNPs High LD (Sample size 50)	Correct Identifications Count19	3
Feature Selection	Simulated GWAS (50,000 SNPs, 25 Significant SNPs, Mixed LD, Sample size 50)	TP Count16	3
Feature Selection	Simulated GWAS 50,000 SNPs 25 Significant SNPs Low LD Sample size 50	True Positives15	3
Sparse Regression	Simulated GWAS Mixed LD 50,000 SNPs, 25 significant SNPs (Sample size 1000)	Correct Identification Rate2.50e+4	3
Sparse Regression	Simulated GWAS High LD 50,000 SNPs, 25 significant SNPs (Sample size 100)	True Positives251.5	3
Sparse Regression	Simulated GWAS Mixed LD 50,000 SNPs, 25 significant SNPs (Sample size 100)	Correct Identifications242	3
Sparse Regression	Simulated GWAS Low LD 50,000 SNPs, 25 significant SNPs (Sample size 100)	Correct Identifications Count233.1	3

Showing 10 of 12 rows

Other info

Code

Follow for update

@wizwand_team Discord