Deep Unsupervised Feature Selection by Discarding Nuisance and Correlated Features

About

Modern datasets often contain large subsets of correlated features and nuisance features, which are unrelated or only loosely related to the main underlying structures of the data. Nuisance features can be identified using the Laplacian score criterion, which evaluates the importance of a given feature via its consistency with the leading eigenvectors of the graph Laplacian. We demonstrate that in the presence of large numbers of nuisance features, the Laplacian must be computed on the subset of selected features rather than on the complete feature set. To do this, we propose a fully differentiable approach for unsupervised feature selection, utilizing the Laplacian score criterion to avoid the selection of nuisance features. To cope with correlated features, we employ an autoencoder architecture trained to reconstruct the data from the subset of selected features. We build on the recently proposed concrete layer, which allows controlling the number of selected features via architectural design and thereby simplifies the optimization process. Experimenting on several real-world datasets, we demonstrate that our proposed approach outperforms similar approaches designed to avoid only correlated or only nuisance features, but not both. Several state-of-the-art clustering results are reported.
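
To make the Laplacian score criterion concrete, the following is a minimal NumPy sketch of the classic, non-differentiable Laplacian score, in which lower values indicate a feature that varies smoothly over the sample graph. It is an illustration under stated assumptions and not the paper's implementation: the function name `laplacian_scores`, the Gaussian kernel with bandwidth `sigma`, and the `graph_features` argument (the feature subset used to build the graph) are all hypothetical choices, and the differentiable variant used in the paper may be defined differently.

```python
import numpy as np


def laplacian_scores(X, graph_features=None, sigma=1.0):
    """Classic Laplacian score per feature (lower = smoother on the graph).

    The graph is built only from the columns listed in `graph_features`
    (all columns if None). Kernel choice and `sigma` are assumptions.
    """
    X = np.asarray(X, dtype=float)
    Z = X if graph_features is None else X[:, graph_features]

    # Squared Euclidean distances between samples, on the chosen features.
    sq = np.sum(Z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)

    # Gaussian affinity matrix without self-loops, and node degrees.
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    L = np.diag(d) - W  # unnormalized graph Laplacian

    # Remove the degree-weighted mean of every feature.
    F = X - (d @ X) / d.sum()

    # Score_j = (f_j^T L f_j) / (f_j^T D f_j), computed for all columns.
    num = np.sum(F * (L @ F), axis=0)
    den = np.sum(F * (d[:, None] * F), axis=0)
    return num / den


# Toy data: one feature carrying a two-cluster structure, one pure-noise feature.
rng = np.random.default_rng(0)
labels = np.repeat([0.0, 1.0], 50)
structured = 5.0 * labels + 0.1 * rng.normal(size=100)
nuisance = rng.normal(size=100)
X = np.column_stack([structured, nuisance])

# Build the graph from the structured feature only, then score both features.
scores = laplacian_scores(X, graph_features=[0])
print(scores)
```

Passing `graph_features` mirrors the point made in the abstract: the graph, and hence the Laplacian, is computed from a chosen subset of features, while every feature can still be scored against it.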

Uri Shaham, Ofir Lindenbaum, Jonathan Svirsky, Yuval Kluger • 2021

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Feature Selection | 1000x4-3 +2NF synthetic | Mean Proportion Correct | 100 | 8 |
| Feature Selection | 2NF synthetic 1000x4-5 | Mean Correct Features Proportion | 100 | 8 |
| Feature Selection | Synthetic 1000x4-10 +2NF | Mean Selection Proportion | 100 | 8 |
| Feature Selection | Synthetic 2000x20-5 +10NF | Mean Selection Rate | 95 | 8 |
| Feature Selection | Synthetic 2000x20-20 +10NF | Mean Correct Feature Proportion | 99 | 8 |
| Feature Selection | Synthetic 2000x30-10 +15NF | Mean Correct Feature Proportion | 96 | 8 |
| Feature Selection | 2000x30-20 +15NF synthetic | Mean Proportion Correct Features | 96 | 8 |
| Feature Selection | Synthetic 1000x10-3 +5NF | Mean Correct Feature Proportion | 95 | 8 |
| Feature Selection | 1000x10-5 +5NF synthetic | Mean Proportion Correct Features | 92 | 8 |
| Feature Selection | 1000x10-10 +5NF synthetic | Mean Proportion Correct | 96 | 8 |
Showing 10 of 22 rows
