A sparse variable selection procedure in model-based clustering

Abstract

With the growing prevalence of high-dimensional datasets, variable selection for clustering has become an important challenge. In the context of Gaussian mixture clustering, we recast the variable selection problem as a general model selection problem. Our procedure first uses an ℓ1-regularization method to build a data-driven model subcollection. Second, the maximum likelihood estimators (MLEs) are obtained using the EM algorithm. Next, a non-asymptotic penalized criterion is proposed to select the number of mixture components and the relevant clustering variables simultaneously. A general model selection theorem for MLEs with a random model collection is established. It allows one to derive the penalty shape of the criterion, which depends on the complexity of the random model collection. In practice, the criterion is calibrated using the so-called slope heuristics. The resulting procedure is illustrated on two simulated examples. Finally, an extension to a more general modeling of irrelevant clustering variables is presented.
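
The calibration step can be illustrated with a minimal sketch of the slope heuristics. It rests on assumptions that are not taken from the abstract: the penalty is taken to be proportional to D_m / n (model dimension over sample size), which is a simplification of a penalty shape that depends on the complexity of the model collection; the function name `slope_heuristic_select`, its arguments, and the `frac` tuning parameter are hypothetical. The idea is that, for the most complex models, the normalized negative log-likelihood decreases roughly linearly in D_m / n; the absolute value of that slope estimates a minimal penalty constant, and the final penalty is twice the minimal one.

```python
import numpy as np

def slope_heuristic_select(dims, logliks, n, frac=0.5):
    """Select a model with a simple slope-heuristics calibration.

    dims    : model dimensions D_m (numbers of free parameters)
    logliks : maximized log-likelihoods, one per model
    n       : sample size
    frac    : fraction of the most complex models used to fit the slope
              (hypothetical tuning parameter, an assumption of this sketch)
    """
    dims = np.asarray(dims, dtype=float)
    logliks = np.asarray(logliks, dtype=float)

    # Normalized empirical contrast: -loglik / n.
    contrast = -logliks / n

    # Keep the k most complex models (at least two, so a line can be fitted).
    order = np.argsort(dims)
    k = max(2, int(np.ceil(frac * len(dims))))
    top = order[-k:]

    # Least-squares line of contrast against D_m / n on those models.
    slope, _intercept = np.polyfit(dims[top] / n, contrast[top], 1)
    kappa_min = -slope  # estimated minimal penalty constant

    # Penalized criterion with twice the minimal penalty.
    crit = contrast + 2.0 * kappa_min * dims / n
    return int(np.argmin(crit)), kappa_min
```

In this sketch, `dims` and `logliks` would come from the models of the data-driven subcollection after fitting each one with EM; the returned index identifies the selected pair (number of components, set of relevant variables) within that list.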