Conference paper

MixtComp software: Model-based clustering/imputation with mixed data, missing data and uncertain data

Christophe Biernacki 1, 2
1 MODAL - MOdel for Data Analysis and Learning
Inria Lille - Nord Europe, LPP - Laboratoire Paul Painlevé - UMR 8524, METRICS - Evaluation des technologies de santé et des pratiques médicales - ULR 2694, Polytech Lille - École polytechnique universitaire de Lille, Université de Lille, Sciences et Technologies
Abstract: The "Big Data" paradigm involves large and complex data sets. Complexity covers both variety (mixed data: continuous, categorical, ordinal, functional...) and missing or partially missing (binned) items. Clustering is a suitable response to volume, but it must also deal with complexity, especially since volume fosters the emergence of complexity. Model-based clustering has demonstrated many theoretical and practical successes (McLachlan 2000), including for multivariate mixed data with (Biernacki 2013) or without (Marbac et al. 2014) conditional independence. In addition, this fully generative design makes it straightforward to handle missing or binned data (McLachlan 2000; Biernacki 2007). Model estimation can also be performed by simple EM-like algorithms, such as SEM (Celeux and Diebolt 1985). MixtComp is a new R software, written in C++, that implements model-based clustering for multivariate missing/binned/mixed data under the conditional independence assumption (Goodman 1974). The data types currently implemented are continuous (Gaussian), categorical (multinomial) and integer (Poisson). However, the architecture of MixtComp is designed for the incremental addition of new kinds of data (ordinal, ranks, functional...) and related models. MixtComp is not currently freely available as an R package, but it will soon be freely available through a dedicated web interface. Beyond clustering, it can also impute missing/binned data (with associated confidence intervals) by exploiting the mixture model's ability to estimate densities.

Topics will include: mixture models, conditional independence, SEM algorithm, model selection criteria.

Prerequisites: elementary knowledge of general statistical concepts, mixture models, the EM algorithm and standard model selection criteria is assumed. Basic programming in R is also useful.
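To make the ingredients named in the abstract concrete, the following is a minimal sketch of an SEM loop for a two-variable mixture (one Gaussian variable, one Poisson variable) under conditional independence, with missing cells encoded as `None`. It is not MixtComp's implementation: the function names (`sem`, `posterior`, `impute_x`), the initialisation, the smoothed proportions, the averaging of post burn-in iterates and the missing-at-random assumption are all illustrative choices made here.

```python
import math
import random
import statistics

def gauss_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def pois_pmf(n, lam):
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))

def draw_poisson(lam, rng):
    # Knuth's multiplication method; fine for the small rates used here.
    limit, count, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def posterior(row, params):
    # Class membership probabilities using only the observed cells of the row:
    # under conditional independence, a missing cell simply drops out.
    x, n = row
    w = []
    for pi, mu, sd, lam in params:
        d = pi
        if x is not None:
            d *= gauss_pdf(x, mu, sd)
        if n is not None:
            d *= pois_pmf(n, lam)
        w.append(d)
    total = sum(w)
    return [v / total for v in w] if total > 0 else [1.0 / len(w)] * len(w)

def sem(data, k=2, n_iter=200, burn_in=50, seed=0):
    rng = random.Random(seed)
    xs = [x for x, _ in data if x is not None]
    ns = [n for _, n in data if n is not None]
    lo, hi = min(xs), max(xs)
    # Crude start (assumption): means spread over the range, shared scale and rate.
    params = [(1.0 / k, lo + (j + 0.5) * (hi - lo) / k,
               statistics.pstdev(xs) or 1.0, max(statistics.mean(ns), 1e-3))
              for j in range(k)]
    sums = [[0.0] * 4 for _ in range(k)]
    for it in range(n_iter):
        # SE-step: draw a class for each row, then draw its missing cells.
        completed = []
        for x, n in data:
            z = rng.choices(range(k), weights=posterior((x, n), params))[0]
            _, mu, sd, lam = params[z]
            completed.append((z,
                              rng.gauss(mu, sd) if x is None else x,
                              draw_poisson(lam, rng) if n is None else n))
        # M-step: closed-form updates on the completed data.
        new_params = []
        for j in range(k):
            members = [(x, n) for z, x, n in completed if z == j]
            pi = (len(members) + 1) / (len(data) + k)  # smoothed, never zero
            if len(members) < 2:  # nearly empty class: keep previous parameters
                new_params.append((pi,) + params[j][1:])
                continue
            mx = statistics.mean(x for x, _ in members)
            sd = max(statistics.pstdev(x for x, _ in members), 1e-3)
            lam = max(statistics.mean(n for _, n in members), 1e-3)
            new_params.append((pi, mx, sd, lam))
        params = new_params
        if it >= burn_in:
            for j in range(k):
                for c in range(4):
                    sums[j][c] += params[j][c]
    # SEM does not converge pointwise; average the post burn-in iterates.
    return [tuple(s / (n_iter - burn_in) for s in row) for row in sums]

def impute_x(n, params):
    # Posterior-mean imputation of a missing continuous cell given the count.
    t = posterior((None, n), params)
    return sum(tk * mu for tk, (_, mu, _, _) in zip(t, params))
```

Calling `sem(data)` on a list of `(x, n)` pairs returns one `(proportion, mean, sd, rate)` tuple per class, and `impute_x(n, params)` then fills a missing continuous value from the fitted mixture. The sketch returns point imputations only; the confidence intervals mentioned in the abstract would require keeping the values drawn in the SE-step rather than just the averaged parameters.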

https://hal.archives-ouvertes.fr/hal-01253393
Contributor: Christophe Biernacki
Submitted on: Sunday, January 10, 2016 - 10:17:41
Last modified on: Friday, November 27, 2020 - 14:18:02

Identifiers

  • HAL Id: hal-01253393, version 1

Citation

Christophe Biernacki. MixtComp software: Model-based clustering/imputation with mixed data, missing data and uncertain data. MISSDATA 2015, Jun 2015, Rennes, France. ⟨hal-01253393⟩
