MixStatSeq

Mixture-based procedures for statistical analysis of RNA-seq data

March 1, 2014

Coordinator: Cathy Maugis-Rabusseau
Reference: ANR-13-JS01-0001-01
Date: 03/2014-08/2018

Members

Gilles Celeux (INRIA Saclay-Ile-de-France)
Antoine Godichon-Baggioni (Post-doc 2016-2017)
Sandrine Laguerre (LISBP)
Béatrice Laurent-Bonneau (IMT/INSA Toulouse)
Clément Marteau (ICJ/Lyon 1)
Marie-Laure Martin-Magniette (AgroParisTech/URGV)
Cathy Maugis-Rabusseau (IMT/INSA Toulouse)
Andréa Rau (INRA Jouy-en-Josas)

Summary

In recent years, significant advances in next-generation sequencing technologies have made RNA sequencing (RNA-seq) a popular choice for studies of gene expression. Although microarrays and RNA-seq both aim to characterize transcriptional activity, the statistical tools developed for the analysis of the former are ill-suited to the latter. To date, methodological developments for RNA-seq data have mainly focused on normalization and differential analysis, but the testing procedures currently proposed lack power to detect differentially expressed genes, and little methodological research has been devoted to the identification of co-expressed genes in RNA-seq data. However, as costs for RNA-seq experiments continue to decrease, it is likely that such studies will replace the use of microarrays for many applications involving investigations of the transcriptome. It is therefore crucial to pursue research on the development of statistical methods that allow biologists to exploit RNA-seq data.
In the MixStatSeq project, we focus on three main biological questions for RNA-seq data: (i) the detection of differentially expressed genes, (ii) the detection of co-expressed gene clusters, and (iii) the detection of invariant genes, i.e., those with stable expression across several biological conditions. To address these three questions, we propose to develop a suite of statistically sound methods based on mixture models.
For the analysis of differential expression, two points of view are envisaged. In the first, we aim to construct a powerful testing procedure by first performing a gene clustering step, followed by a testing procedure for each subgroup of genes and a correction for multiple testing. In the second, we will investigate model-based clustering procedures that directly cluster genes into groups representing differential and non-differential expression.
For the detection of co-expressed gene clusters, we will extend our preliminary work on the use of mixture models. In particular, as the number of RNA-seq experiments will continue to increase in the coming years, it is crucial to develop variable selection procedures, as well as to incorporate external biological knowledge, in order to improve the interpretability of gene clustering. For the detection of invariant genes, we aim to develop a non-asymptotic multiple hypothesis testing procedure to test a single distribution against a mixture of distributions, and to study its theoretical properties to ensure a powerful test. Beyond the biological application, such a development is a difficult theoretical challenge. Throughout the MixStatSeq project, the team will foster collaborations with biologists from several laboratories to validate the chosen models and test the developed approaches on real RNA-seq data obtained from different organisms. The originality of the MixStatSeq project lies in the continuous exchange between theoretical, methodological and applied research, including assessment by biologists, in order to ensure the immediate potential impact of the developed procedures. Moreover, beyond the study of RNA-seq data, this project will provide new theoretical and methodological knowledge for the study of count data with mixtures.

Publications

Mondet, F., Rau, A., Klopp, C., Rohmer, M., Severac, D., Le Conte, Y. and Alaux, C. (2018). Transcriptome profiling of the honeybee parasite Varroa destructor provides new biological insights into the mite adult life cycle. BMC Genomics, 19(1):328.
Celeux, G., Maugis-Rabusseau, C. and Sedki, M. (2018). Variable selection in model-based clustering and discriminant analysis with regularization approach. Accepted in Advances in Data Analysis and Classification. [Hal-01053784]
Godichon-Baggioni, A., Maugis-Rabusseau, C. and Rau, A. (2019). Clustering transformed compositional data using K-means, with applications in gene expression and bicycle sharing system data. Journal of Applied Statistics, 46, 47-65.
Sauvage, C., Rau, A., Aichholz, C., Chadoeuf, J., Sarah, G., Ruiz, M., Santoni, S., Causse, M., David, J. and Glémin, S. (2017). Domestication rewired gene expression and nucleotide diversity patterns in tomato. The Plant Journal, 91(4):631-645.
Laurent, B., Marteau, C. and Maugis-Rabusseau, C. (2018). Multidimensional two-component Gaussian mixtures detection. Annales de l’IHP (série B), 54(2), 842-865.
Rau, A. and Maugis-Rabusseau, C. (2018). Transformation and model choice for RNA-seq co-expression analysis. Briefings in Bioinformatics, 19(3), 425-436.
Gadat, S., Kahn, J., Marteau, C. and Maugis-Rabusseau, C. (2020). Parameter recovery in two-component contamination mixtures: the L2 strategy. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 56(2), 1391-1418.
Rigaill, G., Balzergue, S., Brunaud, V., Blondet, E., Rau, A., Rogier, O., Caius, J., Maugis-Rabusseau, C., Soubigou-Taconnat, L., Aubourg, S., Lurin, C., Martin-Magniette, M.-L. and Delannoy, E. (2016). Synthetic datasets for the identification of key ingredients for RNA-seq differential analysis. Briefings in Bioinformatics, 19(1), 65-76.
Biernacki, C. and Maugis-Rabusseau, C. (2017). Chapter 9: “High-dimensional clustering”. In Model choice and Model aggregation, edited by F. Bertrand, J.-J. Droesbeke, G. Saporta and C. Thomas-Agnan. Éditions Technip, September 2017.
Martin-Magniette, M.-L., Maugis-Rabusseau, C. and Rau, A. (2017). Chapter 10: “Clustering of co-expressed genes”. In Model choice and Model aggregation, edited by F. Bertrand, J.-J. Droesbeke, G. Saporta and C. Thomas-Agnan. Éditions Technip, September 2017.
Rau, A., Maugis-Rabusseau, C., Martin-Magniette, M.-L. and Celeux, G. (2015). Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics, 31(9), 1420-1427.
Laurent, B., Marteau, C. and Maugis-Rabusseau, C. (2016). Non asymptotic detection of two component mixtures with unknown means. Bernoulli, 22(1), 242-274.

Software

The Bioconductor package coseq
The R package HTSCluster
The R package HTSDiff
The R package SelvarMix

Workshops

Closing workshop

The closing workshop of the ANR MixStatSeq project took place on June 21 and 22, 2018 in Paris (AgroParisTech), on the theme of mixture models, from theory to practice.

Registration is free but mandatory.

All the information is available here

Working day: Hypothesis testing and mixtures

A day of talks on the theme “hypothesis testing and mixture models” is being held on Tuesday, April 5, 2016 at the Institut de Mathématiques de Toulouse, as part of the ANR MixStatSeq project.

Speakers:

Cristina Butucea (Université Paris-Est Marne-la-Vallée)
Judith Rousseau (Université Paris-Dauphine)
Nicolas Verzelen (INRA Montpellier)
Clément Marteau (Université Lyon 1)

Organizers: Béatrice Laurent, Clément Marteau and Cathy Maugis-Rabusseau

Programme:

9h00-9h30: Welcome
9h30-10h30: Talk by Cristina Butucea

Title: Mixture models with symmetric errors
Abstract: A semiparametric mixture of two populations with the same probability density and different locations can be identified and estimated under the assumption that the common probability density is symmetric. We use identifiability results in Bordes et al. (2006) and propose a new estimation algorithm based on techniques from the theory of inverse problems. We next consider a semiparametric mixture of regression models and study its identifiability. We propose a procedure for estimating the mixing proportion and the location functions locally at a fixed point. Our estimation procedure is based on the symmetry of the error distribution and does not require the errors to have finite moments. Under mild conditions, we establish minimax properties and asymptotic normality of our estimators. We study the finite-sample performance on synthetic data and on the positron emission tomography imaging data from a cancer study in Bowen et al. (2012).
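
For readers who prefer symbols, the two-population model described in this abstract can be sketched as below. The notation is an assumption introduced here for illustration (p for the mixing proportion, g for the common density, μ_1 and μ_2 for the two locations) and is not taken from the talk itself.

```latex
% Assumed notation: p = mixing proportion, g = common symmetric density,
% \mu_1, \mu_2 = locations of the two populations.
f(x) = p\, g(x - \mu_1) + (1 - p)\, g(x - \mu_2),
\qquad g(-x) = g(x) \ \text{for all } x,
\quad p \in (0, 1), \quad \mu_1 \neq \mu_2 .
```

Only the symmetry of g is assumed; g itself is left unspecified, which is what makes the model semiparametric.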

10h30-10h50: Coffee break
10h50-11h50: Talk by Judith Rousseau (Slides)

Title: Testing hypotheses via a mixture estimation model
Abstract: We consider a novel paradigm for Bayesian testing of hypotheses and Bayesian model comparison. Our alternative to the traditional construction of posterior probabilities that a given hypothesis is true, or that the data originate from a specific model, is to consider the models under comparison as components of a mixture model. We therefore replace the original testing problem with an estimation problem that focuses on the probability weight of a given model within a mixture model. We analyze the sensitivity of the resulting posterior distribution of the weights to various prior models on the weights. We stress that a major appeal of this novel perspective is that generic improper priors are acceptable, without putting convergence in jeopardy. Among other features, this allows for a resolution of the Lindley-Jeffreys paradox. When using a reference Beta B(a,a) prior on the mixture weights, we note that the sensitivity of the posterior estimates of the weights to the choice of a vanishes as the sample size increases, and we advocate the default choice a=0.5, derived from Rousseau and Mengersen (2011). Another feature of this easily implemented alternative to the classical Bayesian solution is that the speeds of convergence of the posterior mean of the weight and of the corresponding posterior probability are quite similar.
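
As a rough sketch of the construction described in this abstract, suppose the two models under comparison have densities f_1(· | θ_1) and f_2(· | θ_2). These symbols, and α for the mixture weight, are notation assumed here for illustration; only the Beta B(a,a) prior and the default a=0.5 come from the abstract.

```latex
% Assumed notation: f_1, f_2 = densities of the two competing models,
% \theta_1, \theta_2 = their parameters, \alpha = weight of the first model.
x_1, \dots, x_n \ \overset{\text{i.i.d.}}{\sim}\
\alpha\, f_1(x \mid \theta_1) + (1 - \alpha)\, f_2(x \mid \theta_2),
\qquad \alpha \sim \mathrm{Beta}(a, a), \quad \text{e.g. } a = 0.5 .
```

Under this embedding, the question of which model holds is addressed through the posterior distribution of α instead of through posterior model probabilities.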

12h-14h: Lunch
14h-15h: Talk by Nicolas Verzelen (Slides)

Title: Detection of sparse Gaussian mixture models and unsupervised classification
Abstract: The aim of this talk is to compare the difficulty of two statistical problems: (i) supervised classification and (ii) unsupervised classification. To do so, we adopt a model-based approach and consider high-dimensional Gaussian mixture models. After a selective review of the literature on mixture models, we turn to optimal classification rates, in order to characterize the regimes in which consistent classification is achievable.

15h00-15h20: Coffee break
15h20-16h20: Talk by Clément Marteau [(Slides)](/MixStatSeqworkshop/Slides-Marteau-Toulouse.pdf)

Title: Multidimensional two-component Gaussian mixtures detection
Abstract: Let (X_1,…,X_n) be a d-dimensional i.i.d. sample from a distribution with density f. The problem of detecting a two-component mixture is considered. Our aim is to test whether f is the density of a standard Gaussian random d-vector (f=ϕ_d) against the alternative that f is a two-component mixture, f=(1−ε)ϕ_d+εϕ_d(.−μ), where (ε,μ) are unknown parameters. Optimal separation conditions on ε, μ, n and the dimension d are established, making it possible to separate the two hypotheses with prescribed errors. Several testing procedures are proposed and two alternative subsets are considered.
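
The testing problem stated in this abstract can be written out as follows. The labels H_0 and H_1 and the explicit parameter ranges are assumptions added here for readability; the densities themselves are those given in the abstract.

```latex
% \phi_d is the standard Gaussian density on R^d.
% The labels H_0, H_1 and the ranges of \varepsilon and \mu are assumed for illustration.
\phi_d(x) = (2\pi)^{-d/2} \exp\!\left( -\tfrac{1}{2} \lVert x \rVert^2 \right),
\qquad
H_0 : f = \phi_d
\quad \text{against} \quad
H_1 : f = (1 - \varepsilon)\, \phi_d + \varepsilon\, \phi_d(\cdot - \mu),
\quad \varepsilon \in (0, 1), \ \mu \in \mathbb{R}^d \setminus \{0\} .
```

The alternative moves towards the null as ε or the norm of μ shrinks, which is why separation conditions involving ε, μ, n and d are needed to tell the two hypotheses apart with prescribed errors.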

Registration: Registration is free. If you wish to attend this day of talks, please send an email to cathy.maugis-at-insa-toulouse.fr with the subject “Inscription 5 avril” (and indicate whether you will be having lunch).