(528d) Imputation of Single-Cell Expression Data
Recently, the rapid emergence and development
of single-cell technologies have enabled a tremendous boost in the study of
biological processes such as cell differentiation 1
and reprogramming 2.
In particular, single-cell RNA-sequencing (scRNA-seq) is capable of quantifying
the RNA abundance in hundreds or even thousands of individual cells in a single
This technology has led to the identification and characterization of new cell
types within heterogeneous tissues and to the reconstruction of developmental
trees overcoming to the limitation of bulk RNA-sequencing techniques, which are
not able to detect cell-to-cell variability 4.
One drawback of the analysis at single cell resolution compared to
population-averaged gene expression is the extremely low availability of mRNA
molecules, which often leads to a failure in the detection of expressed genes
(i.e. dropout events). These missing values are usually represented by zeros
and mixed with true zero values (i.e. when the gene is not expressed). The
traditional statistical tools used for the analysis of expression datasets such
as dimensional reduction techniques (PCA, ICA, t-SNE, etc.) and differential
gene expression algorithms do not accommodate the dropout events, consequently
they might give rise to misleading findings. Recently, several
methods were developed in order to account for the presence of dropouts. More
specifically we can distinguish two main classes: approaches based on
zero-inflated models such as ZIFA 5
and ZINB-WaVE 6,
which provide a low-dimensional visualization of the data; strategies implemented
for the replacement of dropout events (i.e. by imputation) with non-zeros
entries representing the undetected gene expression level based on cell
similarities such as CIDR 7,
and DrImpute 10.
In this work, we
implemented a two-state model-based algorithm for the estimation of dropouts that
naturally accommodate for the stochastic and bursty behavior of gene expression
in single cell data. More specifically, the two-state model describes the gene
expression processes involving (1) the promoter switching between ON and OFF
state, (2) in the OFF state (a closed chromatin state), the gene is not
accessible for the transcription and (3) in the ON state (an open chromatin
state), the gene transcription could occur, producing the mRNA molecules in
A total of four kinetic parameters fully describe the two-state model, namely Kon
(rate of activation), Koff
(rate of inactivation), Kt
(rate of transcription) and Kd
(rate constant of degradation):
Based on previous
dropout events occur more frequently in gene expressed at lower levels than
those expressed at a higher magnitude. Hence, to account for zero inflation,
we modeled the dropout rate as a negative exponential function with decay
constant a. The final probability distribution of the observed mRNA
count m can be described by a mixture of two components:
where the first component represents the dropout
probability function and the second one is the distribution of the actual
(true) gene expression mtrue based on the two state model
The main idea behind our approach is that similar cells
should share the same parameters of the two-state model. Such similarity can be
defined by the user based on prior knowledge (e.g. cell type) or by employing
our previous clustering algorithm called CALISTA 13.
More specifically, our method starts by assigning cells into a priori-defined
groups (e.g. cell clusters or cell types); then it estimates the optimal a Â (over
all cells) and parameters K(
one set for each gene and cell group) by maximizing the probability
(likelihood) for each cell to be in the corresponding group based on the cellâs
gene expression data. Here, we assumed that the dropout rate follows the same
distribution profile over all cells (i.e. cells sampled from the same
experimental population). However, in presence of technical variations (e.g. cells
measured from different batches), our strategy is able to account for these
undesired effects by estimating different a during the optimization
In the next step, we employed the
optimal parameter sets estimated to replace the dropout values as follows: for
each cell group and each gene we imputed zeros, which might be either missing
values or true biological zeros, by randomly sampling new mRNA counts from the
distribution described by the corresponding optimal a.
To test the efficacy of our
algorithm, we applied it on two publicly available
scRNA-seq datasets 2,14,
including Usoskin et al. study using single mouse neural cells 14 and
Treutlein et al. study on mouse embryonic fibroblast differentiation
into neurons 2.
Figure 1 and 2 depict the results of our algorithm on the aforementioned
datasets by using, as cell assignments for the dropout estimation and
imputation, the clustering outcome of CALISTA 13
with number of clusters equals to 4. Our imputation approach improves
significantly the visualization of the data and it is able to distinguish clearly
the cell subpopulations (Fig. 1-2) and the differentiation trajectory 2,13
S. et al. Dynamics of lineage commitment revealed by single-cell
transcriptomics of differentiating embryonic stem cells. Nat. Commun. 8,
B. et al. Dissecting direct reprogramming from fibroblast to neuron
using single-cell RNA-seq. Nature 534, 391â395 (2016).
C. et al. Comparative Analysis of Single-Cell RNA Sequencing Methods. Mol.
Cell 65, 631â643.e4 (2017).
O., Teichmann, S. A. & Marioni, J. C. Computational and analytical
challenges in single-cell transcriptomics. Nat. Rev. Genet. 16,
E. et al. ZIFA: Dimensionality reduction for zero-inflated single-cell
gene expression analysis. Genome Biol. 16, 241 (2015).
D., Perraudeau, F., Gribkova, S., Dudoit, S. & Vert, J.-P. A general and
flexible method for signal extraction from single-cell RNA-seq data. Nat.
Commun. 9, 284 (2018).
P., Troup, M. & Ho, J. W. K. CIDR: Ultrafast and accurate clustering
through imputation for single-cell RNA-seq data. Genome Biol. 18,
D. van et al. MAGIC: A diffusion-based imputation method reveals
gene-gene interactions in single-cell RNA-sequencing data. bioRxiv
111591 (2017). doi:10.1101/111591
W. V. & Li, J. J. An accurate and robust imputation method scImpute for
single-cell RNA-seq data. Nat. Commun. 9, 997 (2018).
I.-Y., Gong, W., Koyano-Nakagawa, N. & Garry, D. DrImpute: Imputing dropout
events in single cell RNA sequencing data. bioRxiv 181479 (2017).
B., Neuert, G. & van Oudenaarden, A. Using Gene Expression Noise to
Understand Gene Regulation. Science (80-. ). 336, (2012).
P. V, Silberstein, L. & Scadden, D. T. Bayesian approach to single-cell
differential expression analysis. Nat. Methods 11, 740â742
Gao, N., Hartmann, T. & Gunawan, R. CALISTA: Clustering And Lineage
Inference in Single-Cell Transcriptional Analysis. bioRxiv 257550
D. et al. Unbiased classification of sensory neuron types by large-scale
single-cell RNA sequencing. Nat. Neurosci. 18, 145â153 (2014).
1: PCA plot calculated from raw (up) and imputed (down) expression data of
mouse embryonic fibroblast differentiation into neurons in Treutlein et al.
2: PCA plot calculated from raw (up) and imputed (down) expression data Usoskin et al.
study using single mouse neural cells 14.