(182n) An MCMC-Based Approach to Inferring Cell Counts in Diseased Tissue | AIChE


Wang, M. - Presenter, University of Pittsburgh
Shoemaker, J. E., University of Pittsburgh

Currently, gene expression data is highly underutilized. It is primarily used either to rapidly probe the genome for gene activity associated with specific phenotypes or to infer gene regulatory networks active within pure cell cultures. However, gene expression data derived from complex tissue contains dynamic information on changes in intracellular signaling and on changes in the cellular makeup of the sample itself. Engineers and scientists must develop methods to maximally exploit biological data for model construction.

We and others have proposed that gene expression data can be used to quantify the dynamics of immune cells during disease. The immune system coordinates the activity of diverse immune cells and intracellular signaling pathways to identify and protect against disease. Aberrant immune responses have been associated with different influenza infection outcomes [1]. However, a systems-level understanding of a healthy immune response is needed to improve treatment of severe infections. In this work, we present a novel algorithm for inferring cell counts from influenza-infected lung gene expression data and compare the performance of our algorithm to the current state-of-the-art inference techniques. Lastly, we discuss how current algorithms can be integrated into mathematical model development protocols.


Transcriptional profiling of diseased tissues and cells has resulted in large, publicly available gene expression datasets. Although changes in gene transcript levels in tissue may be due to both gene regulation and changes in cell populations, most research has focused on identifying genes differentially expressed between phenotypes using traditional statistical tests, or on pathway activity inferred using a variety of methods. Inferring changes in cellular demographics remains a challenge, and no method to date has focused on temporal (e.g. time-course) samples.

Multiple deconvolution algorithms have been developed to predict cellular demographics from gene expression data. Basic deconvolution postulates that the expression profile of a cell mixture is a linear combination of the expression profiles of its pure cell types. Thus, linear regression tools are used to infer cell proportions in a given sample. Examples include modified linear least-squares regression (modified LLSR) [2], cell-type identification by estimating relative subsets of RNA transcripts with support vector regression (CIBERSORT) [3], and digital cell quantification with elastic net regularization (DCQ) [4]. These tools are designed to deconvolve one sample at a time. Therefore, to estimate cell dynamics, they must be run separately for each time point. Furthermore, the predictions are accurate only if all cell types in the sample and their pure-cell expression data are known a priori.
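The linear mixing assumption described above can be illustrated with a minimal sketch. It uses ordinary least squares with a crude clip-and-renormalize step; it is not the modified LLSR, CIBERSORT, or DCQ procedure, and the function names and the synthetic data are hypothetical.

```python
import numpy as np

def deconvolve(signatures, mixture):
    """Estimate cell-type proportions for one mixed sample.

    signatures: (n_genes, n_cell_types) pure-cell expression profiles.
    mixture:    (n_genes,) expression profile of the mixed sample.
    """
    # Ordinary least squares fit of: mixture ~ signatures @ coefficients.
    coeffs, *_ = np.linalg.lstsq(signatures, mixture, rcond=None)
    # Assumption: clip negative coefficients and rescale so they sum to 1.
    coeffs = np.clip(coeffs, 0.0, None)
    total = coeffs.sum()
    if total == 0:
        raise ValueError("no cell type has a positive coefficient")
    return coeffs / total

def deconvolve_time_course(signatures, mixtures):
    """Run the single-sample estimate independently at each time point."""
    return np.array([deconvolve(signatures, m) for m in mixtures])

# Synthetic, noise-free example: 50 genes, 3 cell types, 2 time points.
rng = np.random.default_rng(0)
signatures = rng.uniform(1.0, 10.0, size=(50, 3))
true_props = np.array([[0.6, 0.3, 0.1],
                       [0.2, 0.5, 0.3]])
mixtures = true_props @ signatures.T  # shape (2, 50)
estimated = deconvolve_time_course(signatures, mixtures)
```

Because each time point is fitted on its own, nothing in this sketch ties the estimate at one time point to its neighbors, and the fit presumes that every cell type in the sample has a column in `signatures`.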

Another approach is to apply dynamic clustering to identify co-regulated modules (groups) of genes and then apply bioinformatics approaches to associate each module with specific cell populations, e.g. cell type enrichment (CTen) [5]. Different from the former deconvolution approach, CTen implements only one computation for time-series data. Although it determines relative fold changes as opposed to cell proportions, CTen requires less prior knowledge of sample composition.

The reported demonstrations of these deconvolution algorithms were mostly conducted on artificial datasets or on cell mixtures with simple compositions. When we applied them to complex, time-course data, we found that their accuracy was insufficient for several key immune cell types (including CD8+ T cells, CD4+ T cells, B cells, and lung resident macrophages). Modified LLSR, CIBERSORT, and DCQ do not consider the dependency and contiguity among time points. CTen, in contrast, takes into account the time-course profile of each gene. Tested on the same temporal data, it accurately predicted the dynamic changes of some immune cell types (showing high correlation for lung resident macrophages), but it showed lower accuracy for a few other immune cell subsets, which might be due to its gene marker database. Based on these results, we have developed a novel approach to cell count inference that better predicts changes in cell quantity in tissue across time using temporal transcriptomic data.


We constructed a novel cell signature database from data available from ImmGen [6]. Expression intensities of transcripts were averaged across replicates and mapped to their associated genes. To reduce computational complexity, genes with small variation across the cell type library were removed. To find the genes that best distinguish a given cell type, we formulated an optimization problem that seeks a set of K genes minimizing the correlation coefficient between this cell type and all others. We used MCMC to identify multiple solutions that satisfactorily reduce the cost function. In doing so, we created a database of cell signatures in which each cell type has three cell signature sets. The cell signature sets allow us to provide multiple estimates of the cell counts together with a confidence score.
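The gene-selection step described above can be sketched as a Metropolis search over sets of K genes. The abstract does not give the exact cost function, proposal move, or acceptance rule, so everything below is an assumption: the cost is taken to be the mean Pearson correlation between the target cell type and each other cell type over the selected genes, the proposal swaps one selected gene for an unselected one, and the function names, temperature, and random data are hypothetical.

```python
import numpy as np

def signature_cost(expr, genes, target):
    """Mean Pearson correlation between the target cell type and every
    other cell type, computed over the selected genes only (assumed cost)."""
    sub = expr[genes]                      # (K, n_cell_types)
    corr = np.corrcoef(sub, rowvar=False)  # correlations between columns
    return np.delete(corr[target], target).mean()

def mcmc_select(expr, target, k, n_steps=2000, temperature=0.05, seed=0):
    """Metropolis search for a k-gene set with low signature_cost."""
    rng = np.random.default_rng(seed)
    n_genes = expr.shape[0]
    current = rng.choice(n_genes, size=k, replace=False)
    current_cost = signature_cost(expr, current, target)
    best, best_cost = current.copy(), current_cost
    for _ in range(n_steps):
        # Symmetric proposal: swap one selected gene for an unselected one.
        proposal = current.copy()
        outside = np.setdiff1d(np.arange(n_genes), current)
        proposal[rng.integers(k)] = rng.choice(outside)
        cost = signature_cost(expr, proposal, target)
        # Metropolis rule: always accept improvements; accept a worse set
        # with probability exp(-(cost increase) / temperature).
        accept = cost <= current_cost or (
            rng.random() < np.exp((current_cost - cost) / temperature)
        )
        if accept:
            current, current_cost = proposal, cost
            if cost < best_cost:
                best, best_cost = proposal.copy(), cost
    return np.sort(best), best_cost

# Hypothetical library: 200 genes by 5 cell types of random expression.
rng = np.random.default_rng(1)
expr = rng.normal(size=(200, 5))
# Three independent chains give three candidate signature sets for cell type 0.
signature_sets = [mcmc_select(expr, target=0, k=10, seed=s) for s in range(3)]
```

Running independent chains with different seeds is only one way to obtain several signature sets per cell type; the source does not say how its three sets were chosen.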

To evaluate the MCMC-based inference algorithm, differentially expressed genes (DEGs) were obtained from microarray data of influenza-infected mouse lung tissue [7]. Cell enrichment of the DEGs was detected by comparison with the cell signature database using Fisher's exact test, and log fold changes of immune cell counts were assumed to correspond to log fold changes of the mean expression profiles of the cell markers. For comparison, we applied four common algorithms to the same lung tissue dataset: modified LLSR, CIBERSORT, DCQ, and CTen. Algorithm performance was measured by the sum of squared errors.
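The three quantities named above can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation pipeline itself: the enrichment test is written as a one-sided Fisher's exact test (the hypergeometric upper tail), the log fold change is taken as log2 of the ratio of mean marker expression, and all function names and numbers are hypothetical.

```python
from math import comb
import numpy as np

def enrichment_p_value(n_universe, n_signature, n_deg, n_overlap):
    """One-sided Fisher's exact test: probability of seeing at least
    n_overlap signature genes among n_deg genes drawn from n_universe."""
    max_overlap = min(n_signature, n_deg)
    tail = sum(
        comb(n_signature, i) * comb(n_universe - n_signature, n_deg - i)
        for i in range(n_overlap, max_overlap + 1)
    )
    return tail / comb(n_universe, n_deg)

def marker_log_fold_change(expr_time, expr_baseline, marker_idx):
    """Assumed estimate: log2 ratio of mean marker expression at a time
    point relative to baseline, used as the cell-count log fold change."""
    return float(np.log2(expr_time[marker_idx].mean()
                         / expr_baseline[marker_idx].mean()))

def sum_squared_error(predicted, measured):
    """Sum of squared differences between predicted and measured values."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(measured, dtype=float)
    return float(np.sum(diff ** 2))

# Toy numbers: 20 genes in total, a 5-gene signature, 5 DEGs, all 5 overlap.
p = enrichment_p_value(n_universe=20, n_signature=5, n_deg=5, n_overlap=5)

baseline = np.array([1.0, 2.0, 3.0, 4.0])
day_3 = np.array([2.0, 4.0, 3.0, 4.0])
lfc = marker_log_fold_change(day_3, baseline, marker_idx=[0, 1])

sse = sum_squared_error(predicted=[lfc, 0.5], measured=[0.5, 0.5])
```

A small `p` suggests that the DEG list is enriched for the signature; the predicted log fold changes can then be scored against measured cell-count changes with `sum_squared_error`.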

Conclusions and Discussion

Here we present a new approach for inferring immune cell population changes across time. Compared with other popular algorithms, our method demonstrated improved prediction accuracy for the log fold change of several common immune cells. This improved quantification of cell count changes in samples may help computational biologists characterize immune responses and/or disease pathology and mine disease transcriptomics data. Future work will integrate this inference approach with other biological data to promote tissue-level systems inference. Furthermore, the applications of our model extend beyond inference. The cell signature database we generated contains unique information for each immune cell population. Studying these signatures, combined with predictions of dynamic changes for key immune cells in disease, may provide new insights into potential biomarkers for disease diagnosis.


  1. Cilloniz, C., et al., Lethal influenza virus infection in macaques is associated with early dysregulation of inflammatory related genes. PLoS Pathog, 2009. 5(10): p. e1000604.
  2. Abbas, A.R., et al., Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. PLoS One, 2009. 4(7): p. e6098.
  3. Newman, A.M., et al., Robust enumeration of cell subsets from tissue expression profiles. Nat Methods, 2015. 12(5): p. 453-7.
  4. Altboum, Z., et al., Digital cell quantification identifies global immune cell dynamics during influenza infection. Mol Syst Biol, 2014. 10: p. 720.
  5. Shoemaker, J.E., et al., CTen: a web-based platform for identifying enriched cell types from heterogeneous microarray data. BMC Genomics, 2012. 13: p. 460.
  6. Heng, T.S., M.W. Painter, and C. Immunological Genome Project, The Immunological Genome Project: networks of gene expression in immune cells. Nat Immunol, 2008. 9(10): p. 1091-4.
  7. Shoemaker, J.E., et al., An Ultrasensitive Mechanism Regulates Influenza Virus-Induced Inflammation. PLoS Pathog, 2015. 11(6): p. e1004856.