(548b) Cross-Tissue Drug Signature Predictions for Drug Repurposing | AIChE

Authors 

Chrysinas, P. - Presenter, University at Buffalo
Gunawan, R., SUNY Buffalo
Drug development is a lengthy and cost-intensive process with an extremely low rate of success (DiMasi et al., 2016). For rare diseases, the only avenue to finding treatment is drug repurposing, that is, finding a new therapeutic use for drugs that differs from their original indications (Pushpakom et al., 2019). Data-driven strategies play an important role in drug repurposing by mining and integrating literature, knowledge-base and omics (e.g., transcriptome) data to match and prioritize drugs for diseases. However, data-intensive approaches require an abundance of cellular signatures of drugs, preferably from the specific human tissue(s) affected by the disease. Recent community efforts have produced large-scale human transcriptome datasets from drug and chemical perturbations as well as from post-mortem human tissues. A notable dataset is the Connectivity Map (CMap) from the NIH Library of Integrated Network-based Cellular Signatures (NIH LINCS), which comprises more than 1.5 million drug signatures across 71 different cell lines and 20,000 drugs (Subramanian et al., 2017). Although impressive in size, these signatures are taken from immortalized cancer cell lines because of the challenges in obtaining and working with primary cells from most human tissues. Unfortunately, the drug responses of cancer cells commonly differ from those of normal human tissues.

In this work, we developed computational algorithms for predicting transcriptome signatures of drugs in target cell line(s) using data from other (source) cell lines. As illustrated in Figure 1A, we considered an imputation problem to generate in silico drug transcriptional signatures, in which the drug transcriptional data of the source cell lines (in blue) and the background data of the target cell lines (in green) do not have any overlapping conditions. In the intended application of our algorithm, the source dataset will come from CMap LINCS L1000 (drug-induced cell-specific transcriptomic data), while the background data for the target cell line will be compiled from the literature and other resources (e.g., Genotype-Tissue Expression (GTEx)). To this end, we created a two-step imputation method (see Figure 1B). The first step uses either simple averaging or a regression method to compute an in silico mean transcriptional signature for a drug at a specific drug load using the source dataset. The drug load (DL) was evaluated by multiplying the drug concentration by the duration of treatment (i.e., the time point at which the cell sample was collected). The simple average (DL mean) produces the in silico mean expression value for each gene by averaging the available values for the drug at the particular drug load from the source cell lines. In the regression method, by contrast, all data for the drug from the source cell lines are used to build a linear regression model for each gene, where the gene expression value is the dependent variable and the drug load is the independent variable. In the second step, we projected the in silico mean transcriptional profiles from the first step onto the latent space of the target cell line, which is reconstructed using the background data of the target cell line. In this work, we demonstrated the performance of our proposed strategy using PCA (principal component analysis) projection.
Because PCA cannot capture possible nonlinearity in the latent space, for each drug and drug load we identified the top n transcriptional profiles (default n = 5) that are closest to the in silico mean gene expression and used them for the PCA latent space projection.
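
The two steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, the nearest profiles are assumed to be drawn from the target background data using Euclidean distance, the number of retained principal components is assumed to be n - 1, and the projected signature is assumed to be mapped back to gene space. None of these details are specified in the abstract.

```python
import numpy as np

def drug_load(concentration, duration):
    """Drug load (DL) = drug concentration x treatment duration."""
    return np.asarray(concentration, dtype=float) * np.asarray(duration, dtype=float)

def dl_mean_signature(expr, loads, target_load):
    """Step 1 (DL mean): average the source profiles measured at target_load.

    expr has shape (n_samples, n_genes); loads has shape (n_samples,).
    """
    loads = np.asarray(loads, dtype=float)
    mask = np.isclose(loads, target_load)
    if not mask.any():
        raise ValueError("no source profile is available at this drug load")
    return np.asarray(expr, dtype=float)[mask].mean(axis=0)

def regression_signature(expr, loads, target_load):
    """Step 1 (Regression): per-gene OLS of expression on drug load."""
    loads = np.asarray(loads, dtype=float)
    design = np.column_stack([np.ones_like(loads), loads])  # intercept, slope
    coef, *_ = np.linalg.lstsq(design, np.asarray(expr, dtype=float), rcond=None)
    return coef[0] + coef[1] * target_load  # one predicted value per gene

def project_onto_target(signature, background, n_neighbors=5, n_components=None):
    """Step 2: PCA on the n background profiles nearest to the signature,
    then project the signature onto that subspace and map it back.

    Assumptions (not from the abstract): Euclidean distance, and
    n_components defaulting to n_neighbors - 1.
    """
    signature = np.asarray(signature, dtype=float)
    background = np.asarray(background, dtype=float)
    dist = np.linalg.norm(background - signature, axis=1)
    nearest = background[np.argsort(dist)[:n_neighbors]]
    center = nearest.mean(axis=0)
    # Rows of vt are the principal axes of the centered neighbor profiles.
    _, _, vt = np.linalg.svd(nearest - center, full_matrices=False)
    k = n_components or max(1, n_neighbors - 1)
    basis = vt[:k]
    scores = (signature - center) @ basis.T
    return center + scores @ basis

# Synthetic data only; real inputs would be L1000 profiles and target background data.
rng = np.random.default_rng(0)
loads = drug_load([1.0, 1.0, 10.0, 10.0], [6.0, 24.0, 6.0, 24.0])
source = rng.normal(size=(4, 50))       # 4 source samples x 50 genes
background = rng.normal(size=(30, 50))  # 30 target background profiles
step1 = regression_signature(source, loads, target_load=60.0)
imputed = project_onto_target(step1, background, n_neighbors=5)
```

In this sketch the regression variant can be evaluated at any drug load, including loads absent from the source data, whereas the DL mean variant raises an error when no source sample exists at the requested load.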

We evaluated the performance of our proposed methods using 9 source cell lines from the CMap dataset (MCF7, VCAP, A375, A549, PC3, HA1E, HT29, HCC515 and HEPG2) to impute drug transcriptional signatures of the target cell line NPC (nasopharyngeal carcinoma cells), also from the CMap dataset. We focused on the top 100 drugs with the most data (i.e., number of samples) in the source cell lines above. Among these 100 drugs, 18 (a total of 32 drug load samples) were also found in the samples available for the target cell line; we excluded these samples from the background dataset and used them instead as test data for computing the accuracy of our imputation. For performance evaluation, we computed the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR). The performance metrics were computed by comparing the up- and down-regulated genes between the in silico and test data, using a threshold of 1 sigma (standard deviation) to define the up- and down-regulated genes. Finally, we compared our strategy against an imputation method called tensor-train weighted optimization (TT-WOPT) (Iwata et al., 2019). The formulation of our imputation problem in Figure 1A does not permit applying the majority of existing imputation strategies, as they commonly require an overlap in the data matrices.
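
The evaluation described above can be illustrated with a short Python sketch. It rests on several assumptions that the abstract does not specify: the 1 sigma threshold is applied to the mean and standard deviation taken across genes within the test profile, the imputed expression values (negated for down-regulated genes) serve as the ranking scores, and AUPR is estimated by average precision. The function names are hypothetical.

```python
import numpy as np

def regulated_labels(profile, direction="up", n_sigma=1.0):
    """Label genes lying more than n_sigma standard deviations above
    (direction="up") or below (direction="down") the profile mean."""
    profile = np.asarray(profile, dtype=float)
    mu, sd = profile.mean(), profile.std()
    if direction == "up":
        return profile > mu + n_sigma * sd
    return profile < mu - n_sigma * sd

def auroc(labels, scores):
    """AUROC as the probability that a positive gene outscores a negative
    gene, counting ties as one half."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("both positive and negative genes are required")
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def average_precision(labels, scores):
    """Average precision, used here as a step-wise estimate of AUPR.
    Tied scores are kept in their input order."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    if not labels.any():
        raise ValueError("at least one positive gene is required")
    order = np.argsort(-scores, kind="stable")
    hits = labels[order]
    precision = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return precision[hits].mean()

def evaluate(imputed, observed, direction="up"):
    """Return (AUROC, AUPR estimate) for one imputed and observed profile pair."""
    imputed = np.asarray(imputed, dtype=float)
    labels = regulated_labels(observed, direction)
    scores = imputed if direction == "up" else -imputed
    return auroc(labels, scores), average_precision(labels, scores)

# Synthetic example: a noisy copy of the observed profile stands in for an imputation.
rng = np.random.default_rng(0)
observed = rng.normal(size=978)
noisy_imputation = observed + rng.normal(scale=1.0, size=978)
up_auroc, up_aupr = evaluate(noisy_imputation, observed, direction="up")
```

With this labeling, a random predictor has an expected AUROC of 0.5 and an expected AUPR close to the fraction of genes labeled as regulated, which is the baseline the reported values should be read against.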

The violin plots in Figure 1C show the AUROC and AUPR of our proposed strategies and TT-WOPT. In comparison to a random predictor, our methods have AUPRs and AUROCs that are significantly higher (p-value < 10⁻⁴). Comparing the results of the first step (DL mean and Regression) with those from the second step (DL mean + PCA and Regression + PCA) indicates a significant improvement in imputation accuracy conferred by the PCA projection (average AUPR/AUROC from 1st step = 0.383/0.681 vs. average AUPR/AUROC from 2nd step = 0.535/0.780; p-value < 10⁻⁴). The DL mean and Regression methods showed comparable accuracy, but the Regression method is able to impute in silico data for drug load points that are not part of the source dataset. On the other hand, TT-WOPT did not perform significantly better than a random predictor. In fact, for many drugs and drug loads, TT-WOPT had lower accuracy than a random predictor.

In summary, the proposed DL mean and Regression methods, combined with PCA projection onto the latent space of the target cell line, represent a promising approach for drug transcriptome imputation for cells and/or tissues that are not easily accessible and, more generally, for cross-tissue gene expression prediction. The improvement in performance brought by the latent space projection motivates future work using advanced deep learning algorithms that can handle possible nonlinearity of the latent space. Moreover, tissue-specific gene regulatory networks offer additional information, especially when background data for the target cell line are scarce.

References

DiMasi JA, Grabowski HG, Hansen RW. (2016) Innovation in the pharmaceutical industry: New estimates of R&D costs. J Health Econ. 47: 20-33.

Iwata M, Yuan L, Zhao Q, Tabei Y, Berenger F, Sawada R, Akiyoshi S, Hamano M, Yamanishi Y. (2019) Predicting drug-induced transcriptome responses of a wide range of human cell lines by a novel tensor-train decomposition algorithm. Bioinformatics 35(14): i191–i199.

Pushpakom S, Iorio F, Eyers PA, Escott KJ, Hopper S, Wells A, Doig A, Guilliams T, Latimer J, McNamee C, Norris A, Sanseau P, Cavalla D, Pirmohamed M. (2019) Drug repurposing: progress, challenges and recommendations. Nat Rev Drug Discov. 18(1): 41-58.

Subramanian A, Narayan R, Corsello SM, Peck DD, Natoli TE, Lu X, Gould J, Davis JF, Tubelli AA, Asiedu JK, Lahr DL, Hirschman JE, Liu Z, Donahue M, Julian B, Khan M, Wadden D, Smith IC, Lam D, Liberzon A, Toder C, Bagul M, Orzechowski M, Enache OM, Piccioni F, Johnson SA, Lyons NJ, Berger AH, Shamji AF, Brooks AN, Vrcic A, Flynn C, Rosains J, Takeda DY, Hu R, Davison D, Lamb J, Ardlie K, Hogstrom L, Greenside P, Gray NS, Clemons PA, Silver S, Wu X, Zhao WN, Read-Button W, Wu X, Haggarty SJ, Ronco LV, Boehm JS, Schreiber SL, Doench JG, Bittker JA, Root DE, Wong B, Golub TR. (2017) A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell 171(6): 1437-1452.e17.

Figure 1. A. Overview of the in silico drug signature imputation. B. Schematic diagram of in silico drug signature imputation. The DL mean method uses the average of the gene expression values from the source cell lines for each drug at a given drug load. The Regression method uses Ordinary Least Squares (OLS) to produce a linear regression model for the in silico gene expression using all expression data for the drug from the source cell lines. In step 2, we applied Principal Component Analysis (PCA) using the top n nearest neighbors to define the gene expression latent space of the target cell line, and projected the in silico gene expression from the first step onto the latent space. C. Performance evaluation for the NPC cell line. Left (right) column: violin plots of the AUPRs and AUROCs for the upregulated (downregulated) genes.