(203j) Virtual High-Throughput Screening Pipeline: Size and Classification Distribution Effects on Experimentally Validated Hit-Rates

Chen, J. J., University of Notre Dame
Visco, D. P. Jr., University of Akron
Schmucker, L., The University of Akron
Drug leads are a small fraction of small organic compounds, which are themselves a small fraction of conceivable organic molecules. High-throughput screening (HTS) is used to screen many compounds in parallel (testing more compounds at the same time) and in small experimental working volumes (reducing the amount of reagents and biological substances used). However, HTS has a nominal hit-rate of a few percent at best (Dobson, 2004), so most resources are spent testing inactive compounds. To increase efficiency, computational methods are used to pre-screen compound libraries: effort is focused on molecules likely to be active, and molecules likely to be inactive are removed.

Previously, we developed a virtual HTS (vHTS) pipeline (Chen and Visco, 2016): models predicting class and activity are trained on available experimental data and applied to compound databases to identify compounds likely to be active, on which experimental efforts are then focused. The models can optionally be retrained to improve performance (hit-rate) or to identify more leads. The pipeline was applied to several datasets from NCBI’s PubChem BioAssay: AID 825 (target: Cathepsin L, 1st iteration hit-rate: 19%, 2nd iteration hit-rate: 75%) (Chen and Visco, 2016), AID 728 (target: Factor XIIa, 1st iteration hit-rate: 43%, 2nd iteration hit-rate: 100%) (Chen and Visco, in preparation), and AID 846 (target: Factor XIa, 1st iteration hit-rate: 27%, 2nd iteration hit-rate: 62%) (Chen and Visco, in preparation).
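As an illustration of how the reported hit-rates can be compared, the following minimal Python sketch computes a hit-rate from counts and the fold change between the first and second iterations. The function names `hit_rate` and `fold_change` are hypothetical and introduced only for this example; the percentages are the ones quoted above, and no baseline HTS hit-rate is assumed.

```python
def hit_rate(n_active, n_tested):
    """Fraction of experimentally tested compounds that were confirmed active."""
    if n_tested <= 0:
        raise ValueError("n_tested must be positive")
    return n_active / n_tested


def fold_change(rate, reference_rate):
    """Ratio of one hit-rate to a reference hit-rate (hypothetical helper)."""
    if reference_rate <= 0:
        raise ValueError("reference_rate must be positive")
    return rate / reference_rate


# Hit-rates (percent) quoted in the abstract: (1st iteration, 2nd iteration).
reported = {
    "AID 825 (Cathepsin L)": (19, 75),
    "AID 728 (Factor XIIa)": (43, 100),
    "AID 846 (Factor XIa)": (27, 62),
}

for name, (first, second) in reported.items():
    ratio = fold_change(second, first)
    print(f"{name}: {first}% -> {second}% ({ratio:.1f}x between iterations)")
```

The same `fold_change` helper could be used against a conventional HTS hit-rate for the same assay, if that value is known, to express enrichment over unguided screening.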

To determine the hit-rate enrichment ability of the pipeline, we have applied it to additional datasets, chosen specifically to examine how the pipeline responds to datasets of different sizes and classification distributions. Determining the effect of these two dataset parameters will indicate which datasets the pipeline enriches most and/or what enrichment can be expected for a given dataset.
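To make the two dataset parameters concrete, the sketch below summarizes a labeled dataset by its size and by its classification distribution, taken here to mean the fraction of compounds in each class. The label strings "active" and "inactive" and the function name `describe_dataset` are assumptions made for illustration, not part of the pipeline described here.

```python
from collections import Counter


def describe_dataset(labels):
    """Return the size and per-class fractions of a list of class labels."""
    size = len(labels)
    if size == 0:
        raise ValueError("dataset is empty")
    counts = Counter(labels)
    # Classification distribution: fraction of the dataset in each class.
    distribution = {label: count / size for label, count in counts.items()}
    return {"size": size, "distribution": distribution}


# Toy example with hypothetical labels: 1 active compound out of 4.
summary = describe_dataset(["active", "inactive", "inactive", "inactive"])
print(summary["size"], summary["distribution"])
```

Two datasets with the same size but different active fractions, or the same active fraction but different sizes, would then isolate one parameter at a time.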

In this poster, we present work characterizing the effects of dataset size and classification distribution. One parameter (size or classification distribution) is held fixed while the other is varied. To determine vHTS hit-rates, experimental validation follows the protocol specified by the original dataset. Because the hit-rates are determined with the same experimental protocol, they can be compared directly, which identifies the hit-rate enrichment ability of the pipeline for a given dataset. The results will give a clearer idea of which datasets the pipeline will have the most impact on and what hit-rate to expect for a given dataset. We also aim to show, indirectly, that the pipeline’s hit-rate enrichment ability is repeatable and that the pipeline is robust enough to handle many different kinds of datasets.

Chen, J.J.F., Visco, D.P. Jr., 2016. Developing an in silico pipeline for faster drug candidate discovery: Virtual high throughput screening with the Signature molecular descriptor using support vector machine models. Chemical Engineering Science, 2 March 2016. http://dx.doi.org/10.1016/j.ces.2016.02.037.

Dobson, C.M., 2004. Chemical space and biology. Nature 432, 824-828.

Chen, J.J.F., Visco, D.P. Jr. Identifying Novel Factor XIIa Inhibitors With PCA-GA-SVM Developed vHTS Models. Manuscript in preparation.

Chen, J.J.F., Visco, D.P. Jr. Identifying Novel Factor XIa Inhibitors With PCA-GA-SVM Developed vHTS Models. Manuscript in preparation.