(233ar) Virtual High-Throughput Screening Pipeline: Limitation Characterization and Application to Different Targets

Chen, J. J., University of Notre Dame
Visco, D. P. Jr., University of Akron
Drug leads comprise a small portion of all known compounds. In an effort to identify promising leads, thousands to hundreds of thousands of compounds are experimentally tested for biological activity in high-throughput screening (HTS) experiments. However, typical hit rates are only a few percent at best (Dobson, 2004). To increase efficiency, researchers are using previously obtained data and computational techniques to identify probable leads on which to focus resources.

Previously, we developed an iterative virtual HTS (vHTS) pipeline. Using previous experimental data as input, the pipeline develops quantitative structure-activity relationship (QSAR) models capable of classifying a compound's class (active or inactive) and predicting its quantitative activity (Chen and Visco, 2016). Applying these models to a compound database, we can screen for probable leads. We tested our pipeline using NCBI's PubChem Bioassay 825 dataset, targeting a receptor implicated in viral diseases. We experimentally confirmed predictions from two iterations, with a hit rate of 19% in the first iteration and 75% in the second.

To expand on our work, we will examine three key parameters to determine the robustness and applicability of our pipeline. Two parameters are related to the experimental datasets used to train our models: size and distribution. Size refers to the number of entries in the dataset, while distribution refers to the classification of each entry and the number of members in each class. Together, these two parameters dictate the amount and quality of information available for training our models. Determining limitations for these two parameters will further direct experimental efforts towards fruitful pursuits as well as increase prediction confidence. The third parameter examined is pipeline iteration. By iterating, the training set can be updated with data from the validation experiments. However, additional iterations yield diminishing returns. Identifying the iteration at which returns begin to diminish will increase efficiency.
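The three parameters can be made concrete with a minimal Python sketch. It is an illustration only and is not part of the pipeline described here: the function names, the "active"/"inactive" label strings, and the `min_gain` threshold are all assumptions introduced for this example. The first function reports dataset size and class distribution; the second is one hypothetical way to flag diminishing returns, by comparing hit rates from consecutive iterations.

```python
from collections import Counter


def summarize_dataset(labels):
    """Return (size, class fractions) for a list of class labels.

    Assumes each entry is labeled 'active' or 'inactive' (hypothetical
    label names chosen for this sketch).
    """
    size = len(labels)
    counts = Counter(labels)
    fractions = {cls: n / size for cls, n in counts.items()} if size else {}
    return size, fractions


def should_stop(hit_rates, min_gain=0.05):
    """Hypothetical stopping rule for pipeline iteration.

    hit_rates holds the experimentally confirmed hit rate of each
    iteration, in order. Iteration stops when the latest iteration
    improved on the previous one by less than min_gain (an assumed,
    arbitrary threshold).
    """
    if len(hit_rates) < 2:
        return False  # not enough iterations to judge a trend
    return (hit_rates[-1] - hit_rates[-2]) < min_gain


# Hit rates reported for the two Bioassay 825 iterations: 19% and 75%.
print(should_stop([0.19, 0.75]))
```

With the two reported hit rates, the gain between iterations is far above the assumed threshold, so this rule would not stop after the second iteration. A real criterion would likely also weigh the experimental cost of each iteration.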

In this poster, we will present work towards determining the aforementioned limitations. From PubChem's Bioassay database, we have identified datasets that are useful for determining the effect of each parameter. While controlling for two parameters, we examine the effect of the third by applying our pipeline to these datasets and experimentally validating the predictions. Based on the observed changes in prediction accuracy and efficiency, we gain a clearer idea of what limitations exist and what effect each parameter has. We present the results of our work and aim to show that our success is repeatable and that our pipeline is robust enough to handle different datasets targeting different receptors, while also investigating its limitations.
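One way to hold one dataset parameter fixed while varying another is to draw subsets of a chosen size and class distribution from a larger dataset. The sketch below illustrates this under stated assumptions; it is not the procedure used in this work, which selects existing PubChem datasets. The `subsample` function, the `(compound_id, label)` record layout, and the label strings are hypothetical.

```python
import random


def subsample(records, size, active_fraction, seed=0):
    """Draw a subset with a fixed size and a fixed fraction of actives.

    records: list of (compound_id, label) tuples, where label is
    'active' or 'inactive' (assumed layout for this sketch).
    """
    rng = random.Random(seed)  # seeded so the draw is reproducible
    actives = [r for r in records if r[1] == "active"]
    inactives = [r for r in records if r[1] == "inactive"]

    n_active = round(size * active_fraction)
    n_inactive = size - n_active
    if n_active > len(actives) or n_inactive > len(inactives):
        raise ValueError("not enough compounds in one class for this subset")

    # Sample each class separately, then shuffle so order carries no signal.
    subset = rng.sample(actives, n_active) + rng.sample(inactives, n_inactive)
    rng.shuffle(subset)
    return subset


# Toy data: 20 actives and 80 inactives.
records = [(i, "active") for i in range(20)] + [(i, "inactive") for i in range(20, 100)]

# Same size, different distributions: isolates the effect of distribution.
balanced = subsample(records, size=40, active_fraction=0.5)
skewed = subsample(records, size=40, active_fraction=0.25)
print(len(balanced), len(skewed))
```

Holding `size` constant while changing `active_fraction` isolates the distribution effect; holding `active_fraction` constant while changing `size` isolates the size effect.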

Chen, J.J.F., Visco, D.P. Jr., 2016. Developing an in silico pipeline for faster drug candidate discovery: Virtual high throughput screening with the Signature molecular descriptor using support vector machine models. Chemical Engineering Science, 2 March 2016. http://dx.doi.org/10.1016/j.ces.2016.02.037.

Dobson, C.M., 2004. Chemical space and biology. Nature 432, 824-828.