(411h) Using Molecular Subgraph Libraries for High-Throughput Screening and the Inverse Design Problem

Austin, N. - Presenter, Carnegie Mellon University
An important set of molecular descriptors in cheminformatics and chemical engineering is the set of molecular subgraphs, sometimes referred to as signature descriptors [1] and used to construct Morgan/circular fingerprints [2]. Each subgraph has an arbitrary height, K: it contains atomic information about a central atom and about all atoms within K bonds of that central atom. In this work, we detail several applications of subgraph representations of molecular structures in our computational chemistry software, AMS2019, as well as in the general context of high-throughput screening and inverse design.
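To make the height-K definition concrete, the following is a minimal sketch in Python and is not the AMS2019 implementation. It assumes a molecule is given as a list of element symbols plus an adjacency dictionary over heavy atoms; the names `subgraph_atoms` and `subgraph_key` are hypothetical. The key it builds is a deliberately simplified, shell-by-shell element listing that ignores bond orders and connectivity within a shell, so it is not a canonical signature in the sense of [1].

```python
from collections import deque


def subgraph_atoms(adjacency, center, height):
    """Map each atom within `height` bonds of `center` to its bond distance."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        if dist[atom] == height:
            continue  # do not expand past the requested height
        for nbr in adjacency[atom]:
            if nbr not in dist:
                dist[nbr] = dist[atom] + 1
                queue.append(nbr)
    return dist


def subgraph_key(elements, adjacency, center, height):
    """Simplified (non-canonical) key: central element, then the sorted
    element symbols found in each successive shell around it."""
    dist = subgraph_atoms(adjacency, center, height)
    parts = [elements[center]]
    for d in range(1, height + 1):
        shell = sorted(elements[a] for a, k in dist.items() if k == d)
        parts.append("".join(shell))
    return "|".join(parts)


# Illustrative input: ethanol heavy atoms, C0-C1-O2 (hydrogens omitted).
elements = ["C", "C", "O"]
adjacency = {0: [1], 1: [0, 2], 2: [1]}

print(subgraph_key(elements, adjacency, 1, 1))  # C|CO
print(subgraph_key(elements, adjacency, 0, 2))  # C|C|O
```

A production descriptor would also encode bond types and the connectivity inside each shell, which this sketch leaves out for brevity.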

First, we discuss the “representability” of large datasets using subgraphs. Specifically, we parse the PubChem database [3] (~100 million compounds) into all possible subgraphs and then calculate the number of subgraphs required to fully represent 100%, 90%, 80%, etc. of the database using subgraphs of different (and mixed) heights. Next, we perform DFT calculations with ADF [4] on the 100,000 most common subgraphs from PubChem and use this library of molecular fragments to generate sigma-profiles for high-throughput screening with COSMO-RS [5], as well as to initialize geometries and thereby accelerate DFT calculations. Finally, we discuss the “inverse design” problem: finding one or more optimal molecular structures given structure and property constraints and a design objective. We address the inverse design problem using state-of-the-art Mixed-Integer (Non)Linear Programming (MILP/MINLP) techniques [6], exploiting problem structures inherent in subgraph representations of molecules. Specific applications to solvent, drug, and electronic materials design are discussed.
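The representability calculation can be sketched as follows. This is one plausible reading, not the procedure used in this work: it assumes each compound has already been reduced to a set of subgraph keys, ranks subgraphs by how many compounds contain them, and reports how long a prefix of that ranking is needed before a target fraction of compounds is fully represented. Ranking by frequency is a heuristic and is not guaranteed to give the smallest possible library. The function name `library_size_for_coverage` and the toy data are hypothetical.

```python
import math
from collections import Counter


def library_size_for_coverage(molecules, target):
    """Length of the frequency-ranked subgraph prefix needed so that at least
    `target` (0..1) of the molecules have all of their subgraphs in it.

    `molecules` is a list of non-empty sets of subgraph keys.
    """
    # Count, for every subgraph, the number of molecules that contain it.
    counts = Counter(key for mol in molecules for key in mol)
    ranked = sorted(counts, key=lambda k: (-counts[k], k))
    rank = {key: i for i, key in enumerate(ranked)}

    # A molecule is fully represented once its rarest subgraph is included.
    needed = sorted(max(rank[k] for k in mol) + 1 for mol in molecules)

    # Number of molecules that must be covered (guarding float round-off).
    must_cover = math.ceil(target * len(molecules) - 1e-9)
    return 0 if must_cover <= 0 else needed[must_cover - 1]


# Toy dataset: four "molecules" described by abstract subgraph keys.
toy = [{"a", "b"}, {"a"}, {"a", "c"}, {"b"}]
for target in (1.0, 0.75, 0.25):
    print(target, library_size_for_coverage(toy, target))
```

On the toy data, full coverage needs all three subgraphs, while 75% coverage needs only the two most common ones; at database scale, this gap is what a representability curve would show.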

[1] Faulon, Jean-Loup, Donald P. Visco, and Ramdas S. Pophale. "The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies." Journal of Chemical Information and Computer Sciences 43.3 (2003): 707-720.

[2] Rogers, David, and Mathew Hahn. "Extended-connectivity fingerprints." Journal of Chemical Information and Modeling 50.5 (2010): 742-754.

[3] PubChem Database. National Institutes of Health. https://pubchem.ncbi.nlm.nih.gov/

[4] ADF2018, SCM, Theoretical Chemistry, Vrije Universiteit, Amsterdam, The Netherlands, http://www.scm.com.

[5] Klamt, Andreas, Volker Jonas, Thorsten Bürger, and John C. W. Lohrenz. "Refinement and parametrization of COSMO-RS." The Journal of Physical Chemistry A 102.26 (1998): 5074-5085.

[6] Austin, Nick D., Nikolaos V. Sahinidis, and Daniel W. Trahan. "Computer-aided molecular design: An introduction and review of tools, applications, and solution techniques." Chemical Engineering Research and Design 116 (2016): 2-26.