(346bf) Combining Strategic Training Data Selection and Feature Engineering to Reach Accurate and Efficient Molecular Property Prediction
AIChE Annual Meeting
2020
2020 Virtual AIChE Annual Meeting
Computational Molecular Science and Engineering Forum
Poster Session: Computational Molecular Science and Engineering Forum (CoMSEF)
Wednesday, November 18, 2020 - 8:00am to 9:00am
Organic molecular design problems, such as drug discovery or materials design, aim to identify molecules with desired properties in chemical space, where the number of potential compounds is estimated to reach 10^60. The size of chemical space makes it infeasible to evaluate every molecule by experiment or high-level quantum chemistry. In recent decades, the integration of machine learning methods with virtual screening has made the exploration of chemical space practical, thanks to its high efficiency and low cost. While many machine learning models reach high accuracy with hundreds of thousands of training molecules, only a handful of studies have focused on optimizing model performance under a tight computational budget. In this work, we propose a strategy for obtaining accurate machine learning predictions with a minimal number of training data points. Specifically, we address the problem in three parts. First, we demonstrate the efficacy of a method that adaptively builds a compact training set by systematically balancing exploitation, via experimental design, against exploration of the space, via cheminformatics-based diversity maximization procedures. Second, we extend this procedure with nonlinear and locally linear dimensionality reduction methods to leverage data embeddings. Third, we focus on improving model accuracy under the constraint of a small training set, which we achieve by progressively incorporating nonlinearity into our modified group additivity approach.
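As an illustration of the exploration step only, the sketch below implements greedy MaxMin selection, a common diversity-maximization heuristic. The abstract does not state which diversity procedure was used, so the choice of MaxMin, the function name `maxmin_select`, the Euclidean distance, and the random 2-D descriptors are all assumptions made for this example. In a cheminformatics setting the distance would more typically be a Tanimoto distance on molecular fingerprints.

```python
import math
import random


def maxmin_select(points, n_select, seed_index=0):
    """Greedy MaxMin: repeatedly add the candidate farthest from the chosen set.

    `points` is a sequence of equal-length numeric tuples (hypothetical
    molecular descriptors); returns the indices of the selected points.
    """
    if not 0 < n_select <= len(points):
        raise ValueError("n_select must be between 1 and len(points)")
    selected = [seed_index]
    chosen = {seed_index}
    # Distance from every candidate to its nearest already-selected point.
    nearest = [math.dist(p, points[seed_index]) for p in points]
    while len(selected) < n_select:
        # Among unselected candidates, pick the one whose nearest
        # selected neighbour is farthest away.
        candidates = (i for i in range(len(points)) if i not in chosen)
        nxt = max(candidates, key=lambda i: nearest[i])
        selected.append(nxt)
        chosen.add(nxt)
        # Update nearest-selected distances with the newly added point.
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return selected


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed stand-in for a descriptor matrix: 200 random 2-D points.
    descriptors = [(rng.random(), rng.random()) for _ in range(200)]
    print(maxmin_select(descriptors, 10))
```

In an adaptive scheme of the kind described above, picks like these would be interleaved with points chosen by an experimental design criterion. How the two are balanced is not specified in the abstract and is not modeled here.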