(449g) A Systematic Procedure for Designing Training Data for Molecular Property Prediction
AIChE Annual Meeting
2018
Engineering Sciences and Fundamentals
Data-Driven Screening of Chemical and Materials Space
Wednesday, October 31, 2018 - 9:30am to 9:45am
Organic material design requires thoroughly exploring chemical compound space to obtain the desired property information. The astronomical size of molecular space, however, makes it impossible to evaluate every molecule by experiment or by quantum chemistry; consequently, data-driven semi-empirical models are required to calculate molecular properties rapidly. To this end, the focus of this talk is two-fold. First, we show that the group contribution approach can be generalized using a combination of cheminformatics-based path fingerprints and sparse modeling techniques to derive an ab initio data-driven model that accurately estimates heats of atomization of small organic molecules, reaching mean absolute errors of 1.59 kcal/mol and 2.61 kcal/mol on the QM7 and QM9 datasets, respectively, excluding molecules with fused rings. Second, we show that modern experimental design tools and cheminformatics-based subset selection techniques can be combined to systematically minimize the amount of data needed to train a model. Specifically, we show that, given a set of molecules for which data-driven models are sought, a carefully chosen subset of as little as 2-10% of the original dataset is sufficient to train a model that is as accurate as one trained on more than 80% of the data.
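As a rough illustration of the two ideas above, the following Python sketch counts linear atom paths in a molecular graph (a simple stand-in for a path fingerprint) and then picks a diverse training subset by greedy max-min selection on a count-based Tanimoto distance. This is an assumption-laden sketch, not the procedure presented in the talk: the function names, the heavy-atom graph encoding, the maximum path length, and the max-min heuristic are all hypothetical choices made for illustration, and the abstract does not specify which fingerprint or experimental design method was actually used.

```python
from collections import Counter


def path_fingerprint(atoms, bonds, max_len=3):
    """Count simple linear paths of up to max_len atoms, keyed by element sequence.

    atoms: list of element symbols (heavy atoms only, an assumption of this sketch)
    bonds: list of (i, j) index pairs into atoms
    """
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    counts = Counter()

    def walk(path):
        labels = tuple(atoms[i] for i in path)
        # A path and its reverse describe the same fragment: use one canonical key.
        counts[min(labels, labels[::-1])] += 1
        if len(path) < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:  # simple paths only, no revisiting atoms
                    walk(path + [nxt])

    for start in adj:
        walk([start])
    # Every path with two or more atoms is visited once from each end: halve it.
    return Counter({k: (v if len(k) == 1 else v // 2) for k, v in counts.items()})


def tanimoto_distance(fp_a, fp_b):
    """1 - (sum of min counts / sum of max counts) over all path features."""
    keys = set(fp_a) | set(fp_b)
    shared = sum(min(fp_a[k], fp_b[k]) for k in keys)
    total = sum(max(fp_a[k], fp_b[k]) for k in keys)
    return 1.0 - shared / total if total else 0.0


def maxmin_subset(fingerprints, n_select, seed_index=0):
    """Greedy max-min selection: repeatedly add the molecule farthest from the picks."""
    selected = [seed_index]
    while len(selected) < min(n_select, len(fingerprints)):
        candidates = [i for i in range(len(fingerprints)) if i not in selected]
        best = max(
            candidates,
            key=lambda i: min(
                tanimoto_distance(fingerprints[i], fingerprints[j]) for j in selected
            ),
        )
        selected.append(best)
    return selected


# Toy heavy-atom graphs (hypothetical inputs): ethanol, propane, dimethyl ether.
molecules = [
    (["C", "C", "O"], [(0, 1), (1, 2)]),
    (["C", "C", "C"], [(0, 1), (1, 2)]),
    (["C", "O", "C"], [(0, 1), (1, 2)]),
]
fingerprints = [path_fingerprint(atoms, bonds) for atoms, bonds in molecules]
chosen = maxmin_subset(fingerprints, n_select=2)
print(chosen)
```

In a fuller workflow, the path counts would form the columns of a feature matrix for a sparse linear regression (for example, an L1-penalized fit), so that each retained path acts like a learned group contribution, and only the selected subset would need reference heats of atomization.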