(449g) A Systematic Procedure for Designing Training Data for Molecular Property Prediction | AIChE

Authors 

Li, B. - Presenter, Lehigh University
Rangarajan, S., Lehigh University - Dept of Chem & Biomolecular
Organic material design requires thorough exploration of chemical compound space to identify molecules with desired properties. The astronomical size of this space, however, makes it impossible to evaluate every molecule by experiment or quantum chemistry; consequently, data-driven semi-empirical models are needed to calculate molecular properties rapidly. To this end, the focus of this talk is two-fold. First, we show that the group contribution approach can be generalized, using a combination of cheminformatics-based path fingerprints and sparse modeling techniques, to derive an ab initio data-driven model that accurately estimates heats of atomization of small organic molecules. The model reaches mean absolute errors of 1.59 kcal/mol and 2.61 kcal/mol for the QM7 and QM9 datasets, respectively, excluding molecules with fused rings. Second, we show that modern experimental design tools and cheminformatics-based subset selection techniques can be combined to systematically minimize the amount of data needed to train a model. Specifically, we show that, given a set of molecules for which data-driven models are sought, a carefully chosen subset of as little as 2-10% of the original dataset is sufficient to train a model that is as accurate as one trained on more than 80% of the data.
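The two ingredients named above, path fingerprints and subset selection, can be illustrated with a minimal Python sketch. The abstract does not specify the fingerprint definition or the selection algorithm, so everything here is an assumption made for illustration: molecules are toy heavy-atom graphs, the fingerprint counts linear atom paths of up to three atoms, and the subset is chosen by a greedy MaxMin (farthest-point) rule on Tanimoto distance. The function names `path_fingerprint`, `tanimoto_distance`, and `maxmin_subset` are hypothetical, and the sparse regression step is not shown.

```python
from collections import Counter


def path_fingerprint(atoms, bonds, max_len=3):
    """Count linear atom paths of up to max_len atoms (illustrative definition).

    atoms: list of element symbols; bonds: list of (i, j) atom-index pairs.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    counts = Counter()

    def extend(path):
        forward = "-".join(atoms[k] for k in path)
        backward = "-".join(atoms[k] for k in reversed(path))
        # A path and its reverse are the same fragment: use one canonical label.
        counts[min(forward, backward)] += 1
        if len(path) < max_len:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    # Multi-atom paths are enumerated once from each end, so halve their counts.
    return {k: (v // 2 if "-" in k else v) for k, v in counts.items()}


def tanimoto_distance(fp_a, fp_b):
    """1 minus the count-based Tanimoto similarity of two fingerprints."""
    keys = set(fp_a) | set(fp_b)
    shared = sum(min(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    total = sum(max(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    return 1.0 - shared / total if total else 0.0


def maxmin_subset(fingerprints, n_select, seed_index=0):
    """Greedy MaxMin selection: repeatedly add the molecule that is farthest
    from its nearest already-selected molecule (assumed selection rule)."""
    n_select = min(n_select, len(fingerprints))
    selected = [seed_index]
    min_dist = [tanimoto_distance(fp, fingerprints[seed_index]) for fp in fingerprints]
    while len(selected) < n_select:
        candidates = [i for i in range(len(fingerprints)) if i not in selected]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, fp in enumerate(fingerprints):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fingerprints[nxt]))
    return selected


# Toy heavy-atom graphs (hydrogens omitted), for illustration only.
molecules = {
    "ethane": (["C", "C"], [(0, 1)]),
    "ethanol": (["C", "C", "O"], [(0, 1), (1, 2)]),
    "propane": (["C", "C", "C"], [(0, 1), (1, 2)]),
    "dimethyl ether": (["C", "O", "C"], [(0, 1), (1, 2)]),
    "methylamine": (["C", "N"], [(0, 1)]),
}
names = list(molecules)
fingerprints = [path_fingerprint(*molecules[name]) for name in names]
training_subset = [names[i] for i in maxmin_subset(fingerprints, n_select=3)]
```

In a workflow of the kind described, the fingerprint counts would serve as the feature matrix for a sparse linear model, which would then be fit only on the selected subset. The choice of seed molecule and distance measure here is arbitrary and would need to be tuned for a real dataset.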