(186e) Evaluating Polymer Stabilizer Performance Using Molecular Descriptors and Machine Learning on a Small Dataset

Liu, A. - Presenter, Georgia Tech
Venkatesh, R., Georgia Institute of Technology
McBride, M., Georgia Tech
Grover, M., Georgia Tech
Mining experimental data from the literature is an important exercise for informing future experimental studies, even when the available data is sparse. However, extracting insights from small materials datasets, such as those within research papers and patents, is inherently challenging due to high complexity, high dimensionality, and heterogeneous reporting across sources. This case study presents a situation in which judicious molecular representation, feature importance, and physicochemical interpretation were integrated to extract machine learning insights from a small dataset. Here, experimental data from a single patent was analyzed to learn which small-molecule additives were most effective in mitigating the degradation of poly(ethylene terephthalate) (PET). MACCS-166 and alvaDesc molecular descriptors were calculated for the dataset of 39 additive candidates, yielding two feature sets of 166 and 1875 descriptors, respectively. k-means clustering on these molecular descriptors indicated that performance differences were sensitive to variations in molecular structure. To pinpoint the features responsible for improved performance, a supervised reduced design region approach was applied to analyze descriptors both individually and in multiple dimensions, assessing how effectively they separated high and low performers in a binary classification. The most influential descriptors were justifiable with respect to degradation chemistry, and the selected features also trained random forest models with good cross-validated performance. In comparing the molecular descriptor approaches, we find that judicious interpretation of the underlying physicochemical behavior is indispensable in validating the effectiveness of small-data machine learning, especially for prioritizing experimental work toward a richer dataset.
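To make the idea of screening descriptors one at a time more concrete, the following is a minimal sketch of how individual binary descriptors (such as fingerprint bits) could be ranked by how well they separate high from low performers. It is not the reduced design region approach used in the study; it is a simpler stand-in based on balanced accuracy. The function names, the descriptor names (`bit_A`, `bit_B`, `bit_C`), and the eight-sample toy data are all hypothetical and are not taken from the patent dataset.

```python
from typing import Dict, List, Tuple


def balanced_accuracy(feature: List[int], labels: List[int]) -> float:
    """Balanced accuracy of the rule 'feature == 1 implies high performance'."""
    pos = sum(labels)              # number of high performers (label 1)
    neg = len(labels) - pos        # number of low performers (label 0)
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 1)
    tn = sum(1 for f, y in zip(feature, labels) if f == 0 and y == 0)
    return 0.5 * (tp / pos + tn / neg)


def rank_binary_descriptors(
    descriptors: Dict[str, List[int]], labels: List[int]
) -> List[Tuple[str, float]]:
    """Rank descriptors by their best single-feature balanced accuracy."""
    scores = []
    for name, column in descriptors.items():
        score = balanced_accuracy(column, labels)
        # A bit whose absence marks high performance is just as informative,
        # so keep the better of the two orientations.
        scores.append((name, max(score, 1.0 - score)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: 8 additives, 1 = high performance, 0 = low.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
descriptors = {
    "bit_A": [1, 1, 1, 1, 0, 0, 0, 0],  # present only in high performers
    "bit_B": [1, 0, 1, 0, 1, 0, 1, 0],  # unrelated to performance
    "bit_C": [0, 0, 0, 1, 1, 1, 1, 1],  # mostly present in low performers
}

ranking = rank_binary_descriptors(descriptors, labels)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With only a few dozen samples, a single-descriptor ranking like this can easily reward chance correlations, which is why the abstract stresses checking the top-ranked descriptors against known degradation chemistry before trusting them.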