(161i) Machine Learning for Molecular Property Predictions and the Software Ecosystem That Enables It | AIChE

Hachmann, J. - Presenter, University at Buffalo, SUNY
The process of creating new chemistry and materials is increasingly driven by computational modeling and simulation, which allow us to characterize compounds of interest before pursuing them in the laboratory. However, traditional physics-based approaches (such as first-principles quantum chemistry) tend to be computationally demanding, which can make them impractical for the large-scale screening studies needed to explore the vastness of chemical space efficiently.

In this presentation, we will show how we employ machine learning to develop data-derived prediction models that are alternatives to physics-based models, and how we utilize them in massive-scale hyperscreening studies at a fraction of the cost. Aside from conducting such data-driven discovery, we also employ data mining techniques to develop an understanding of the hidden structure-property relationships that define the behavior of molecules, materials, and reactions. These insights form our foundation for the rational design and inverse engineering of novel compounds with tailored properties.

We will also discuss progress on our software ecosystem for data-driven in silico research, which supports both applications and method development. It consists of four loosely connected program suites: ChemLG is a generator for compound and material candidate libraries that allows us to enumerate chemical space (i.e., data definition); ChemHTPS provides an automated platform for the virtual high-throughput screening of these libraries (i.e., data generation); ChemBDDB offers a database and data model template for the massive information volumes created by data-intensive projects (i.e., data storage); and ChemML is a machine learning and informatics toolbox for the validation, analysis, mining, and modeling of such data sets.
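The four stages named above (data definition, data generation, data storage, and data modeling) can be illustrated with a toy sketch. It does not use ChemLG, ChemHTPS, ChemBDDB, or ChemML, and it makes no claim about their interfaces. The substituent list, the one-number descriptor, the stand-in "reference" property, and all function names are hypothetical and were invented purely for illustration. It uses only the Python standard library.

```python
import itertools
import sqlite3

# Hypothetical one-number descriptor per substituent (invented for illustration).
SUBSTITUENT_DESCRIPTOR = {"H": 0.0, "F": 1.0, "CH3": 0.5, "OCH3": 0.8}


def enumerate_library(substituents, n_sites=3):
    """Data definition: enumerate every unordered substituent pattern."""
    return list(itertools.combinations_with_replacement(substituents, n_sites))


def descriptor(candidate):
    """Sum the hypothetical per-substituent descriptor values."""
    return sum(SUBSTITUENT_DESCRIPTOR[s] for s in candidate)


def reference_property(candidate):
    """Data generation: stand-in for a costly physics-based calculation."""
    d = descriptor(candidate)
    return 2.0 * d + 0.1 * d * d


def fit_line(xs, ys):
    """Data modeling: closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def run_pipeline():
    library = enumerate_library(list(SUBSTITUENT_DESCRIPTOR))

    # Run the "expensive" calculation on a small training subset only.
    training = library[::4]
    reference = {c: reference_property(c) for c in training}

    # Fit the cheap surrogate on the training subset.
    slope, intercept = fit_line(
        [descriptor(c) for c in training],
        [reference[c] for c in training],
    )

    # Data storage: keep descriptors, reference values, and predictions together.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE candidates ("
        "name TEXT PRIMARY KEY, descriptor REAL, reference REAL, predicted REAL)"
    )
    for c in library:
        d = descriptor(c)
        conn.execute(
            "INSERT INTO candidates VALUES (?, ?, ?, ?)",
            ("-".join(c), d, reference.get(c), slope * d + intercept),
        )
    conn.commit()
    return library, training, conn


library, training, conn = run_pipeline()

# Screen the whole library with the surrogate and list the top-ranked candidates.
top = conn.execute(
    "SELECT name, predicted FROM candidates ORDER BY predicted DESC LIMIT 3"
).fetchall()
for name, predicted in top:
    print(name, round(predicted, 3))
```

In this sketch only a quarter of the library is ever passed to the stand-in reference calculation; the remaining candidates are ranked from the fitted line alone, which is the cost-saving idea behind surrogate-based screening. A real study would replace the single descriptor and linear fit with proper molecular representations and validated models.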

The idea of applying modern data science to chemistry is so recent that much of the basic infrastructure has not yet been developed or is still in its infancy. The existing tools and expertise tend to be in-house, specialized, or otherwise unavailable to the community at large, so data science remains in practice beyond the reach of most researchers in the field. By contributing this open, general-purpose, comprehensive, and easy-to-use software ecosystem, we aim to chart new paths in this area, fill the prevalent infrastructure gap, and make data-driven research a viable and widely accessible proposition for the community.