(192j) Generating Molecules with Optimized Solubility Using Iterative Graph Translation

Authors 

Mukhopadhyay, S., The Dow Chemical Company
Emerson, J., The Dow Chemical Company
Xu, H., The Dow Chemical Company
Jin, W., Massachusetts Institute of Technology
Barzilay, R., Massachusetts Institute of Technology
Jensen, K. F., Massachusetts Institute of Technology

Molecular discovery is key to solving many problems in chemical engineering, ranging from identifying cures for infectious diseases to developing technologies that address climate change. Although the number of theoretically accessible molecules is vast, the time and resource costs of experiments make exhaustive exploration of chemical space intractable. To address this, we present a generative modeling framework capable of producing novel molecules that are optimized with respect to multiple objectives or constraints. Our approach trains a Hierarchical Graph Neural Network (HGNN) to translate a given molecule into an improved one, using training pairs of “less optimal” and “more optimal” molecules. Once trained, the model is used to translate every molecule in the dataset into an improved counterpart. These improved molecules are then added to the training data and used to retrain the translation model. In this way, we iteratively train and translate in order to 1) build a robust translator that can be applied to a given set of candidate molecules and 2) discover new, highly optimized molecules for a given application.
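
To make the loop concrete, a minimal sketch of the outer train-and-translate procedure is given below. The make_model, fit, translate, and score names are illustrative placeholders rather than the authors' implementation; the sketch only assumes that the translator can be retrained on pairs of (worse, better) molecules and that the target property can be scored.

# Illustrative sketch of the iterative train/translate loop (Python).
# make_model() is assumed to return an object with fit(pairs) and
# translate(smiles) methods; these names are placeholders for illustration only.
def iterative_translation(make_model, pairs, candidates, score, n_rounds=3):
    """pairs:      list of (less_optimal, more_optimal) SMILES training pairs
    candidates: list of SMILES strings to improve
    score:      callable mapping a SMILES string to the property of interest"""
    model = None
    for _ in range(n_rounds):
        model = make_model()
        model.fit(pairs)                      # train translator on current pairs
        new_pairs = []
        for smiles in candidates:
            improved = model.translate(smiles)
            # Keep a translation only if it genuinely improves the property.
            if improved is not None and score(improved) > score(smiles):
                new_pairs.append((smiles, improved))
        pairs = pairs + new_pairs             # augment training data, then retrain
    return model, pairs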

In this talk, we will present our work applying this framework to the specific problem of designing molecules to optimize aqueous solubility. Aqueous solubility is an important property for many chemical applications ranging from drug design to climate prediction. We successfully trained a translator to improve aqueous solubility and found that the model was also capable of discovering molecules that were more soluble than any of the training examples. When we applied synthetic feasibility as a secondary optimization constraint, the resulting model generated synthetically feasible molecules 93.2% of the time. Additionally, we investigated the role that training dataset size plays in model performance and found that reasonable models could be trained on datasets containing only 10²–10³ molecules. This workflow serves as a general approach for generating molecules that are both optimized and synthetically feasible. These promising results have led us to explore how this framework can be applied to solve a variety of molecular design problems including designing novel dyes, enantioselective catalysts, and soluble drugs.
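
As one example of the kind of property oracle such a workflow optimizes against, the sketch below estimates aqueous solubility from RDKit descriptors using the published ESOL (Delaney, 2004) linear model. The abstract does not state which solubility estimator was used in this work, so this is an illustrative stand-in rather than the authors' oracle; any scalar score of this form could serve as the score callable in the loop sketched earlier.

# Illustrative aqueous-solubility score (Python, RDKit).
# Uses the ESOL (Delaney) equation as a stand-in oracle; the actual solubility
# model used in this work is not specified in the abstract.
from rdkit import Chem
from rdkit.Chem import Descriptors

def esol_log_solubility(smiles):
    """Estimated log10 aqueous solubility (mol/L) via the ESOL equation."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    logp = Descriptors.MolLogP(mol)
    mw = Descriptors.MolWt(mol)
    rb = Descriptors.NumRotatableBonds(mol)
    heavy = mol.GetNumHeavyAtoms()
    aromatic = sum(atom.GetIsAromatic() for atom in mol.GetAtoms())
    ap = aromatic / heavy if heavy else 0.0   # aromatic proportion
    # ESOL: logS = 0.16 - 0.63*cLogP - 0.0062*MW + 0.066*RB - 0.74*AP
    return 0.16 - 0.63 * logp - 0.0062 * mw + 0.066 * rb - 0.74 * ap

Because the score returns a single number (higher meaning more soluble), it can be used both to select (worse, better) training pairs and to verify that translated molecules genuinely improve on their inputs.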