(2cu) Computer-Aided Molecular Design: Combining Knowledge-Based and Data-Driven Approaches | AIChE

Authors 

Lee, Y. S. - Presenter, Imperial College London
Research Interests

The discovery of new molecules, such as solvents, polymers, catalysts and pharmaceutical products, is vital to achieving high performance, including greater efficiency, favourable process economics, and low environmental impact. In this context, Computer-Aided Molecular Design (CAMD) has been put forward as a powerful and systematic technique that can accelerate the identification of new candidate molecules. Despite the significant advances made in the field of CAMD, challenges remain in 1) handling the complexities arising from large, mixed-integer nonlinear quantitative structure-property relationship (QSPR) models; 2) considering the interdependency between molecular properties and overall process performance; 3) accounting for the uncertainty and predictive accuracy of the QSPR models; and 4) ensuring the synthesizability of the newly identified molecules. My research interests centre on addressing these challenges, with the aim of accelerating the identification of promising molecular candidates.

Keywords: computer-aided molecular design, integrated molecular and process design, mixed-integer nonlinear programming, data-driven approaches, deep learning in molecular design.

Integrated molecular and process design: handling numerical challenges

In many cases, the overall performance of the molecule of interest can only be assessed by evaluating the molecular properties within the process context. However, the formulation of integrated molecular and process design problems, referred to as computer-aided molecular and process design (CAMPD), often results in significant numerical complexity. This is mainly because 1) the relationship between the process and molecular property models is highly nonlinear, so it is usually prohibitively expensive to solve such models simultaneously; and 2) the integrated molecule-process model is characterised by a large number of infeasible regions in the search space, as no feasible solution can be generated for particular molecular structures. Developing a robust CAMPD algorithm that avoids infeasibilities during the exploration of a large design space is therefore of great importance. During my PhD, as a part of the ROLINCAP project (funded by EPSRC), I developed a robust CAMPD algorithm that can simultaneously optimise molecular and process variables without oversimplifying the original formulation of the problem. A significant improvement in convergence behaviour has been achieved by incorporating tailored feasibility tests that automatically detect infeasible process conditions and solvent properties. The efficiency of the proposed algorithm has been demonstrated by applying it to the design of CO2 chemical absorption processes, where it converged successfully in all 150 runs carried out. The algorithm can be readily extended to other types of CAMPD problems, which is likely to accelerate the discovery of new processing materials.
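The general idea of running cheap feasibility tests before an expensive process evaluation can be sketched as below. This is a minimal illustration, not the algorithm developed in the project: the `Solvent` properties, the operating window, the viscosity limit and the `process_cost` formula are all hypothetical placeholders chosen only to make the example runnable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solvent:
    """Candidate solvent with hypothetical, illustrative property values."""
    name: str
    melting_point_K: float
    boiling_point_K: float
    viscosity_mPas: float


# Hypothetical operating window of the process (illustrative numbers only).
T_MIN_K, T_MAX_K = 298.15, 393.15
MAX_VISCOSITY_MPAS = 50.0


def feasibility_tests(s):
    """Return a list describing every cheap test the candidate fails."""
    failed = []
    if s.melting_point_K >= T_MIN_K:
        failed.append("solid at the lowest process temperature")
    if s.boiling_point_K <= T_MAX_K:
        failed.append("boils below the highest process temperature")
    if s.viscosity_mPas > MAX_VISCOSITY_MPAS:
        failed.append("viscosity above the assumed limit")
    return failed


def process_cost(s):
    """Stand-in for an expensive process simulation (toy formula, not a real model)."""
    return 100.0 + 0.5 * s.viscosity_mPas + 0.05 * (s.boiling_point_K - T_MAX_K)


def screen_and_evaluate(candidates):
    """Evaluate the process model only for candidates that pass every test."""
    evaluated, rejected = {}, {}
    for s in candidates:
        failed = feasibility_tests(s)
        if failed:
            rejected[s.name] = failed  # the expensive model is never called
        else:
            evaluated[s.name] = process_cost(s)
    return evaluated, rejected


candidates = [
    Solvent("A", 250.0, 440.0, 20.0),
    Solvent("B", 310.0, 500.0, 15.0),  # melting point above T_MIN_K
    Solvent("C", 260.0, 380.0, 5.0),   # boiling point below T_MAX_K
    Solvent("D", 240.0, 450.0, 80.0),  # viscosity above the limit
    Solvent("E", 255.0, 420.0, 10.0),
]
evaluated, rejected = screen_and_evaluate(candidates)
best = min(evaluated, key=evaluated.get)
print("evaluated:", evaluated)
print("rejected:", rejected)
print("lowest-cost feasible candidate:", best)
```

In a real CAMPD setting the tests would be embedded in the optimisation algorithm rather than applied to a fixed list, but the design choice is the same: reject structures that cannot yield a feasible process before the full model is solved.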

Automated molecular design: reacting solvent design and accounting for uncertainty

In any CAM(P)D framework, one of the essential elements is the development of thermodynamic methodologies that can predict the fluid-phase behaviour and physical properties of (unknown) molecules. The successful application of CAM(P)D techniques is therefore often limited by the predictive accuracy of the physical property models. Yet the development of knowledge-based models involves the simulation of physics-based models or experimental studies, which can be prohibitively expensive. Furthermore, many of the QSPR models widely used in CAM(P)D are subject to significant uncertainty, because only a small set of property data is used to correlate molecular features with properties, while a very large chemical space is explored. In this context, machine-learning (ML) techniques offer an alternative route to computationally inexpensive and accurate property prediction, and they can be integrated into molecular design frameworks for the discovery of novel drugs or materials. My postdoctoral research, as a part of PharmaSEL-Prosperity work package 1: Drug substance synthesis (funded by Eli Lilly), has focused on the development of data-driven CAMD frameworks in the presence of uncertainty, with the aim of identifying promising solvents for a given chemical reaction (e.g., a classic SN2 reaction, the Menshutkin reaction). Various types of deep generative models, such as variational autoencoders, generative adversarial networks and graph neural networks, combined with string-based and/or graph-based molecular representations, have been introduced, as well as hybrids of ML and deterministic models. Given the sparsity of the kinetic data, Bayesian optimisation has been performed in conjunction with Gaussian process regression and deep neural network models, with new data points obtained from quantum mechanical (QM) calculations at each iteration. The performance of each algorithmic option has been systematically compared with the traditional QM-CAMD method to provide initial guidance on the applicability and reliability of the approaches.
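A Bayesian optimisation loop of the kind described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the solvent is reduced to a single hypothetical descriptor in [0, 1], `qm_rate` is a toy stand-in for an expensive QM calculation, and the surrogate is a zero-mean Gaussian process with an RBF kernel and an expected-improvement acquisition function, written from scratch with the standard library. It does not reproduce the framework used in the research.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel with unit variance (assumed length scale)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def gp_posterior(X, y, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k = [rbf(a, x_new) for a in X]
    alpha = solve(K, y)
    v = solve(K, k)
    mean = sum(ki * ai for ki, ai in zip(k, alpha))
    var = max(rbf(x_new, x_new) - sum(ki * vi for ki, vi in zip(k, v)), 1e-12)
    return mean, math.sqrt(var)


def expected_improvement(mean, std, best):
    """Expected improvement over the best observed value (maximisation)."""
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best) * cdf + std * pdf


def qm_rate(x):
    """Hypothetical stand-in for a QM-computed log rate constant."""
    return -(x - 0.7) ** 2


def bayes_opt(candidates, X_init, n_iter):
    """Evaluate one new candidate per iteration, chosen by maximum EI."""
    X = list(X_init)
    y = [qm_rate(x) for x in X]
    for _ in range(n_iter):
        y_mean = sum(y) / len(y)
        yc = [v - y_mean for v in y]  # centre the data for the zero-mean GP
        pool = [c for c in candidates if c not in X]
        scores = [expected_improvement(*gp_posterior(X, yc, c), max(yc))
                  for c in pool]
        x_next = pool[scores.index(max(scores))]
        X.append(x_next)
        y.append(qm_rate(x_next))  # the "expensive" evaluation
    i_best = y.index(max(y))
    return X[i_best], y[i_best], X


candidates = [i / 10 for i in range(11)]
best_x, best_y, X_all = bayes_opt(candidates, [0.0, 0.5, 1.0], n_iter=5)
print("best descriptor:", best_x, "value:", best_y, "evaluated:", X_all)
```

The sketch refits the surrogate from scratch for every candidate, which is acceptable for a handful of points but would normally be replaced by a library implementation with a single factorisation per iteration.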

Teaching interests

As an active member of the Process Systems Engineering community, I have developed my teaching and leadership skills by supervising research projects in diverse areas and delivering lectures for both undergraduate (UG) students and industrial audiences.

During my academic studies at Imperial College London, I have successfully supervised undergraduate and MSc research projects (10 MEng students, 2 MSc students). The research topics include integrated working fluid and Organic Rankine Cycle (ORC) processes, optimal mixture design of working fluids and ORC processes, superstructure design of ORC processes, solvent design for CO2 chemical absorption processes, and data-driven surfactant design, all of which require a broad understanding of the numerical modelling of chemical processes, the thermodynamics and phase behaviour of fluids, and mathematical optimisation methods.

As a research supervisor, I have helped students acquire the essential skills required to conduct research independently. I have also learned how to motivate students and encourage them to clarify their ideas by asking probing questions.

Drawing on five years of working experience in the oil and gas industry, in combination with my academic background, I have prepared or delivered lectures as a part of the graduate teaching assistant programme in the following modules: Dynamic behaviour of process systems (4th year UG, teaching assistant), Thermodynamics (2nd year UG, teaching assistant), Strategy of process design (3rd year UG, teaching assistant), ASPEN tutorial (3rd year UG, lecturer), Designing molecular systems for sustainability (MSc, lecturer), Column simulation & sizing (MSc/PhD/industrial partners, lecturer), Advanced Refinery Process Design (industrial partners, lecturer).

As a teaching assistant/lecturer, I have demonstrated my ability to create a positive learning environment and to explain complex and abstract concepts to audiences from multi-disciplinary backgrounds. As a result, I have received positive feedback from students in the Student Online Evaluation surveys.

In addition to the courses mentioned above, I would like to develop a course that better integrates new developments in deterministic optimisation approaches and ML techniques, as applied to practical chemical engineering problems. In such a course, I would like to offer students an opportunity to learn how to combine the theory behind these optimisation methods with hands-on programming skills, which are critical to success in industrial and research projects.