(299e) Solubility Data Mining and Predictive Modeling: AI+ChE | AIChE


Albrecht, J. - Presenter, Bristol-Myers Squibb
Qiu, J., Bristol-Myers Squibb Co.
In the synthesis of pharmaceutical drug substances, selecting the optimal solvent system as early as possible is critical to developing efficient, commercially viable processes. Process safety, environmental impact, and yield are driven in large part by the choice of solvent system. Yet early in process development, material availability limits the amount of experimental data that can be collected. Models that predict solubility enable more focused experimentation, but current approaches involve trade-offs among accuracy, solute characterization, and accessibility for process development engineers.

Historical analysis of experimental data using the tools of data science has the potential to unlock insights and streamline future development work. To apply these tools to solubility prediction, data were mined from an internal catalog of automated solubility screening reports spanning dozens of projects. In all, more than 64,000 solubility measurements for more than 700 pharmaceutically relevant organic compounds in pure and mixed solvents at various temperatures have been aggregated from these reports and analyzed. Such a data set allows rapid testing of hypotheses related to solvent selection, such as correlations and synergies between solvent pairs and temperature effects.
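
As a rough illustration of the kind of hypothesis test such a data set supports, the sketch below computes the Pearson correlation of log solubility between two solvents across the compounds measured in both. The record layout, the compound and solvent names, the function names, and every numerical value are hypothetical placeholders chosen for illustration; none of them come from the screening reports.

```python
import math
from collections import defaultdict

# Hypothetical records: (compound, solvent, temperature in deg C, solubility in mg/mL).
# All values are made-up placeholders, not measured data.
RECORDS = [
    ("cmpd_A", "methanol", 25, 12.0),
    ("cmpd_A", "ethanol", 25, 8.0),
    ("cmpd_B", "methanol", 25, 40.0),
    ("cmpd_B", "ethanol", 25, 22.0),
    ("cmpd_C", "methanol", 25, 3.0),
    ("cmpd_C", "ethanol", 25, 2.5),
    ("cmpd_D", "methanol", 25, 90.0),  # no ethanol measurement, so it is skipped
]


def log_solubility_by_solvent(records, temperature):
    """Map solvent -> {compound: log10(solubility)} at a single temperature."""
    table = defaultdict(dict)
    for compound, solvent, temp, solubility in records:
        if temp == temperature and solubility > 0:
            table[solvent][compound] = math.log10(solubility)
    return table


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (spread_x * spread_y)


def solvent_pair_correlation(records, solvent_a, solvent_b, temperature=25):
    """Correlate log solubility in two solvents over their shared compounds."""
    table = log_solubility_by_solvent(records, temperature)
    shared = sorted(set(table[solvent_a]) & set(table[solvent_b]))
    if len(shared) < 3:
        raise ValueError("need at least three shared compounds")
    xs = [table[solvent_a][c] for c in shared]
    ys = [table[solvent_b][c] for c in shared]
    return pearson(xs, ys), len(shared)


r, n_shared = solvent_pair_correlation(RECORDS, "methanol", "ethanol")
print(f"r = {r:.3f} over {n_shared} shared compounds")
```

Working in log units is a design choice made here, since solubilities often span several orders of magnitude; restricting the comparison to a single temperature keeps temperature effects from confounding the solvent-pair comparison.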

Access to this large data set enables machine learning approaches to create quantitative structure-property relationship (QSPR) models for solubility prediction that improve on the current standard approaches. Deployed web applications can allow any researcher to rapidly calculate solubility predictions by providing only a molecular structure and a single benchmark solubility measurement. This presentation aims to show how the recently available combination of algorithms (random forests, support vector machines, neural networks, etc.), data-oriented programming languages (R, Python), and cloud computing capabilities can enable chemical engineers to better utilize their data for pharmaceutical process development.
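
The abstract does not specify the model inputs or architecture, so the following is only a minimal sketch of how a prediction might combine structure-derived descriptors with a single benchmark measurement. It assumes the benchmark is used as an anchor: the model predicts the offset between log solubility in a target solvent and in the benchmark solvent. A plain k-nearest-neighbor regression stands in for the random forests, support vector machines, and neural networks named above, purely to keep the example dependency-free. The descriptor vectors, training values, and function name are hypothetical.

```python
import math

# Hypothetical training rows:
# (descriptor vector, benchmark log10 solubility, target-solvent log10 solubility).
# All numbers are made-up placeholders, not measured data.
TRAINING = [
    ((0.2, 1.0), 1.10, 0.80),
    ((0.3, 1.2), 1.50, 1.25),
    ((0.9, 3.0), 0.40, 0.90),
    ((1.0, 2.8), 0.70, 1.15),
]


def predict_log_solubility(descriptors, benchmark_log_s, training, k=2):
    """Predict target-solvent log10 solubility for a new compound.

    The k training compounds closest in descriptor space supply an average
    offset (target minus benchmark), which is added to the new compound's
    own benchmark measurement.
    """
    if not 1 <= k <= len(training):
        raise ValueError("k must be between 1 and the number of training rows")
    nearest = sorted(training, key=lambda row: math.dist(row[0], descriptors))[:k]
    mean_offset = sum(target - bench for _, bench, target in nearest) / k
    return benchmark_log_s + mean_offset


# Usage: descriptors computed from the structure plus one measured benchmark.
prediction = predict_log_solubility((0.25, 1.1), 1.0, TRAINING)
print(f"predicted log10 solubility: {prediction:.3f}")
```

In practice the descriptors would be computed from the molecular structure by a cheminformatics toolkit, and the nearest-neighbor step would be replaced by whichever learning algorithm performs best on held-out data.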