(140d) Applying Data Science Techniques to Solubility Data for Synthetic Compounds: An Expedited End-to-End Workflow from Data Collection to Crystallization Process Design | AIChE


Huggins, S., Amgen Inc.
Crystallization is an important process often employed several times within synthetic routes for drug substances. The development of robust crystallization processes can be separated into two tasks: (1) the selection of a solvent system, and (2) the design of a process within that solvent system that affords materials with desired chemical/physical purity and meets yield requirements. An end-to-end workflow relying on data science techniques was developed for capturing, visualizing, and interpreting solubility data to facilitate the consistent and rapid execution of these tasks.

This workflow starts with the collection of solubility data into templated tables using standardized equipment sets and approaches. These tables contextualize the solubility data by joining each measured value (concentration, temperature, composition, etc.) with relevant metadata (solute purity, X-ray diffraction results, equipment, date, etc.). The contextualized solubility data is ingested into a database, providing a single source for all solubility data. A templated visualization then consumes the data from this source. The data can be further filtered within the visualization as necessary (e.g., limited to a specific solute and to data collected for a specific lot of that solute).
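The joining and filtering steps described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the workflow's actual implementation: the record fields, the `sample_id` join key, the function names, and all example values are assumptions made for the sketch.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class Measurement:
    """One measured solubility value (hypothetical field names)."""
    sample_id: str
    solvent: str
    temperature_c: float
    concentration_mg_per_ml: float


@dataclass(frozen=True)
class Metadata:
    """Context for a measurement (hypothetical field names)."""
    sample_id: str
    solute: str
    lot: str
    solute_purity_pct: float
    xrd_form: str
    equipment: str
    measured_on: date


def contextualize(measurements, metadata):
    """Join each measured value with its metadata record on sample_id."""
    by_id = {m.sample_id: m for m in metadata}
    rows = []
    for meas in measurements:
        meta = by_id.get(meas.sample_id)
        if meta is None:
            continue  # assumption: values without context are not ingested
        rows.append({**asdict(meas), **asdict(meta)})
    return rows


def filter_rows(rows, **criteria):
    """Keep only rows whose fields equal every supplied criterion."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]


# Illustrative values only; none of these come from the source.
measurements = [
    Measurement("S1", "ethanol", 20.0, 12.5),
    Measurement("S2", "ethanol", 50.0, 48.0),
    Measurement("S3", "heptane", 20.0, 0.4),
    Measurement("S4", "ethanol", 20.0, 11.9),  # no metadata record
]
metadata = [
    Metadata("S1", "compound-A", "LOT-001", 99.1, "Form I", "rig-1", date(2020, 1, 15)),
    Metadata("S2", "compound-A", "LOT-001", 99.1, "Form I", "rig-1", date(2020, 1, 16)),
    Metadata("S3", "compound-A", "LOT-002", 98.4, "Form I", "rig-2", date(2020, 2, 3)),
]

rows = contextualize(measurements, metadata)
lot_rows = filter_rows(rows, solute="compound-A", lot="LOT-001")
```

Keeping each row as a flat record with both the measured value and its context is what makes later filtering by solute or lot a simple equality check.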

To facilitate solvent selection, the first task in crystallization process development, this visualization automatically applies a decision tree to the collected solubility data to classify solvents with regard to crystallization as likely: “good for a thermal process”, “solvent within antisolvent driven crystallization”, or “good antisolvents”. Based on this classification, the scientist working with the system can either begin the task of process design or apply a predictive solubility model, integrated within the visualization, to determine other solvent systems that may be worth investigating. The application of this model uses simple R scripts, with open-source libraries (e.g., non-linear optimization packages) doing the “heavy lifting”.
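A decision tree of this kind might look like the sketch below. The abstract does not give the actual splitting rules, so the inputs (solubility at a low and a high reference temperature), the three thresholds, and the extra "unclassified" outcome are all assumptions chosen only to illustrate the idea.

```python
def classify_solvent(sol_low_t, sol_high_t,
                     min_soluble=50.0, max_antisolvent=5.0, min_ratio=3.0):
    """Classify a solvent from solubility (mg/mL) at two temperatures.

    All thresholds are hypothetical defaults, not values from the source.
    """
    if sol_high_t < max_antisolvent:
        # Solute barely dissolves even when hot.
        return "good antisolvent"
    if sol_high_t >= min_soluble:
        if sol_high_t >= min_ratio * sol_low_t:
            # Dissolves well hot and much less cold: cooling can drive yield.
            return "good for a thermal process"
        # Dissolves well at both temperatures: needs an antisolvent.
        return "solvent within antisolvent driven crystallization"
    # Assumption: intermediate cases are left for the scientist to judge.
    return "unclassified"


# Illustrative inputs only.
labels = {
    "steep curve": classify_solvent(10.0, 80.0),
    "flat, high": classify_solvent(60.0, 90.0),
    "flat, low": classify_solvent(0.2, 1.0),
    "middling": classify_solvent(8.0, 20.0),
}
```

Comparing `sol_high_t` against `min_ratio * sol_low_t` rather than dividing avoids a division by zero when the cold solubility is recorded as zero.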

Once a solvent system has been selected, the task of process design begins with an automated script that fits and selects the best model for the solubility data across ranges of temperature/antisolvent ratios within that system. The contextualization of the solubility data allows for visual identification and rapid exclusion of outliers within the model fitting step. Once a solubility model has been selected and fit for a given system, a constrained optimization algorithm is applied to determine the process that affords the highest yield given user-supplied constraints (which are modified to ensure the process meets desired chemical/physical purity needs). These initial crystallization process conditions are then attempted and refined as necessary.
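As a rough illustration of the fit-then-optimize sequence, the Python sketch below fits a single van't Hoff-type model, ln(S) = a + b/T, by ordinary least squares and then uses a brute-force grid search in place of the constrained optimization algorithm, which the abstract does not specify. Model selection among several candidates is omitted. The model form, the constraint set (temperature bounds and a maximum dissolved concentration), the function names, and the synthetic data are all assumptions.

```python
import math


def fit_vant_hoff(temps_c, sols):
    """Least-squares fit of ln(S) = a + b / T, with T in kelvin."""
    xs = [1.0 / (t + 273.15) for t in temps_c]
    ys = [math.log(s) for s in sols]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b


def solubility(a, b, temp_c):
    """Predicted solubility at temp_c from the fitted parameters."""
    return math.exp(a + b / (temp_c + 273.15))


def best_cooling_process(a, b, t_min_c, t_max_c, max_conc, step_c=1.0):
    """Grid search for the cooling process with the highest theoretical yield.

    Constraints (hypothetical): both temperatures lie in [t_min_c, t_max_c]
    and the starting concentration may not exceed max_conc.
    Returns (yield_fraction, t_start_c, t_end_c), or None if infeasible.
    """
    n_steps = int(round((t_max_c - t_min_c) / step_c))
    temps = [t_min_c + i * step_c for i in range(n_steps + 1)]
    best = None
    for t_start in temps:
        c0 = solubility(a, b, t_start)  # saturated solution at the start
        if c0 > max_conc:
            continue
        for t_end in temps:
            if t_end >= t_start:
                continue
            y = 1.0 - solubility(a, b, t_end) / c0
            if best is None or y > best[0]:
                best = (y, t_start, t_end)
    return best


# Synthetic points generated from assumed parameters, not measured data.
A_TRUE, B_TRUE = 20.0, -5000.0
temps_c = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
sols = [solubility(A_TRUE, B_TRUE, t) for t in temps_c]

a_fit, b_fit = fit_vant_hoff(temps_c, sols)
best = best_cooling_process(a_fit, b_fit, t_min_c=0.0, t_max_c=60.0, max_conc=100.0)
```

A grid search is slow compared with a gradient-based solver, but it makes the effect of each constraint easy to inspect: here the concentration limit, not the upper temperature bound, is what caps the starting temperature.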

This end-to-end workflow has resulted in significant time savings and has allowed consistent expectations to be set for initial designs across projects. Further, it is an early demonstration of the integration of modelling and data science techniques within process development.