(38b) Dealing with Small Data in Biopharmaceutical Batch Process Monitoring: A Machine-Learning Approach

Tulsyan, A., Amgen Inc.
Garvin, C., Amgen Inc.
Undey, C., Amgen Inc.
In recent years, biotechnology-based products have gained increasing visibility and success in treating chronic diseases such as arthritis, diabetes, and cancer. According to a report published in 2005 [1], approximately 27% of the new medicines in active development are biotechnology-based products. Another report, published in 2013 by the United States Food and Drug Administration (US FDA), states that a growing share of all pharmaceutical industry research, now over 40%, is devoted to biopharmaceuticals rather than classical drugs.

To efficiently monitor and control biopharmaceutical processes, multivariate statistical techniques are commonly deployed for batch process monitoring (BPM). A BPM framework uses multivariate statistical models, such as principal component analysis (PCA) and partial least squares (PLS), to capture the common-cause variations in a batch [2]. Control charts (e.g., Hotelling's T2 and the squared prediction error (SPE) statistics) with associated control limits are then used to determine whether a new batch exhibits normal operating behavior. An alarm is raised if a batch is statistically different from normal operation.
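To make the monitoring scheme concrete, the following minimal sketch fits a PCA model to synthetic "normal operation" data and computes the two statistics named above for a new observation. The data, the number of retained components, and the statistic formulas shown are generic textbook choices, not the authors' specific implementation; control limits (which in practice come from F- and chi-squared-type approximations) are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for unfolded normal-operation batch data:
# 50 batches (rows) by 10 process variables (columns).
X = rng.normal(size=(50, 10))

# Mean-center and scale each variable, as is standard before PCA.
mu, sigma = X.mean(axis=0), X.std(axis=0)
Xs = (X - mu) / sigma

# PCA via SVD; retain k principal components.
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
k = 3
P = Vt[:k].T                            # loadings (10 x k)
lam = (S[:k] ** 2) / (Xs.shape[0] - 1)  # variance captured by each component

def monitor(x_new):
    """Return (T2, SPE) for one new observation vector."""
    xs = (x_new - mu) / sigma
    t = xs @ P                     # scores in the k-dimensional latent space
    T2 = np.sum(t ** 2 / lam)      # Hotelling's T2: distance within the model
    resid = xs - t @ P.T           # portion of xs the PCA model cannot explain
    SPE = np.sum(resid ** 2)       # squared prediction error (Q statistic)
    return T2, SPE

T2, SPE = monitor(rng.normal(size=10))
```

In a deployed BPM framework, `monitor` would be evaluated against control limits estimated from the normal-operation batches, and an alarm raised when either statistic exceeds its limit.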

Despite over two decades of research in BPM, the biopharmaceutical BPM framework suffers from a unique challenge: the Low-N problem (or small-data problem). The Low-N problem represents a scenario in which a product has a limited production history, denoted here by N. It is common for companies to have only one or two runs of a new drug product at a manufacturing facility. A new product may require only a limited number of runs to meet clinical or early commercial demand, but this creates a Low-N scenario for the product. In terms of BPM, a Low-N scenario poses several challenges. First, under Low-N it is nontrivial to capture the common-cause variations in their entirety. Second, the predictive capabilities of PCA and PLS models are less accurate under a Low-N scenario. Further, under Low-N, model over-fitting becomes much harder to avoid and the effects of outliers are much more pronounced.

The Low-N problem is a longstanding, industry-wide problem in biopharmaceutical manufacturing that challenges the theoretical foundations and practical applicability of the existing BPM platform. We propose an approach that transitions from a Low-N scenario to a Large-N scenario by generating an arbitrarily large number of in silico batch data sets. The proposed method combines hardware exploitation with algorithm development. To this effect, we propose a block-learning method for a Bayesian non-parametric model of a batch process, and then use probabilistic programming to generate an arbitrarily large number of dynamic in silico campaign data sets. The proposed solution not only alleviates the monitoring issues associated with a Low-N scenario but is also compatible with the industrial BPM framework. To the best of the authors' knowledge, this is the first method that describes a systematic approach to addressing the small-data problem using the tools of big data. The efficacy of the proposed solution is demonstrated on an industrial biopharmaceutical process.
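The general idea of expanding a Low-N campaign into a Large-N one can be sketched as follows: fit a probabilistic model of the batch trajectory to the few available runs, then sample from that model to produce as many in silico batches as desired. The sketch below uses a simple Gaussian-process-style trajectory model (empirical mean plus a squared-exponential covariance) purely as a stand-in; the abstract's actual Bayesian non-parametric model, block-learning method, and probabilistic-programming tooling are not specified here, and the data, length-scale, and batch count are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A Low-N scenario: only N = 3 historical batches, each a trajectory of
# T = 40 samples of one process variable (synthetic data for illustration).
T = 40
t = np.linspace(0.0, 1.0, T)
true_profile = np.sin(2 * np.pi * t) + 2.0
batches = true_profile + 0.1 * rng.normal(size=(3, T))

# Fit a simple probabilistic trajectory model: empirical mean profile plus a
# smooth squared-exponential covariance whose magnitude is estimated from the
# batch-to-batch residual variance. Length-scale 0.1 is an assumed value.
mean = batches.mean(axis=0)
resid_var = batches.var(axis=0).mean()
dist = np.abs(t[:, None] - t[None, :])
K = resid_var * np.exp(-(dist ** 2) / (2 * 0.1 ** 2)) + 1e-8 * np.eye(T)

# Generate an arbitrarily large in silico campaign by sampling trajectories
# from the fitted model; these can then feed a standard PCA/PLS-based BPM model.
n_insilico = 500
in_silico = rng.multivariate_normal(mean, K, size=n_insilico)
```

Once generated, the in silico campaign plays the role of a Large-N training set, so the downstream monitoring workflow (model fitting, control limits, T2/SPE charts) proceeds unchanged, which is the compatibility with the industrial BPM framework claimed above.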


[1] G. Hamilton, "The Biotechnology Market Outlook: Growth Opportunities and Effective Strategies for Licensing and Collaboration," Dublin: Research and Markets, 2005.

[2] P. Nomikos and J. F. MacGregor, "Monitoring batch processes using multiway principal component analysis," AIChE Journal, vol. 40, no. 8, pp. 1361–1375, 1994.