(255a) Comparison of Machine Learning Approaches for Process Model Development from Big Data

Authors: 
Davis, S., Auburn University
Cremaschi, S., Auburn University
Eden, M. R., Auburn University
Over the last decade, the development of advanced sensing technologies, coupled with powerful computing tools and modern automation infrastructure, has tremendously increased the amount and variety of data collected in laboratory and industrial settings. For example, basic process control systems (BPCS), which are designed to keep the process variables within their normal operating ranges, contain databases with longitudinal information for all process variables along with all recorded alarms. Massive amounts of data describing the physical and chemical properties of process variables are recorded. To extract useful and relevant information from these data, data analytics and machine learning techniques are gradually penetrating the process and pharmaceutical industries. They are used to improve process understanding, to develop surrogate process models, to diagnose faults, and to learn about the health and stability of the overall process (e.g., [1-4]).

Surrogate process models, also known as metamodels or emulators, are in this context approximate models constructed to statistically relate a set of input process variables to a set of output process variables. A number of machine learning techniques can be used to construct surrogate models, such as Extreme Learning Machines (ELMs) [5], Artificial Neural Networks (ANNs) [6], and Automated Learning of Algebraic Models using Optimization (ALAMO) [7], but little work has been done to systematically compare their ability to learn the response of complicated models with different characteristics, such as those generated by chemical and pharmaceutical processes.

This study compares eight surrogate-model construction approaches using computational experiments. The construction approaches considered are ANNs, ALAMO, Radial Basis Networks (RBNs) [8], ELMs, Gaussian Process Regression (GPR) [9], Random Forests (RFs) [10], Support Vector Regression (SVR) [11], and Multivariate Adaptive Regression Splines (MARS) [12]. Each approach is used to construct surrogate models for predicting the outputs of thirty-four test functions, which can be found in the Virtual Library of Simulation Experiments (https://www.sfu.ca/~ssurjano/optimization.html) and which have various shapes and numbers of inputs. The input-output data employed for training the surrogate models are generated using Latin Hypercube, Sobol, and Halton sampling methods.
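The data-generation step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the study's actual setup: the function names, the sample size of 50, the random seed, and the choice of the Branin function (one of the functions listed in the Virtual Library of Simulation Experiments) are all assumptions made for the example. Sobol sampling is left out because it relies on tabulated direction numbers; a library implementation such as SciPy's `scipy.stats.qmc.Sobol` would normally be used for it.

```python
import math
import random


def latin_hypercube(n, dim, rng):
    """Return n points in [0, 1)^dim with one point per stratum in each dimension."""
    columns = []
    for _ in range(dim):
        # One uniform draw inside each of the n equal-width strata, then shuffle
        # so that strata are paired at random across dimensions.
        col = [(i + rng.random()) / n for i in range(n)]
        rng.shuffle(col)
        columns.append(col)
    return [tuple(col[i] for col in columns) for i in range(n)]


def radical_inverse(index, base):
    """Mirror the base-`base` digits of `index` about the radix point."""
    result, frac = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * frac
        frac /= base
    return result


def halton(n, dim, primes=(2, 3, 5, 7, 11, 13)):
    """Return the first n Halton points (indices 1..n), one prime base per dimension."""
    return [tuple(radical_inverse(i, primes[d]) for d in range(dim))
            for i in range(1, n + 1)]


def scale(points, lower, upper):
    """Map unit-cube points onto the box defined by lower and upper bounds."""
    return [tuple(lo + u * (hi - lo) for u, lo, hi in zip(p, lower, upper))
            for p in points]


def branin(x1, x2):
    """Branin test function with its standard parameter values."""
    a, b, c = 1.0, 5.1 / (4 * math.pi ** 2), 5 / math.pi
    r, s, t = 6.0, 10.0, 1 / (8 * math.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1 - t) * math.cos(x1) + s


# Assumed settings for illustration only.
rng = random.Random(0)
lower, upper = (-5.0, 0.0), (10.0, 15.0)  # usual Branin input domain
n_samples = 50

lhs_inputs = scale(latin_hypercube(n_samples, 2, rng), lower, upper)
halton_inputs = scale(halton(n_samples, 2), lower, upper)

# Input-output pairs that a surrogate model would be trained on.
lhs_outputs = [branin(*x) for x in lhs_inputs]
halton_outputs = [branin(*x) for x in halton_inputs]
```

Each sampling method yields a different set of training inputs over the same domain, which is what makes it possible to ask whether the choice of sampling method affects surrogate accuracy.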

The performance of the surrogate models for each test function was compared using maximum absolute error (MAE) and root mean squared error (RMSE). The results revealed that, at large sample sizes, the sampling method applied to generate the training data set did not have a statistically significant impact on the performance measures. However, when the results were examined in groups constructed based on the number of inputs and the shape of the test functions, the surrogate models constructed using ANN, ALAMO, and ELM yielded smaller MAE and RMSE than the other surrogate-model construction approaches. It is also worth noting that the models constructed using ALAMO had consistently simpler functional forms than the ANN and ELM models.
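The two performance measures named above can be computed as in the short sketch below. Note that MAE here denotes the maximum absolute error, as defined in this abstract, not the mean absolute error that the same acronym often stands for. The function names and the sample values are hypothetical and serve only to illustrate the calculation.

```python
import math


def max_absolute_error(y_true, y_pred):
    """Largest absolute deviation between true and predicted outputs."""
    return max(abs(t - p) for t, p in zip(y_true, y_pred))


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared deviation."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


# Made-up values for illustration: test-function outputs and surrogate predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.0, 4.5]

mae = max_absolute_error(y_true, y_pred)        # worst-case error
rmse = root_mean_squared_error(y_true, y_pred)  # average-case error
```

The maximum absolute error captures the worst-case deviation of a surrogate over the evaluation points, while RMSE summarizes its typical deviation, so the two measures can rank surrogates differently.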

1. Machin, M., L. Liesum, and A. Peinado, Implementation of modeling approaches in the QbD framework: examples from the Novartis experience. Eur Pharm Rev, 2011. 16: p. 39-42.

2. Kirdar, A.O., et al., Application of near-infrared (NIR) spectroscopy for screening of raw materials used in the cell culture medium for the production of a recombinant therapeutic protein. Biotechnology Progress, 2010. 26(2): p. 527-531.

3. Kirdar, A.O., et al., Application of Multivariate Analysis toward Biotech Processes: Case Study of a Cell-Culture Unit Operation. Biotechnology Progress, 2007. 23(1): p. 61-67.

4. Kirdar, A.O., K.D. Green, and A.S. Rathore, Application of Multivariate Data Analysis for Identification and Successful Resolution of a Root Cause for a Bioprocessing Application. Biotechnology Progress, 2008. 24(3): p. 720-726.

5. Huang, G.-B., D.H. Wang, and Y. Lan, Extreme learning machines: a survey. International Journal of Machine Learning and Cybernetics, 2011. 2(2): p. 107-122.

6. Haykin, S., Neural Networks and Learning Machines. 3rd ed. 2009: Prentice Hall.

7. Cozad, A., N.V. Sahinidis, and D.C. Miller, Learning surrogate models for simulation-based optimization. AIChE Journal, 2014. 60(6): p. 2211-2227.

8. Park, J. and I.W. Sandberg, Universal approximation using radial-basis-function networks. Neural computation, 1991. 3(2): p. 246-257.

9. Rasmussen, C.E., Gaussian processes for machine learning. 2006.

10. Breiman, L., Random forests. Machine learning, 2001. 45(1): p. 5-32.

11. Basak, D., S. Pal, and D.C. Patranabis, Support vector regression. Neural Information Processing-Letters and Reviews, 2007. 11(10): p. 203-224.

12. Friedman, J.H., Multivariate adaptive regression splines. The annals of statistics, 1991: p. 1-67.