(593f) Nonlinear Programming Based Methods for PCA Parameter Estimation Using Data with Missing Elements | AIChE

Authors 

López-Negrete de la Fuente, R. - Presenter, Carnegie Mellon University
Biegler, L. - Presenter, Carnegie Mellon University

Abstract

Many chemical and pharmaceutical products are manufactured in plants where a large number of sensors collect data for analysis. In such cases, where large amounts of possibly correlated data are available, Principal Component Analysis (PCA) and other multivariate methods are commonly used. PCA is a multivariate technique in which a large number of related variables is transformed into a possibly smaller number of uncorrelated variables [1].
Unfortunately, the data used for analysis can contain missing values. This can happen because sensors fail, because the experiments needed to obtain the data are prohibitively expensive, or because historical data have been lost [2]. In these situations, the model parameters are typically obtained with one of two methods: a modified version of the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm or the Expectation Maximization (EM) algorithm [3]. The modified NIPALS is the preferred method of the two because it is faster. However, there are some caveats associated with its use: the loadings obtained are not orthogonal, and the variance of the score vectors might be greater than the variance of the original data (after mean centering and autoscaling). To address these issues, the EM algorithm imputes values for the missing data. It solves a full-data problem using NIPALS at each iteration, updates the imputed values with the model predictions, and repeats until the change in the imputed values is less than a user-defined tolerance [4]. This method works well, but it is not very popular because it takes a long time to converge.
In the open literature, work has been done to analyze how the amount of missing data and its structure affect PCA models built with different methods [5,6]. Once a model has been constructed, it can also be used for prediction and monitoring. Several papers deal with prediction from incomplete measurements using PCA and Partial Least Squares (PLS) models (e.g., Nelson et al. [7], Arteaga and Ferrer [8], and Nelson et al. [9]).
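The EM-style imputation loop described above (fit a full-data PCA with NIPALS, replace the missing entries with the model predictions, repeat until the imputed values stop changing) can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions, not the implementation used in the cited work: the function names `nipals` and `impute_pca`, the use of `None` to mark missing entries, the column-mean starting guess, and the tolerances are all hypothetical choices, and mean centering and autoscaling are omitted for brevity.

```python
import math

def nipals(X, n_components, tol=1e-12, max_iter=500):
    """NIPALS PCA on a complete matrix X (list of rows).
    Returns scores T and loadings P, one list per component."""
    n, m = len(X), len(X[0])
    E = [row[:] for row in X]  # residual matrix, deflated after each component
    T, P = [], []
    for _ in range(n_components):
        # Start the score vector from the column with the largest sum of squares.
        j0 = max(range(m), key=lambda j: sum(E[i][j] ** 2 for i in range(n)))
        t = [E[i][j0] for i in range(n)]
        p = [0.0] * m
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            if tt == 0.0:
                break
            # Regress columns of E on t, then normalize the loading vector.
            p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
            norm = math.sqrt(sum(v * v for v in p))
            if norm == 0.0:
                break
            p = [v / norm for v in p]
            # Regress rows of E on p to update the scores.
            t_new = [sum(E[i][j] * p[j] for j in range(m)) for i in range(n)]
            diff = sum((a - b) ** 2 for a, b in zip(t_new, t))
            t = t_new
            if diff <= tol * sum(v * v for v in t):
                break
        # Deflate: remove the fitted component from the residual.
        for i in range(n):
            for j in range(m):
                E[i][j] -= t[i] * p[j]
        T.append(t)
        P.append(p)
    return T, P

def impute_pca(X, n_components, tol=1e-8, max_iter=1000):
    """Iterative imputation: X uses None for missing entries (assumption).
    Returns the filled matrix plus the scores and loadings of the last fit."""
    n, m = len(X), len(X[0])
    missing = [(i, j) for i in range(n) for j in range(m) if X[i][j] is None]
    filled = [row[:] for row in X]
    # Initial guess: mean of the observed values in each column.
    for j in range(m):
        obs = [X[i][j] for i in range(n) if X[i][j] is not None]
        mean_j = sum(obs) / len(obs) if obs else 0.0
        for i in range(n):
            if X[i][j] is None:
                filled[i][j] = mean_j
    T, P = [], []
    for _ in range(max_iter):
        T, P = nipals(filled, n_components)  # full-data PCA problem
        change = 0.0
        for i, j in missing:
            pred = sum(T[a][i] * P[a][j] for a in range(len(T)))
            change = max(change, abs(pred - filled[i][j]))
            filled[i][j] = pred  # update imputed value with the prediction
        if change < tol:
            break
    return filled, T, P

# Toy usage: a rank-one matrix with one entry removed.
X_demo = [[1.0, 2.0, 3.0], [2.0, 4.0, None], [3.0, 6.0, 9.0], [4.0, 8.0, 12.0]]
filled_demo, T_demo, P_demo = impute_pca(X_demo, n_components=1)
```

Each outer iteration solves a complete PCA problem, which is the cost the abstract refers to when it notes that this approach takes a long time to converge.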
In this work we propose a Nonlinear Programming (NLP) based method for obtaining the parameters of the PCA model in the presence of missing values. Solving a constrained NLP eliminates the problems associated with the modified NIPALS algorithm. Moreover, this method directly minimizes the squared prediction error, and thus removes the need to solve multiple PCA problems with different imputed values. This reduces the number of calculations required relative to the EM algorithm, while still yielding the optimal solution in the least-squares sense. The approach is illustrated using randomly generated data as well as data obtained from the pharmaceutical industry.
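The abstract does not state the NLP itself. As a hedged illustration only, one way a problem of this kind could be posed is to minimize the squared reconstruction error over the observed entries while enforcing orthonormal loadings as an equality constraint. All symbols below are assumptions introduced for this sketch: x_ij is an entry of the mean-centered and scaled n-by-m data matrix, Omega is the set of index pairs (i, j) with an observed value, t_ia and p_ja are entries of the score matrix T (n by A) and loading matrix P (m by A), and A is the number of retained components.

```latex
\begin{aligned}
\min_{T,\,P}\quad & \sum_{(i,j)\in\Omega} \Bigl( x_{ij} - \sum_{a=1}^{A} t_{ia}\, p_{ja} \Bigr)^{2} \\
\text{s.t.}\quad & P^{\top} P = I_{A}
\end{aligned}
```

In a formulation like this, the missing entries never appear in the objective, so no imputed values are needed, and the constraint keeps the loadings orthogonal. The authors' actual formulation may differ, for example by including additional constraints on the scores.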

References


[1]    Wold, H. Estimation of Principal Components and Related Models by Iterative Least Squares. In Krishnaiah, P., ed., Multivariate Analysis. Academic Press, New York, 1966, pp. 391–420.


[2]    Little, R. J. A.; Rubin, D. B. Statistical Analysis with Missing Data. Second ed. Wiley-Interscience, 2002.


[3]    Grung, B.; Manne, R. Missing Values in Principal Component Analysis. Chemom. Intell. Lab. Syst. 1998, 42, 125.


[4]    Adams, E.; Walczak, B.; Vervaet, C.; Risha, P. G.; Massart, D. Principal Component Analysis of Dissolution Data with Missing Elements. Int. J. Pharm. 2002, 234, 169.


[5]    Walczak, B.; Massart, D. L. Dealing with Missing Data: Part I. Chemom. Intell. Lab. Syst. 2001, 58, 15.


[6]    Walczak, B.; Massart, D. L. Dealing with Missing Data: Part II. Chemom. Intell. Lab. Syst. 2001, 58, 29.


[7]    Nelson, P. R. C.; Taylor, P. A.; MacGregor, J. F. Missing Data Methods in PCA and PLS: Score Calculations with Incomplete Observations. Chemom. Intell. Lab. Syst. 1996, 35 (1), 45.


[8]    Arteaga, F.; Ferrer, A. Dealing with Missing Data in MSPC: Several Methods, Different Interpretations, Some Examples. J. Chemom. 2002, 16, 408.


[9]    Nelson, P. R. C.; MacGregor, J. F.; Taylor, P. A. The Impact of Missing Measurements on PCA and PLS Prediction and Monitoring Applications. Chemom. Intell. Lab. Syst. 2006, 80 (1), 1.