# (317d) An Optimization-Based Strategy to Handle Missing Data in Partial Least Squares

Authors:
Carnegie Mellon University

E. Harinath, S. Garcia-Munoz and L. T. Biegler

Latent variable methods have proved to be a powerful approach for developing data-driven process models. Their objective is to find hidden structure in large data sets made up of highly correlated variables. The basic idea is to decompose the regressor space X, or both the regressor space and the response space Y, into subspaces spanned by a small number of basis vectors, called scores. For example, in Principal Component Regression (PCR), the score vectors of X, called principal components, are found by maximizing the variance explained in the X space, and Y is then regressed on these principal components to predict the response. However, PCR does not use any of the information available in the Y space when finding the dominant directions; it explains only the X space. In contrast, Partial Least Squares (PLS) methods find score vectors for both X and Y by maximizing the covariance between X and Y. PLS thus finds the dominant directions common to both the X and Y spaces.
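The contrast between the two criteria can be illustrated with a small numerical sketch. The synthetic data, variable names, and scales below are assumptions made for illustration only and are not taken from the work described here. For a single response, the first PLS weight vector is proportional to X^T y, while the first principal component is the leading right singular vector of X; the columns are only mean-centered, not scaled.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical data: column 0 has large variance but is unrelated to y,
# column 1 has small variance but determines y.
irrelevant = 3.0 * rng.normal(size=n)
relevant = 1.0 * rng.normal(size=n)
X = np.column_stack([irrelevant, relevant])
y = 2.0 * relevant + 0.01 * rng.normal(size=n)

# Mean-center both blocks.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# PCA/PCR direction: unit vector p maximizing var(Xc @ p).
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
p_pca = Vt[0]

# PLS direction (single response): unit vector w maximizing cov(Xc @ w, yc),
# which is proportional to Xc.T @ yc.
w_pls = Xc.T @ yc
w_pls /= np.linalg.norm(w_pls)

print("first PCA loading:", np.round(p_pca, 3))
print("first PLS weight: ", np.round(w_pls, 3))
```

In this constructed example the PCA direction aligns with the high-variance column, whereas the PLS weight aligns with the column that actually covaries with the response.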

The most popular method for finding the score and loading vectors in PLS is the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm. The key step of NIPALS is the deflation performed in each iteration of the algorithm. In [1], a version of the NIPALS algorithm is presented and its basic properties are analyzed, such as the orthogonality of the score vectors and of the weight vectors for the X space. These properties are mainly due to the deflation step.
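A minimal sketch of a textbook NIPALS iteration for PLS with complete data is given below, to make the role of deflation concrete. It is one common variant and is not claimed to be the exact version analyzed in [1]; the function name `nipals_pls`, the synthetic data, and the stopping rule are assumptions for illustration.

```python
import numpy as np

def nipals_pls(X, Y, n_components, tol=1e-10, max_iter=500):
    """Textbook NIPALS for PLS on complete data (illustrative sketch)."""
    X = X - X.mean(axis=0)          # work on centered copies
    Y = Y - Y.mean(axis=0)
    n, m = X.shape
    T = np.zeros((n, n_components))           # X scores
    W = np.zeros((m, n_components))           # X weights
    P = np.zeros((m, n_components))           # X loadings
    Q = np.zeros((Y.shape[1], n_components))  # Y loadings
    for a in range(n_components):
        u = Y[:, [0]]                         # initial Y score
        for _ in range(max_iter):
            w = X.T @ u
            w /= np.linalg.norm(w)            # unit-norm weight vector
            t = X @ w                         # X score
            q = Y.T @ t / (t.T @ t)           # Y loading
            u_new = Y @ q / (q.T @ q)         # updated Y score
            converged = np.linalg.norm(u_new - u) < tol * np.linalg.norm(u_new)
            u = u_new
            if converged:
                break
        p = X.T @ t / (t.T @ t)               # X loading
        X = X - t @ p.T                       # deflate X
        Y = Y - t @ q.T                       # deflate Y
        T[:, [a]], W[:, [a]], P[:, [a]], Q[:, [a]] = t, w, p, q
    return T, W, P, Q

# Hypothetical complete data set with two responses.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
Y = X @ rng.normal(size=(6, 2)) + 0.1 * rng.normal(size=(50, 2))

T, W, P, Q = nipals_pls(X, Y, n_components=3)
print(np.round(T.T @ T, 8))   # diagonal: scores are mutually orthogonal
print(np.round(W.T @ W, 8))   # identity: weights are orthonormal
```

Because each score vector is removed from X before the next component is extracted, `T.T @ T` comes out diagonal and `W.T @ W` comes out as the identity, up to rounding.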

In the process industries, input-output data sets often contain missing data points. For example, sensor failures, sensors taken offline for maintenance, loss of historical data, and missing experiments [2-3] all lead to incomplete data sets. With incomplete data, the NIPALS algorithm may fail to preserve the orthogonality of the scores and weights, which is essential for interpreting the resulting model. In [3], it is shown for PCA with incomplete data sets that the loading and score vectors determined by a modified NIPALS algorithm are not orthogonal.
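The loss of orthogonality can be reproduced with a small sketch of a common missing-data variant of NIPALS for PCA, in which every inner product simply skips the missing entries. This is an assumed, generic form of the "modified NIPALS" idea and not necessarily the exact algorithm studied in [3]; the function name `nipals_pca_missing`, the synthetic data, and the missingness pattern are all hypothetical.

```python
import numpy as np

def nipals_pca_missing(X, n_components, n_iter=200):
    """NIPALS for PCA that skips missing (NaN) entries (illustrative sketch)."""
    X = X - np.nanmean(X, axis=0)        # center using observed values only
    M = (~np.isnan(X)).astype(float)     # 1 where observed, 0 where missing
    E = np.nan_to_num(X, nan=0.0)        # residuals, zero at missing entries
    n, m = X.shape
    T = np.zeros((n, n_components))
    P = np.zeros((m, n_components))
    for a in range(n_components):
        t = E[:, np.argmax((E**2).sum(axis=0))].copy()  # start from largest column
        for _ in range(n_iter):
            p = (E.T @ t) / (M.T @ t**2)  # per-column regression on observed rows
            p /= np.linalg.norm(p)
            t = (E @ p) / (M @ p**2)      # per-row regression on observed columns
        E = M * (E - np.outer(t, p))      # deflate observed entries only
        T[:, a], P[:, a] = t, p
    return T, P

def abs_cos(a, b):
    """Absolute cosine of the angle between two vectors."""
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical data with three latent directions plus noise.
rng = np.random.default_rng(1)
n, m = 100, 8
L, _ = np.linalg.qr(rng.normal(size=(m, 3)))             # orthonormal loadings
Z = rng.normal(size=(n, 3)) * np.array([5.0, 3.0, 1.5])  # latent scores
X_full = Z @ L.T + 0.1 * rng.normal(size=(n, m))

# Hypothetical missingness pattern: roughly 14% of entries removed.
i_idx, j_idx = np.indices((n, m))
X_miss = X_full.copy()
X_miss[(i_idx + 3 * j_idx) % 7 == 0] = np.nan

T_full, P_full = nipals_pca_missing(X_full, 2)
T_miss, P_miss = nipals_pca_missing(X_miss, 2)

cos_full = abs_cos(T_full[:, 0], T_full[:, 1])
cos_miss = abs_cos(T_miss[:, 0], T_miss[:, 1])
print("score cosine, complete data:  ", cos_full)
print("score cosine, incomplete data:", cos_miss)
print("loading cosine, incomplete data:", abs_cos(P_miss[:, 0], P_miss[:, 1]))
```

With complete data the routine reduces to ordinary NIPALS and the two score vectors are orthogonal up to rounding; with entries removed, nothing in the update equations enforces orthogonality, so the cosine between the scores is no longer zero.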

To address the problems associated with the modified NIPALS algorithm for PCA, an efficient nonlinear optimization framework is presented in [3]. For PLS, however, the deflation process is difficult to formulate within an optimization framework when the data are incomplete. In [4], different optimization frameworks are discussed for multivariate regression methods. In particular, the undeflated PLS (UDPLS) method is presented, which removes the deflation process for the complete-data problem. In this work, we develop a UDPLS technique for the missing-data problem, in line with the work presented for PCA in [3]. The proposed technique is based on Nonlinear Programming (NLP) methods. We present simulation case studies in which we compare the parameters obtained from the modified NIPALS algorithm and from the proposed algorithm, and demonstrate the effectiveness of our optimization-based approach.

[1] Agnar Höskuldsson. PLS regression methods. Journal of Chemometrics, 2(3):211–228, 1988.

[2] Philip R.C. Nelson, Paul A. Taylor, and John F. MacGregor. Missing data methods in PCA and PLS: Score calculations with incomplete observations. Chemometrics and Intelligent Laboratory Systems, 35(1):45–65, 1996.

[3] Rodrigo López-Negrete de la Fuente, Salvador García-Munoz, and Lorenz T. Biegler. An efficient nonlinear programming strategy for PCA models with incomplete data sets. Journal of Chemometrics, 24(6):301–311, 2010.

[4] Alison J. Burnham, Roman Viveros, and John F. MacGregor. Frameworks for latent variable multivariate regression. Journal of Chemometrics, 10(1):31–45, 1996.
