# (104g) Automated Dimension Reduction with Principal Component Analysis Using Area Under Curve

#### AIChE 2021 Annual Meeting

#### Computing and Systems Technology Division

#### Advances in Machine Learning and Intelligent Systems I

#### Monday, November 8, 2021 - 2:00pm to 2:15pm

Several criteria have been proposed for selecting the model hyper-parameter in unsupervised learning algorithms, including the Akaike information criterion [2], the Kaiser rule [3], parallel analysis [4], and the variance of reconstruction error (VRE) [5]. These methods assume (i) that the measurements depend linearly on an unknown number of latent variables and (ii) that the measurement errors are sampled independently from the same univariate normal distribution. A cross-validated ignorance score [6], which interprets PCA as a density model as in probabilistic principal component analysis (PPCA) [7], was later proposed to find the optimal number of latent variables in PCA. However, this method assumes, as in PPCA, that the latent variables are normally distributed.

In this work, we present an automated latent variable model selection method for PCA that does not assume normally distributed latent variables. To this end, the data X ∈ R^{m x n}, where m is the number of samples and n is the number of variables, is split into a calibration block X_{c} ∈ R^{m_c x n} and a validation block X_{v} ∈ R^{m_v x n}, where m_{c} + m_{v} = m. We then create a randomized copy of the validation block by randomly permuting the row positions of the elements within each column separately. This randomized block, called the faulty data block, is denoted X_{f} ∈ R^{m_v x n}. The original validation block contains normal data, which a chosen model should be able to describe adequately. In contrast, the faulty block represents anomalous data that a suitable PCA model should not describe. This transforms the unsupervised learning problem into a supervised one. The model is calibrated on the calibration data with a given number of PCs. The PCs are obtained from the singular value decomposition of the covariance matrix of the mean-centered calibration block, Y_{m}^{T}*Y_{m}/(m_{c}-1) = UΣU^{T}, where Y_{m} = X_{c} - **1***M^{T}; **1** is a column vector of ones; M is the column-wise mean vector; the columns of U are the principal component loadings; and Σ is a diagonal matrix containing the nonzero singular values (variances), ordered from largest to smallest. A model with K principal components can then be chosen as PC_{K} = U(1:K), where U(1:K) denotes the first K columns of U. We then compute the value of the Q statistic [8] for the data samples in the validation and faulty blocks. The Q statistic is calculated by first approximating the validation and faulty blocks through a projection followed by a reconstruction, given by Y_{v}*PC_{K}*PC_{K}^{T} and Y_{f}*PC_{K}*PC_{K}^{T}, where Y_{v} = X_{v} - **1***M^{T} and Y_{f} = X_{f} - **1***M^{T}.
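As a minimal sketch of the procedure above (illustrative code under our own naming, not the authors' implementation), the data split, faulty-block construction, and PCA calibration can be written with NumPy:

```python
import numpy as np

def calibrate_pca(X, m_c, rng):
    """Split X into calibration/validation blocks, build the faulty block
    by column-wise row permutation, and compute the PC loadings.
    (Hypothetical helper name; a sketch of the steps described above.)"""
    X_c, X_v = X[:m_c], X[m_c:]                      # calibration / validation
    # Faulty block: permute each column of the validation block separately
    X_f = np.column_stack([rng.permutation(col) for col in X_v.T])
    M = X_c.mean(axis=0)                             # column-wise mean vector
    Y_m = X_c - M                                    # mean-centered block
    # SVD of the covariance matrix Y_m^T Y_m / (m_c - 1) = U Sigma U^T
    U, sigma, _ = np.linalg.svd(Y_m.T @ Y_m / (m_c - 1))
    return X_v, X_f, M, U                            # columns of U = loadings
```

A model with K components then uses the first K columns, `PC_K = U[:, :K]`.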
Then, the Q statistic for each sample in the validation and faulty blocks is given by

Q_{v}(i) = E_{v}(i)*E_{v}^{T}(i) and Q_{f}(i) = E_{f}(i)*E_{f}^{T}(i), for i = 1, ..., m_{v},

where E_{v}(i) and E_{f}(i) denote the i^{th} rows of the residual error matrices E_{v} = Y_{v} - Y_{v}*PC_{K}*PC_{K}^{T} and E_{f} = Y_{f} - Y_{f}*PC_{K}*PC_{K}^{T}.
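The per-sample Q statistic above amounts to the squared row norms of the residual matrix; a short sketch (with a hypothetical helper name):

```python
import numpy as np

def q_statistics(Y, PC_K):
    """Q(i) = E(i) E(i)^T, where E = Y - Y PC_K PC_K^T is the residual
    after projecting onto the first K loadings and reconstructing."""
    E = Y - Y @ PC_K @ PC_K.T          # residual error matrix
    return np.sum(E ** 2, axis=1)      # one Q value per sample (row)
```

Applied to the mean-centered validation and faulty blocks, this yields Q_v and Q_f.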

A receiver operating characteristic (ROC) curve [9] is constructed by plotting the true positive rate (TPR), computed on the faulty data block, against the false positive rate (FPR), computed on the validation block. The TPR and FPR values are computed as follows:

- Generate a 2*m_{v}-dimensional vector Q by stacking Q_{v} and Q_{f} atop each other.
- Sort the values in Q in ascending order.
- Choose an upper control limit Q_{UCL} = Q(i). The FPR is the fraction of values in Q_{v} that exceed Q_{UCL}, and the TPR is the fraction of values in Q_{f} that exceed Q_{UCL}. Each computed (FPR, TPR) pair is a single point on the ROC curve; the complete curve is generated by sweeping the upper control limit, i.e., Q_{UCL} = Q(i) for i = 1, ..., 2*m_{v}.
- Obtain the ROC curve by plotting the TPR as a function of the FPR.
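The threshold sweep above can be sketched as follows (illustrative function name; each pooled Q value serves once as the upper control limit):

```python
import numpy as np

def roc_points(Q_v, Q_f):
    """Sweep Q_UCL over the pooled, sorted Q values: the FPR is the
    fraction of validation Q values above the limit, and the TPR is
    the fraction of faulty-block Q values above it."""
    Q = np.sort(np.concatenate([Q_v, Q_f]))          # 2*m_v thresholds
    fpr = np.array([(Q_v > q).mean() for q in Q])
    tpr = np.array([(Q_f > q).mean() for q in Q])
    return fpr, tpr
```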

Once the ROC curves are constructed, we choose the optimal number of latent variables K by comparing the curves obtained for every candidate K. To do this automatically, we calculate the area under the curve (AUC) [9], computed as the integral of the piece-wise constant function defined by the (FPR, TPR) pairs of the ROC curve. The AUC serves as a metric of a chosen model's ability to separate anomalous (faulty) data samples from normal ones. The model whose AUC is the maximum value above 0.5 (the score of a random classifier) has the best classifier performance, and its number of PCs is the chosen number of latent variables.
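Given the (FPR, TPR) points for each candidate K, the AUC comparison can be sketched as below; we integrate with the trapezoidal rule here, a close stand-in for the piece-wise constant integral described above, and the function names are our own:

```python
import numpy as np

def auc_from_roc(fpr, tpr):
    """Area under the ROC curve: sort the points by FPR and accumulate
    trapezoid areas between successive points."""
    order = np.argsort(fpr)
    f, t = np.asarray(fpr)[order], np.asarray(tpr)[order]
    return float(np.sum(0.5 * (t[1:] + t[:-1]) * np.diff(f)))

def select_num_pcs(auc_by_k):
    """Pick the candidate K with the largest AUC; a random classifier
    would score 0.5."""
    return max(auc_by_k, key=auc_by_k.get)
```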

To show the algorithm's efficacy, we use the simulated data class B1 given in [6], specifically data types B1.01-B1.06. For this data set, the known number of true PCs is three, and the number of variables is nine. The data types B1.01-B1.06 are simulated with a covariance matrix composed of blocks, each of which has the same value in all row-column positions. The blocks are chosen such that the fraction of non-noisy variance in the total variance decreases gradually from B1.01 to B1.06. We run 100 iterations of the algorithm for each data type. Accuracy is defined as the fraction of data-set instances for which the identified number of PCs matches the ground-truth value exactly. In our simulation results, the proposed algorithm selects the optimal number of latent variables with 97% accuracy for data types B1.01-B1.06.
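The exact B1 data class is defined in [6]; a structurally similar generator (three rank-one covariance blocks, hence three true PCs, plus isotropic noise) can be sketched as:

```python
import numpy as np

def make_b1_like(m, rho, noise_var, rng):
    """Hypothetical B1-like generator: 9 variables in 3 blocks of 3,
    each block filled with the same value rho (rank one), plus iid
    noise of variance noise_var. Not the exact data class from [6]."""
    signal_cov = np.kron(np.eye(3), np.full((3, 3), rho))  # rank 3 signal
    cov = signal_cov + noise_var * np.eye(9)
    return rng.multivariate_normal(np.zeros(9), cov, size=m)
```

Lowering `rho` relative to `noise_var` mimics the decreasing fraction of non-noisy variance from B1.01 to B1.06.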

**References**

[1] Jolliffe, Ian T., and Jorge Cadima. Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374.2065 (2016): 20150202.

[2] Akaike, H. Information theory and an extension of the maximum likelihood principle. In Proceedings 2nd International Symposium on Information Theory; Petrov and Caski, Eds., 1974; pp 267-281.

[3] Kaiser, H. F. The application of electronic computers to factor analysis. Educ. Psychol. Meas. 1960, 20 (1), 141-151.

[4] Zwick, W. R.; Velicer, W. F. Comparison of five rules for determining the number of components to retain. Psychol. Bull. 1986, 99 (3), 432-442.

[5] Valle, Sergio, Weihua Li, and S. Joe Qin. Selection of the number of principal components: the variance of the reconstruction error criterion with a comparison to other methods. Industrial & Engineering Chemistry Research 38.11 (1999): 4389-4401.

[6] Russo, Stefania, Guangyu Li, and Kris Villez. Automated model selection in principal component analysis: a new approach based on the cross-validated ignorance score. Industrial & Engineering Chemistry Research 58.30 (2019): 13448-13468.

[7] Tipping, M. E.; Bishop, C. M. Probabilistic principal component analysis. J. R. Stat. Soc. Ser. B-Stat. Methodol. 1999, 61, 611-622.

[8] Montgomery, D. C. Introduction to Statistical Quality Control, 4th ed.; Wiley: New York, 2001.

[9] Fawcett, Tom. An introduction to ROC analysis. Pattern recognition letters 27.8 (2006): 861-874.