(199h) Improving Near-Infrared Spectrum-Based Soft Sensor Performance through Variable Selection

Authors: 
Lee, J., Auburn University
Wang, J., Auburn University
He, Q. P., Auburn University

Jangwon Lee, Jin Wang, Q. Peter He

In the last few decades, spectroscopic techniques such as near-infrared (NIR) spectroscopy have found wide application in fields such as the petrochemical, pharmaceutical and food industries. As a result, various soft sensors have been developed to predict a sample's properties from its spectroscopic readings. However, three challenges need to be addressed when using NIR data for soft sensor development: (1) multicollinearity, i.e., the readings at different wavelengths are highly correlated; (2) spectral noise; and (3) the curse of dimensionality, i.e., usually fewer samples than variables are available. Variable selection is a good approach to addressing these challenges, and many variable selection methods have been developed. However, all existing variable selection methods focus on selecting the variables (i.e., wavelengths or wavelength segments) that are strongly correlated with the dependent variable in order to improve prediction performance. Although many successful applications have been reported, such methods have limitations; for example, the selected wavelengths do not have a clear relationship with the chemical bonds or functional groups present in the sample. As a result, these methods can have robustness issues. Specifically, their performance can be highly sensitive to the choice of training data and can deteriorate noticeably when they are applied to new samples. In this work, we present a novel variable selection method based on the fundamental principle of “survival of the fittest”. The proposed method consists of two steps. Step 1 constructs a library of informative variables (i.e., chromosomes). Step 2 determines the optimal variables based on the chromosome library. The proposed method integrates variable stability and the variable importance in the projection (VIP) score and transforms them into a probability of variable importance.
By incorporating the random selection principle of the genetic algorithm (GA), the method selects variables randomly according to their importance probabilities, which avoids the situation in which certain wavelengths are never selected, as can happen when deterministic variable importance criteria are used. In this study, using a variety of NIR data sets from the petrochemical, pharmaceutical and food industries, we compare the performance of the proposed method with that of existing variable selection methods based on survival of the fittest, including competitive adaptive reweighted sampling (CARS), stability and variable permutation (SVP), genetic algorithm (GA), etc. We show that the proposed method has several advantages: (1) significantly better performance in all case studies; (2) identification of wavelength segments that are clearly related to the chemical bonds and functional groups present in the sample; (3) better robustness; and (4) fewer parameters and simpler tuning/training than GA.
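
The contrast between probabilistic and deterministic selection can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how stability and VIP are combined, so the product rule in `importance_probabilities` is an assumption, and the function names, the score values, the chromosome size of 3 and the library size of 200 are all hypothetical. The sketch only shows that weighted random draws leave every wavelength with a nonzero chance of entering a chromosome, whereas a fixed top-k cutoff always returns the same subset.

```python
import random


def importance_probabilities(stability, vip):
    """Turn per-variable stability and VIP scores into selection probabilities.

    Assumption: the two scores are multiplied and normalized to sum to 1.
    The abstract does not give the actual transformation.
    """
    if len(stability) != len(vip):
        raise ValueError("stability and vip must have the same length")
    if min(stability) <= 0 or min(vip) <= 0:
        raise ValueError("scores must be strictly positive")
    combined = [s * v for s, v in zip(stability, vip)]
    total = sum(combined)
    return [c / total for c in combined]


def sample_chromosome(probs, n_select, rng):
    """Draw n_select distinct variable indices, weighted by probs."""
    if n_select > len(probs):
        raise ValueError("cannot select more variables than are available")
    remaining = list(range(len(probs)))
    weights = list(probs)
    chosen = []
    for _ in range(n_select):
        # Weighted draw of one position among the variables not yet chosen.
        pos = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pos))
        weights.pop(pos)
    return sorted(chosen)


def top_k(probs, k):
    """Deterministic baseline: always keep the k highest-probability variables."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Hypothetical scores for six wavelengths (illustrative values only).
    stability = [5.0, 4.0, 0.5, 3.0, 0.2, 2.0]
    vip = [1.8, 1.5, 0.6, 1.2, 0.3, 0.9]

    probs = importance_probabilities(stability, vip)
    rng = random.Random(0)  # fixed seed so repeated runs give the same draws

    # A library of randomly drawn chromosomes (hypothetical size).
    library = [sample_chromosome(probs, 3, rng) for _ in range(200)]
    ever_selected = sorted({i for chrom in library for i in chrom})

    print("deterministic top-3:", top_k(probs, 3))
    print("selected at least once in the random library:", ever_selected)
```

In a full method, each chromosome in the library would also be scored by a model fit (for example, cross-validated prediction error) before the final variables are chosen; that step is omitted here because the abstract does not describe it.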