(112c) A Novel Variable Selection Method for Spectrum-Based Soft Sensor Development

He, Q. P., Auburn University
Lee, J., Auburn University
Wang, J., Auburn University

A Novel Variable Selection Method for Spectrum-based Soft Sensor Development

Jangwon Lee*, Jin Wang*, Q. Peter He*+

of Chemical Engineering, Auburn University, Auburn, AL 36849 USA

jzl0164@auburn.edu; JW: wang@auburn.edu; +QPH: qhe@auburn.edu)

In the last few decades, spectroscopic techniques such as
near-infrared (NIR) spectroscopy have found wide application in the oil and gas
industry. As a result, various soft sensors have been developed to predict
a sample’s properties from its spectroscopic readings. Because the readings at
different wavelengths are highly correlated, it has been shown that variable
selection can significantly improve a soft sensor’s prediction performance
and reduce model complexity. Existing variable selection methods focus on
selecting the variables (i.e., wavelengths or wavelength segments) that are
strongly correlated with the dependent variable to improve prediction
performance. Although many successful applications have been reported, such
methods have limitations: for example, the selected wavelengths often have
no clear relationship with the chemical bonds or functional groups present
in the sample. As a result, these methods can lack robustness: their
performance can be highly sensitive to the choice of training data and can
deteriorate when they are tested on new samples.

In this work, we present a novel variable selection method
that integrates variable stability and the variable importance in the projection
(VIP) score and transforms them into a probability of variable importance. By
incorporating the random selection principle of the genetic algorithm (GA),
variables are randomly selected according to these importance probabilities.
This avoids the situation in which certain wavelengths are never selected, as
can happen when deterministic variable importance criteria are used. Using
several case studies, including gasoline and biodiesel, we compare
the performance of the proposed method with that of existing variable selection
methods, including competitive adaptive reweighted sampling (CARS), VIP, and
GA. We show that the proposed method has several advantages: (1) significantly
better performance in all case studies; (2) identification of wavelength
segments that are clearly related to the chemical bonds and functional groups
present in the sample; (3) better robustness; and (4) fewer parameters and much
simpler tuning/training than GA.
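
The abstract does not give the formulas used to turn stability and VIP into selection probabilities, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes that stability is the ratio |mean| / standard deviation of regression coefficients over resampled models (one common definition), that the two scores are combined by a normalized product, and that a small uniform floor keeps every wavelength selectable. The function names, the `floor` parameter, and the toy numbers are all hypothetical.

```python
import numpy as np


def stability_scores(coef_samples):
    """Assumed stability: |mean| / std of each coefficient across resampled models.

    coef_samples has shape (n_resamples, n_wavelengths).
    """
    coef_samples = np.asarray(coef_samples, dtype=float)
    mean = coef_samples.mean(axis=0)
    std = coef_samples.std(axis=0, ddof=1)
    return np.abs(mean) / np.maximum(std, 1e-12)  # guard against zero spread


def importance_probabilities(vip, stability, floor=1e-3):
    """Combine VIP and stability into selection probabilities (assumed scheme).

    Each score is normalized to sum to one, the two are multiplied, and the
    result is mixed with a uniform distribution so no probability is zero.
    """
    vip = np.asarray(vip, dtype=float)
    stability = np.asarray(stability, dtype=float)
    combined = (vip / vip.sum()) * (stability / stability.sum())
    prob = combined / combined.sum()
    return (1.0 - floor) * prob + floor / prob.size


def sample_wavelengths(prob, n_select, rng):
    """Randomly draw n_select distinct wavelength indices, weighted by prob."""
    chosen = rng.choice(prob.size, size=n_select, replace=False, p=prob)
    return np.sort(chosen)


# Toy illustration with made-up numbers (five wavelengths, 50 resampled models).
rng = np.random.default_rng(0)
coef_samples = rng.normal(loc=[0.0, 0.5, 2.0, 0.1, 1.0], scale=0.2, size=(50, 5))
vip = np.array([0.3, 0.9, 1.8, 0.4, 1.2])  # hypothetical VIP scores

prob = importance_probabilities(vip, stability_scores(coef_samples))
selected = sample_wavelengths(prob, n_select=3, rng=rng)
print("selection probabilities:", np.round(prob, 3))
print("selected wavelength indices:", selected)
```

Because the draw is random rather than a hard threshold, a wavelength with a low combined score is rarely chosen but is never excluded outright, which is the behavior the abstract contrasts with deterministic importance criteria.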