(646c) An Information Entropy Based Criterion for Variable Selection Performance Assessment

Authors: 
He, Q. P., Auburn University
Suthar, K., Auburn University
Lee, J., Auburn University
With the ever-accelerating advancement of information, communication, sensing and characterization technologies, a tremendous amount of data is generated and stored every day. Such so-called “Big Data” are often extremely high-dimensional, contaminated by noise, and interspersed with a large number of irrelevant or redundant features, making it a challenging task to retrieve useful information from the data [1], [2]. Variable selection has been one of the practical approaches to reducing data dimensionality prior to data interpretation or modeling. Even for projection-based dimension reduction methods such as principal component analysis (PCA) and partial least squares (PLS), variable selection is often applied as a pre-processing step to further improve the modeling performance [3], [4]. In the last few years, many different variable selection methods have been reported. However, how to evaluate the performance of variable selection methods, and in particular how to evaluate it directly, has received limited attention. The commonly applied criteria for assessing variable selection performance either measure the effects of variable selection indirectly, such as through the prediction performance of a model, or require ground truth of variable relevancy, which is not available in practical applications.

To address this limitation, this paper presents an information entropy based consistency index (Ic) to directly evaluate the performance of variable selection methods. The proposed method is based on the hypothesis that the same set of relevant variables would be selected when different training data sets are utilized to build a model. Therefore, the proposed Ic index examines the consistency among the variables selected using different training data. The proposed Ic does not require any ground truth of variable relevancy, but can still make use of such information should it be available. Both simulated (with ground truth) and industrial (without ground truth) case studies are provided to demonstrate how Ic performs in comparison with commonly used criteria. It is shown that the proposed index overcomes some of the limitations of existing indices, and the simulated case studies in this work show that Ic gives more objective assessments than the existing indices. The industrial case study shows that Ic is highly correlated with the performance of the resulting soft sensor, validating the need for and benefits of directly assessing variable selection consistency.
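The idea of scoring selection consistency with information entropy can be illustrated with a minimal sketch. This is not the Ic formula from the paper, which the abstract does not state. It assumes one plausible construction: record how often each variable is selected across training subsets, compute the binary entropy of that selection frequency, and report one minus the mean entropy. The function names `selection_matrix` and `entropy_consistency` are hypothetical and used only for illustration.

```python
import numpy as np


def selection_matrix(selections, n_vars):
    """Build a runs-by-variables 0/1 matrix from lists of selected indices."""
    mat = np.zeros((len(selections), n_vars), dtype=float)
    for row, chosen in enumerate(selections):
        for j in chosen:
            mat[row, j] = 1.0
    return mat


def entropy_consistency(selections, n_vars):
    """Illustrative consistency score in [0, 1] (an assumption, not the paper's Ic).

    p[j] is the fraction of training subsets in which variable j was selected.
    A variable that is always or never selected has zero binary entropy;
    a variable selected half of the time has the maximum entropy of 1 bit.
    """
    p = selection_matrix(selections, n_vars).mean(axis=0)
    h = np.zeros_like(p)
    mixed = (p > 0) & (p < 1)  # entropy is zero where p is exactly 0 or 1
    h[mixed] = -(p[mixed] * np.log2(p[mixed])
                 + (1 - p[mixed]) * np.log2(1 - p[mixed]))
    return 1.0 - h.mean()


# The same three variables are selected from every training subset.
stable = [{0, 1, 2}] * 4
# Every variable is selected in exactly half of the training subsets.
unstable = [{0, 1}, {2, 3}, {0, 2}, {1, 3}]

print(entropy_consistency(stable, n_vars=5))
print(entropy_consistency(unstable, n_vars=4))
```

Under this assumed definition, the stable case scores 1.0 and the unstable case scores 0.0. No ground truth of variable relevancy enters the calculation, which matches the property claimed for Ic above.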

References:

[1] J.-A. Ting, A. D’Souza, S. Vijayakumar, and S. Schaal, “Efficient learning and feature selection in high-dimensional regression,” Neural Comput., vol. 22, pp. 831–886, 2010.

[2] L. Comminges and A. S. Dalalyan, “Tight conditions for consistent variable selection in high dimensional nonparametric regression,” in COLT, 2011, pp. 187–206.

[3] Z. X. Wang, Q. He, and J. Wang, “Comparison of different variable selection methods for partial least squares soft sensor development,” in 2014 American Control Conference, 2014, pp. 3116–3121.

[4] Z. X. Wang, Q. P. He, and J. Wang, “Comparison of variable selection methods for PLS-based soft sensor modeling,” J. Process Control, vol. 26, pp. 56–72, 2015.