(177a) Obtaining Parsimonious Regression Models with Large Datasets

Authors 

Schmidt, A. - Presenter, The Dow Chemical Company
Rendall, R., University of Coimbra
Chiang, L., Dow Inc.
Chin, S. T., The Dow Chemical Company
Reis, M., University of Coimbra
The advantages of using feature selection methods[1] have long been established in the literature and include lower dimensionality of the resulting dataset, easier identification of important features and lower risk of overfitting. These advantages are critical assets in today’s context of big data and with the advent of Industry 4.0 initiatives, where a large and increasing number of variables and features are collected. However, larger data sets do not always correspond to additional sources of variability, and the opposite scenario is often observed: the ratio of critical to irrelevant features tends toward zero as the data set grows. Therefore, in this paper, a two-stage approach for developing regression models from large data sets is proposed. In the first stage, a feature selection method is applied with the objective of removing noisy and irrelevant features while also keeping the number of missed detections (relevant variables not selected) low. In the second stage, the set of selected features is combined with a regression method, which may incorporate a further step of feature selection. Two simulated datasets, for continuous and batch processes, and an industrial dataset were used as case studies to validate the effectiveness of the proposed approach. Using feature selection methods led to the selection of important predictor variables while also improving prediction performance, since the models developed were more parsimonious.
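
As an illustration of this two-stage workflow, a minimal sketch in Python with scikit-learn is given below; the library, the synthetic data and all settings are assumptions made here for illustration and are not prescribed by the study.

```python
# Minimal sketch of the two-stage approach: a filter-based feature selection
# stage followed by a regression stage (PLS here). Library, data and settings
# are illustrative assumptions, not those of the original study.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.cross_decomposition import PLSRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 200 candidate features, only the first 5 are relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 200))
y = X[:, :5] @ rng.normal(size=5) + 0.1 * rng.normal(size=300)

two_stage = Pipeline([
    # Stage 1: keep the 20 features with the highest mutual information score.
    ("filter", SelectKBest(mutual_info_regression, k=20)),
    # Stage 2: regression on the retained features; PLS adds its own
    # dimensionality reduction through its latent components.
    ("pls", PLSRegression(n_components=3)),
])
scores = cross_val_score(two_stage, X, y, cv=5, scoring="r2")
print(f"cross-validated R^2: {scores.mean():.3f}")
```

In this sketch the first stage discards most of the noisy columns, so the regression model in the second stage is fit on a much smaller and more relevant predictor set, which is the sense in which the resulting model is more parsimonious.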

Three main classes of feature selection approaches are available[2]: filter methods, wrapper methods and embedded methods. Filters are mostly based on univariate measures of association between predictors and response variables, and tend to be more computationally efficient than wrappers and embedded methods. In the literature on filter methods, a comprehensive set of filters has been proposed and tested for classification tasks[3], while filters for regression problems remain largely unexplored[4]. This research focuses on mitigating this gap by assessing and comparing the performance of different filters for feature selection in regression problems. Various association metrics are considered, including Pearson’s correlation coefficient, Spearman’s correlation, Kendall’s correlation, mutual information, and also combinations of mutual information with other filters. These filters have the flexibility to account for various relationships between predictors and response variables and can capture linear correlations (Pearson’s correlation), monotonic relationships (Spearman’s correlation) and non-linear associations (mutual information).
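
In practice, such filters amount to scoring each candidate predictor against the response with a chosen association measure. A hedged sketch of how these scores might be computed is shown below; the function name, the use of SciPy/scikit-learn and the selection rule are assumptions for illustration, not details from the paper.

```python
# Sketch of filter scores for each candidate predictor (illustrative only;
# the paper's exact thresholds and selection rules are not reproduced here).
import numpy as np
from scipy import stats
from sklearn.feature_selection import mutual_info_regression

def filter_scores(X, y):
    """One association score per column of X, for each filter considered."""
    n = X.shape[1]
    return {
        # linear correlation
        "pearson": np.array([abs(stats.pearsonr(X[:, j], y)[0]) for j in range(n)]),
        # monotonic relationships
        "spearman": np.array([abs(stats.spearmanr(X[:, j], y)[0]) for j in range(n)]),
        "kendall": np.array([abs(stats.kendalltau(X[:, j], y)[0]) for j in range(n)]),
        # non-linear associations
        "mutual_info": mutual_info_regression(X, y),
    }

# Features whose score exceeds a chosen cutoff (e.g. the top-k values, or a
# threshold calibrated on permuted noise columns) are passed to the second stage.
```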

Two key performance indicators (KPIs) are used to quantify the performance of the different filters. The first KPI measures whether relevant variables are selected and noisy ones are removed. This KPI can only be used in a simulation setting, where the data-generating mechanisms are known. The second KPI assesses the improvement in prediction performance obtained when different filters are applied prior to model building. Furthermore, various regression methods, including Partial Least Squares (PLS) regression, were also tested in order to assess their interactions with the filters considered.
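
One possible reading of these two KPIs in code is sketched below; the function names and exact formulas are hypothetical interpretations of the text, and the first KPI is only computable for the simulated case studies, where the truly relevant features are known.

```python
# Hedged sketch of the two KPIs as read from the text (names are hypothetical).
import numpy as np

def selection_kpi(selected_idx, true_idx, n_features):
    """KPI 1 (simulation only): how well relevant features are recovered and
    noisy ones removed, given the known data-generating mechanism."""
    selected, relevant = set(selected_idx), set(true_idx)
    recall = len(selected & relevant) / len(relevant)            # relevant features kept
    irrelevant = set(range(n_features)) - relevant
    specificity = len(irrelevant - selected) / len(irrelevant)   # noisy features removed
    return recall, specificity

def prediction_kpi(y_true, y_pred_with_filter, y_pred_without_filter):
    """KPI 2: reduction in test RMSE when a filter precedes model building."""
    rmse = lambda err: float(np.sqrt(np.mean(np.square(err))))
    return rmse(y_true - y_pred_without_filter) - rmse(y_true - y_pred_with_filter)
```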

References

[1] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artif. Intell., vol. 97, no. 1–2, pp. 273–324, 1997.

[2] V. Kumar and S. Minz, “Feature Selection: A Literature Review,” Smart Comput. Rev., vol. 4, no. 3, pp. 211–229, 2014.

[3] G. Chandrashekar and F. Sahin, “A survey on feature selection methods,” Comput. Electr. Eng., vol. 40, no. 1, pp. 16–28, 2014.

[4] I. Guyon and A. Elisseeff, “An Introduction to Variable and Feature Selection,” J. Mach. Learn. Res., vol. 3, pp. 1157–1182, 2003.