(177a) Obtaining Parsimonious Regression Models with Large Datasets

Authors 

Schmidt, A. - Presenter, The Dow Chemical Company
Rendall, R., University of Coimbra
Chiang, L., Dow Inc.
Chin, S. T., The Dow Chemical Company
Reis, M., University of Coimbra
The advantages of using feature selection methods[1] have long been established in the literature and include lower dimensionality of the resulting dataset, easier identification of important features and lower risk of overfitting. These advantages are critical assets in today’s context of big data and with the advent of Industry 4.0 initiatives, where a large and increasing number of variables and features are collected. However, larger data sets do not always correspond to additional sources of variability, and the opposite scenario is often observed: the ratio of critical to irrelevant features tends toward zero as the data set grows. Therefore, in this paper, a two-stage approach for developing regression models from large data sets is proposed. In the first stage, a feature selection method is applied with the objective of removing noisy and irrelevant features while also keeping the number of missed detections (relevant variables not selected) low. In the second stage, the set of selected features is combined with a regression method, which may incorporate a further step of feature selection. Two simulated datasets, for continuous and batch processes, and an industrial dataset were used as case studies to validate the effectiveness of the proposed approach. Using feature selection methods led to the selection of important predictor variables while also improving prediction performance, since the models developed were more parsimonious.
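
As an illustration of this two-stage workflow, a minimal sketch in Python with scikit-learn is given below; the library, the synthetic data and all settings are assumptions made here for illustration and are not prescribed by the study.

```python
# Minimal sketch of the two-stage approach: a filter-based feature selection
# stage followed by a regression stage (PLS here). Library, data and settings
# are illustrative assumptions, not those of the original study.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.cross_decomposition import PLSRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 200 candidate features, only the first 5 are relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 200))
y = X[:, :5] @ rng.normal(size=5) + 0.1 * rng.normal(size=300)

two_stage = Pipeline([
    # Stage 1: keep the 20 features with the highest mutual information score.
    ("filter", SelectKBest(mutual_info_regression, k=20)),
    # Stage 2: regression on the retained features; PLS adds its own
    # dimensionality reduction through its latent components.
    ("pls", PLSRegression(n_components=3)),
])
scores = cross_val_score(two_stage, X, y, cv=5, scoring="r2")
print(f"cross-validated R^2: {scores.mean():.3f}")
```

In this sketch the first stage discards most of the noisy columns, so the regression model in the second stage is fit on a much smaller and more relevant predictor set, which is the sense in which the resulting model is more parsimonious.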

Three main classes of feature selection approaches are available[2]: filter methods, wrapper methods and embedded methods. Filters are mostly based on univariate measures of association between predictors and response variables, and tend to be more computationally efficient than wrappers and embedded methods. In the literature on filter methods, a comprehensive set of filters has been proposed and tested for classification tasks[3], while filters for regression problems remain largely unexplored[4]. This research focuses on mitigating this gap by assessing and comparing the performance of different filters for feature selection in regression problems. Various association metrics are considered, including Pearson’s correlation coefficient, Spearman’s correlation, Kendall’s correlation, mutual information, and also combinations of mutual information with other filters. These filters have the flexibility to account for various relationships between predictors and response variables and can capture linear correlations (Pearson’s correlation), monotonic relationships (Spearman’s correlation) and non-linear associations (mutual information).
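
In practice, such filters amount to scoring each candidate predictor against the response with a chosen association measure. A hedged sketch of how these scores might be computed is shown below; the function name, the use of SciPy/scikit-learn and the selection rule are assumptions for illustration, not details from the paper.

```python
# Sketch of filter scores for each candidate predictor (illustrative only;
# the paper's exact thresholds and selection rules are not reproduced here).
import numpy as np
from scipy import stats
from sklearn.feature_selection import mutual_info_regression

def filter_scores(X, y):
    """One association score per column of X, for each filter considered."""
    n = X.shape[1]
    return {
        # linear correlation
        "pearson": np.array([abs(stats.pearsonr(X[:, j], y)[0]) for j in range(n)]),
        # monotonic relationships
        "spearman": np.array([abs(stats.spearmanr(X[:, j], y)[0]) for j in range(n)]),
        "kendall": np.array([abs(stats.kendalltau(X[:, j], y)[0]) for j in range(n)]),
        # non-linear associations
        "mutual_info": mutual_info_regression(X, y),
    }

# Features whose score exceeds a chosen cutoff (e.g. the top-k values, or a
# threshold calibrated on permuted noise columns) are passed to the second stage.
```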

Two key performance indicators (KPIs) are used to quantify the performance of the different filters. The first KPI measures whether relevant variables are selected and noisy ones are removed. This KPI can only be used in a simulation setting, where the data-generating mechanisms are known. The second KPI assesses the improvement in prediction performance obtained when different filters are applied prior to model building. Furthermore, various regression methods, including Partial Least Squares (PLS) regression, were also tested in order to assess their interactions with the filters considered.
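
One possible reading of these two KPIs in code is sketched below; the function names and exact formulas are hypothetical interpretations of the text, and the first KPI is only computable for the simulated case studies, where the truly relevant features are known.

```python
# Hedged sketch of the two KPIs as read from the text (names are hypothetical).
import numpy as np

def selection_kpi(selected_idx, true_idx, n_features):
    """KPI 1 (simulation only): how well relevant features are recovered and
    noisy ones removed, given the known data-generating mechanism."""
    selected, relevant = set(selected_idx), set(true_idx)
    recall = len(selected & relevant) / len(relevant)            # relevant features kept
    irrelevant = set(range(n_features)) - relevant
    specificity = len(irrelevant - selected) / len(irrelevant)   # noisy features removed
    return recall, specificity

def prediction_kpi(y_true, y_pred_with_filter, y_pred_without_filter):
    """KPI 2: reduction in test RMSE when a filter precedes model building."""
    rmse = lambda err: float(np.sqrt(np.mean(np.square(err))))
    return rmse(y_true - y_pred_without_filter) - rmse(y_true - y_pred_with_filter)
```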

References

[1] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artif. Intell., vol. 97, no. 1–2, pp. 273–324, 1997.

[2] V. Kumar and S. Minz, “Feature Selection: A Literature Review,” Smart Comput. Rev., vol. 4, no. 3, pp. 211–229, 2014.

[3] G. Chandrashekar and F. Sahin, “A survey on feature selection methods,” Comput. Electr. Eng., vol. 40, no. 1, pp. 16–28, 2014.

[4] I. Guyon and A. Elisseeff, “An Introduction to Variable and Feature Selection,” J. Mach. Learn. Res., vol. 3, pp. 1157–1182, 2003.