(103g) Subset Selection in Multiple Linear Regression
The ALAMO methodology was recently developed to address the problem of learning simple algebraic models from data obtained from simulations or experiments [1]. An important part of building these surrogate models is selecting the best subset from a large set of linear and nonlinear functions of the explanatory variables. The selected subset must balance model complexity against goodness of fit so that the model uncovers underlying physical relationships rather than overfitting the noise in the data.
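As a minimal illustration of this complexity-versus-fit trade-off (not the ALAMO algorithm itself), the sketch below exhaustively scores every column subset of a small design matrix with Akaike's information criterion, which penalizes the residual fit by the number of fitted coefficients; the function names and synthetic data are hypothetical.

```python
# Illustrative sketch: exhaustive best-subset selection that balances
# goodness of fit against model complexity via AIC. Hypothetical example,
# practical only for a small number of candidate basis functions.
from itertools import combinations
import numpy as np

def aic(y, y_hat, k):
    """AIC for a least-squares fit with k coefficients (up to a constant)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + 2 * k

def best_subset(X, y):
    """Return (best AIC, column indices) over all nonempty subsets of X."""
    n, p = X.shape
    best = (np.inf, ())
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            Xs = X[:, cols]
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            best = min(best, (aic(y, Xs @ beta, k), cols))
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.1 * rng.normal(size=100)  # only columns 0, 2 matter
score, cols = best_subset(X, y)
print(cols)
```

With a strong signal and small noise, the AIC-optimal subset contains the truly relevant columns; the exponential number of subsets is what makes the problem combinatorially hard at scale.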
The purpose of this paper is to present a systematic analysis of new and existing approaches to the subset selection problem encountered in ALAMO. The same subset selection problem arises naturally in a variety of applications and has long been studied in the machine learning and statistics literatures [2, 3]. Yet an effective solution approach remains elusive due to the problem's highly combinatorial and nonlinear nature. Common practice is to use greedy stepwise heuristics to produce a well-fitting subset of regression variables [2]. These heuristics typically rely on model fitness metrics, such as Akaike's information criterion (AIC) and Mallows' Cp, to define a stopping point. We compare these stepwise heuristics, exhaustive search algorithms [3], and direct optimization of newly proposed integer programming formulations under several different model selection criteria. For this purpose, we use a large test set with problems from a variety of applications.
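One greedy heuristic of the kind compared here can be sketched as forward stepwise selection that adds the regressor giving the largest AIC improvement and stops once AIC no longer improves; this is a hedged, generic sketch with hypothetical names and data, not the specific implementations evaluated in the paper.

```python
# Sketch of forward stepwise selection with an AIC stopping rule.
import numpy as np

def aic(y, y_hat, k):
    """AIC for a least-squares fit with k coefficients (up to a constant)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + 2 * k

def forward_stepwise(X, y):
    """Greedily add columns of X; stop when AIC stops improving."""
    n, p = X.shape
    selected, remaining = [], set(range(p))
    best_score = np.inf
    while remaining:
        # Score each candidate addition by refitting least squares.
        scored = []
        for j in remaining:
            cols = selected + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            scored.append((aic(y, X[:, cols] @ beta, len(cols)), j))
        score, j = min(scored)
        if score >= best_score:  # AIC no longer improves: stopping point
            break
        best_score, selected = score, selected + [j]
        remaining.remove(j)
    return selected

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 8))
y = 2 * X[:, 1] + X[:, 4] + 0.2 * rng.normal(size=80)  # columns 1, 4 matter
print(forward_stepwise(X, y))
```

Because each step commits to the locally best variable, such heuristics can miss the globally optimal subset, which is the motivation for the exhaustive search and integer programming alternatives studied in the paper.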
[1] Cozad, A., N. V. Sahinidis, and D. C. Miller (2014). Automatic learning of algebraic models for optimization. AIChE Journal, 60, 2211-2227.
[2] Miller, A. J. (1990). Subset Selection in Regression. London: Chapman and Hall.
[3] Furnival, G. M. and R. W. Wilson (1974). Regressions by leaps and bounds. Technometrics, 16, 499-511.