(103g) Subset Selection in Multiple Linear Regression
The ALAMO methodology was recently developed to address the problem of learning simple algebraic models from data obtained from simulations or experiments [1]. An important part of building these surrogate models is selecting the best subset from a large set of linear and nonlinear functions of the explanatory variables. The selected subset must balance model complexity against goodness of fit so that the model uncovers underlying physical relationships rather than overfitting the noise in the data.
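As a minimal illustration of this complexity-versus-fit trade-off (not the ALAMO algorithm itself), the sketch below exhaustively scores every column subset of a small design matrix with Akaike's information criterion, which penalizes the residual fit by the number of fitted coefficients; the function names and synthetic data are hypothetical.

```python
# Illustrative sketch: exhaustive best-subset selection that balances
# goodness of fit against model complexity via AIC. Hypothetical example,
# practical only for a small number of candidate basis functions.
from itertools import combinations
import numpy as np

def aic(y, y_hat, k):
    """AIC for a least-squares fit with k coefficients (up to a constant)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + 2 * k

def best_subset(X, y):
    """Return (best AIC, column indices) over all nonempty subsets of X."""
    n, p = X.shape
    best = (np.inf, ())
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            Xs = X[:, cols]
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            best = min(best, (aic(y, Xs @ beta, k), cols))
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.1 * rng.normal(size=100)  # only columns 0, 2 matter
score, cols = best_subset(X, y)
print(cols)
```

With a strong signal and small noise, the AIC-optimal subset contains the truly relevant columns; the exponential number of subsets is what makes the problem combinatorially hard at scale.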
The purpose of this paper is to present a systematic analysis of new and existing approaches to the subset selection problem encountered in ALAMO. The same subset selection problem arises naturally in a variety of applications and has long been studied in the machine learning and statistics literatures [2, 3]. Yet an effective solution approach remains elusive due to the problem's highly combinatorial and nonlinear nature. Common practice is to use greedy stepwise heuristics to produce a well-fitting subset of regression variables [2]. These heuristics typically rely on model fitness metrics, such as Akaike's information criterion (AIC) and Mallows' Cp, to define a stopping point. We compare these stepwise heuristics, exhaustive search algorithms [3], and direct optimization of newly proposed integer programming formulations under several different model selection criteria. For this purpose, we use a large test set with problems from a variety of applications.
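One greedy heuristic of the kind compared here can be sketched as forward stepwise selection that adds the regressor giving the largest AIC improvement and stops once AIC no longer improves; this is a hedged, generic sketch with hypothetical names and data, not the specific implementations evaluated in the paper.

```python
# Sketch of forward stepwise selection with an AIC stopping rule.
import numpy as np

def aic(y, y_hat, k):
    """AIC for a least-squares fit with k coefficients (up to a constant)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + 2 * k

def forward_stepwise(X, y):
    """Greedily add columns of X; stop when AIC stops improving."""
    n, p = X.shape
    selected, remaining = [], set(range(p))
    best_score = np.inf
    while remaining:
        # Score each candidate addition by refitting least squares.
        scored = []
        for j in remaining:
            cols = selected + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            scored.append((aic(y, X[:, cols] @ beta, len(cols)), j))
        score, j = min(scored)
        if score >= best_score:  # AIC no longer improves: stopping point
            break
        best_score, selected = score, selected + [j]
        remaining.remove(j)
    return selected

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 8))
y = 2 * X[:, 1] + X[:, 4] + 0.2 * rng.normal(size=80)  # columns 1, 4 matter
print(forward_stepwise(X, y))
```

Because each step commits to the locally best variable, such heuristics can miss the globally optimal subset, which is the motivation for the exhaustive search and integer programming alternatives studied in the paper.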
[1] Cozad, A., N. V. Sahinidis, and D. C. Miller (2014). Automatic learning of algebraic models for optimization. AIChE Journal, 60, 2211-2227.
[2] Miller, A. J. (1990). Subset Selection in Regression. London: Chapman and Hall.
[3] Furnival, G. M. and R. W. Wilson (1974). Regressions by leaps and bounds. Technometrics, 16, 499-511.