(126a) On the Value of Global Optimality in Machine Learning Problems: The Case of Feature Selection for Linear Support Vector Machines
2018 AIChE Annual Meeting
Computing and Systems Technology Division
Advances in Machine Learning and Intelligent Systems
Monday, October 29, 2018 - 12:30pm to 12:49pm
In this talk, we will present our recent results investigating the value of global optimality in the context of zero-norm feature selection for linear support vector machines (SVMs). Linear SVMs classify data points in a high-dimensional space into two or more classes using separating hyperplanes. The individual components of each data point are called features. For example, the features may represent expression levels for various genes in a cell, while the classes might represent different diseases. In this context, feature selection refers to the problem of selecting a specified number K of features (e.g., genes) from the full set such that classifying the data based on only those K features is as good as or better than classifying it based on any other choice of K features. This problem is motivated by the observation that classifiers based on fewer features often generalize better from the training data to new data sets, especially when the original data set contains many noisy features. Moreover, some interesting applications call for a very aggressive reduction in the number of features used for classification. For example, in the design of biosensors that aim to infer the levels of pollutants or toxins in their environment from the gene expression of immobilized microbes, it is impractical to measure more than a small number of genes (~5) accurately.
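For concreteness, one standard way to pose this zero-norm feature selection problem (a sketch only; the abstract does not specify the exact formulation used in this work) is as a cardinality-constrained soft-margin SVM with binary feature-indicator variables z_j and a big-M bound on each weight:

\begin{aligned}
\min_{w,\,b,\,\xi,\,z}\;\; & \tfrac{1}{2}\lVert w \rVert_2^2 + C \sum_{i=1}^{m} \xi_i \\
\text{s.t.}\;\; & y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \qquad i = 1, \dots, m, \\
& -M z_j \le w_j \le M z_j, \qquad j = 1, \dots, n, \\
& \sum_{j=1}^{n} z_j \le K, \qquad z \in \{0, 1\}^n,
\end{aligned}

where z_j = 1 indicates that feature j is retained, K is the allowed number of features, C is the usual soft-margin penalty, and M is a sufficiently large constant (the big-M device and the soft-margin penalty are assumptions of this sketch). With z binary and a quadratic margin objective, this yields a mixed-integer quadratic program of the kind discussed below.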
In general, zero-norm feature selection is a very large-scale mixed-integer quadratic program (MIQP) that is difficult to solve to global optimality. However, modern MIQP solvers are capable of solving such problems for data sets of modest size. Using several standard data sets in this size range from various application areas, we tested the classification accuracy of SVMs with fixed numbers of features obtained by solving the zero-norm feature selection problem to guaranteed global optimality, as well as by a variety of more efficient methods that are in common use but furnish only approximate solutions. Moreover, to test the correlation between suboptimality and classification accuracy more directly, we also developed a rigorous technique for computing a sequence of progressively more suboptimal SVMs for each test problem in a controlled manner, ranging from globally optimal solutions to solutions with objective values that deviate from global optimality by up to 25%. For all of the SVMs obtained, we then evaluated the classification error on new data using a standard five-fold cross-validation procedure.
Interestingly, our results indicate that, when a moderate to large fraction of features is allowed to be retained (>15%), there is very little correlation between suboptimality and classification error. In other words, SVMs that achieved objective values within 10-20% of global optimality during training performed, on average, just as well as globally optimal solutions when classifying new data. In contrast, when only a small fraction of features is allowed to be retained (~5%), as in the biosensor example mentioned above, a very clear positive correlation emerges between suboptimality during training and degraded classification performance on new data sets. Moreover, in comparisons with SVMs obtained by existing state-of-the-art feature selection algorithms, the SVMs obtained by rigorously solving the feature selection problem to global optimality were not substantially more accurate at high feature selection levels, but showed significant advantages at low feature selection levels (~5%). In this talk, we will present the data supporting these conclusions for several test cases, offer hypotheses for these surprising trends along with supporting evidence, and discuss the implications for the usefulness and practicality of rigorous global optimization in feature selection.
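As a purely illustrative sketch of the evaluation step described above (not the authors' actual pipeline), the following Python snippet scores a linear SVM restricted to a chosen feature subset using five-fold cross-validation with scikit-learn; the data set, the selected feature indices, and the penalty C=1.0 are placeholder assumptions:

# Sketch: five-fold cross-validation of a linear SVM restricted to a
# fixed feature subset. `selected_features` is a hypothetical stand-in
# for the K indices returned by a feature selection step (e.g., an MIQP
# solution); it is not taken from the work described in this abstract.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)  # a standard benchmark data set

# Hypothetical output of the feature selection step: keep K = 5 features.
selected_features = np.array([0, 7, 13, 21, 27])

clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, dual=False))
scores = cross_val_score(clf, X[:, selected_features], y, cv=5)

print("Five-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))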