(504a) Coefficient-Setups for Subset Selection in Model Structure Identification Problems Using MILP/MIQP

Franzoi, R. E., University of São Paulo
Menezes, B. C., Hamad Bin Khalifa University, Qatar Foundation
Kelly, J. D., Industrial Algorithms
Process transformations inside the unit-operations of both batch and continuous processes can be modeled as black-box (or grey-box) models in which we assign a binary setup variable to each continuous coefficient variable. These setups allow us to address what is known in artificial intelligence, machine learning and data science as “subset selection” of coefficients or parameters. Subset selection determines the black-box model structure as an identification problem, as opposed to an estimation problem in which the model structure is exogenous or given. For more on machine learning in chemical engineering, see Shuang et al. [1].

In the proposed coefficient-setup formulation, the trial models must be linear with respect to all of the coefficients or parameters found in the full model. The subset selection identification problem is then a discrete, linear and dynamic regression problem, for which this work proposes an automated and systematic branch-and-bound search via either MILP (1-norm performance weights) or MIQP (2-norm performance weights). The automated search determines which of the potential coefficients (or, equivalently, the trial model terms) are key or significant in explaining most of the variance or uncertainty, as measured by the regression or performance objective function. That objective is the sum of absolute (1-norm) or squared (2-norm) residuals, although other objective functions, such as Mallows’ Cp [2], may be configured.
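As a sketch of the regression problem just described, the trial model and the two performance objectives can be written as follows. The notation is an assumption made for illustration and is not taken from the authors: t indexes the n observations, j indexes the p trial model terms, \phi_j are the chosen basis functions of the inputs x_t, \beta_j are the coefficients and e_t are the residuals.

```latex
\begin{align*}
y_t &= \sum_{j=1}^{p} \beta_j \, \phi_j(x_t) + e_t, && t = 1, \dots, n \\
\text{1-norm (MILP):} \quad & \min_{\beta} \ \sum_{t=1}^{n} \lvert e_t \rvert \\
\text{2-norm (MIQP):} \quad & \min_{\beta} \ \sum_{t=1}^{n} e_t^{2}
\end{align*}
```

The 1-norm objective stays linear under the usual split e_t = e_t^{+} - e_t^{-} with e_t^{+}, e_t^{-} \ge 0, so that the objective becomes the sum of e_t^{+} + e_t^{-}. The model is linear in the \beta_j even when the \phi_j are nonlinear in the inputs, which is what keeps the problem within MILP or MIQP.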

To configure the subset selection problem, each coefficient is given a setup, which we refer to as a coefficient-setup. Coefficient-setups are binary logic variables whose lower and upper bounds can each be zero (0) or one (1). When a coefficient-setup exists for a coefficient, semi-continuous constraints are created for that coefficient variable, so that the coefficient is forced to zero whenever its setup is zero. The coefficient-setups may also be optimized in the MILP / MIQP using an objective function weight. The number of coefficients and coefficient-setups that are active (enabled, on, open, up, etc.) for any given black-box unit-operation is limited by lower and upper bounds on the number of setups. These form arity or cardinality bounds on the total number of coefficients and are well known to enforce some degree of parsimony, or Occam’s razor, in the identification problem solution. The primary goal of any model structure identification methodology is to provide the best causal fit of the data with the fewest parameters in order to minimize the potential for over-fitting and under-fitting the data. To impose an order, precedence, sequence or rank on one coefficient / coefficient-setup relative to another, sequence-dependency or precedence-ordering constraints may be added. This is especially useful for dynamic (time-series) regression problems, which require a certain temporal or lagged ranking of the coefficients.
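The setups, semi-continuous constraints, cardinality bounds and precedence constraints described above can be sketched as a small 1-norm MILP. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses SciPy's `scipy.optimize.milp` solver, models the semi-continuous constraints with a big-M bound, and the function name `select_coefficients_l1`, the parameter `big_m` and the `precedence` pair convention are all hypothetical.

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp


def select_coefficients_l1(X, y, k_min, k_max, big_m=100.0, precedence=()):
    """1-norm subset selection with one binary setup per coefficient.

    Variable layout: [beta (p) | z (p) | e_plus (n) | e_minus (n)].
    big_m is an assumed bound on |beta_j|; it must exceed any plausible
    coefficient magnitude. precedence holds pairs (i, j) meaning that
    coefficient j may be switched on only if coefficient i is on.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    nvar = 2 * p + 2 * n
    pad = np.zeros((p, 2 * n))

    # Objective: sum of absolute residuals, written as e_plus + e_minus.
    cost = np.concatenate([np.zeros(2 * p), np.ones(2 * n)])

    # Fit rows: X @ beta + e_plus - e_minus == y.
    a_fit = np.hstack([X, np.zeros((n, p)), np.eye(n), -np.eye(n)])
    constraints = [LinearConstraint(a_fit, y, y)]

    # Semi-continuous rows: -big_m * z_j <= beta_j <= big_m * z_j.
    a_up = np.hstack([np.eye(p), -big_m * np.eye(p), pad])
    a_lo = np.hstack([np.eye(p), big_m * np.eye(p), pad])
    constraints.append(LinearConstraint(a_up, np.full(p, -np.inf), np.zeros(p)))
    constraints.append(LinearConstraint(a_lo, np.zeros(p), np.full(p, np.inf)))

    # Cardinality row: k_min <= sum(z) <= k_max.
    count = np.concatenate([np.zeros(p), np.ones(p), np.zeros(2 * n)])
    constraints.append(LinearConstraint(count[None, :], [k_min], [k_max]))

    # Precedence rows: z_j - z_i <= 0.
    for i, j in precedence:
        row = np.zeros(nvar)
        row[p + j] = 1.0
        row[p + i] = -1.0
        constraints.append(LinearConstraint(row[None, :], [-np.inf], [0.0]))

    # Only the setups z are integer; their bounds [0, 1] make them binary.
    integrality = np.concatenate([np.zeros(p), np.ones(p), np.zeros(2 * n)])
    bounds = Bounds(
        np.concatenate([np.full(p, -big_m), np.zeros(p + 2 * n)]),
        np.concatenate([np.full(p, big_m), np.ones(p), np.full(2 * n, np.inf)]),
    )

    result = milp(cost, constraints=constraints,
                  integrality=integrality, bounds=bounds)
    if not result.success:
        raise RuntimeError(result.message)

    setups = np.round(result.x[p:2 * p]).astype(int)
    beta = result.x[:p] * setups  # zero out coefficients whose setup is off
    return beta, setups, result.fun
```

Calling `select_coefficients_l1(X, y, k_min=0, k_max=2)` with a regressor matrix `X` whose columns are the trial model terms returns the fitted coefficients, the 0/1 setups and the residual sum of absolute errors. For a lagged time-series model, passing `precedence=[(0, 1), (1, 2)]` would allow lag 2 only when lag 1 is on, and lag 3 only when lag 2 is on. A 2-norm (MIQP) variant would need a solver that accepts a quadratic objective, which `milp` does not.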

Finally, we emphasize that industrial identification and subset selection problems are salient to the continuous and batch process industries. Many sophisticated processes and sub-processes are not well understood from a white-box, first-principles, mechanistic, chemistry- or physics-based perspective. This is especially true of transformation or reaction types of unit-operations such as converters, crackers, reformers, digesters, treaters, etc., with complex reactants, thermodynamics, catalysts, equilibria and kinetics. By modeling a superset of algebraic and empirical engineering-based trial models, such as polynomial and rational functions (quotients of polynomials) or any other potential basis function, the MILP / MIQP can perform the automated search to find the best subset of coefficients relating to each trial model term in the problem’s algebraic formulation. The output (dependent, response, regressand) phenomenological variables are typically related to product yields and properties. The input (independent, explanatory, regressor) variables are usually feed flows, densities, components and properties, as well as unit-operation conditions such as temperatures, pressures, severities, space-velocities, residence-times, catalyst activities, etc. Ultimately, the resulting reduced-order model may be used as a nonlinear surrogate, digital twin or cyber-physical virtual representation of the actual, physical or real system that can be optimized, estimated or just monitored in off-line or on-line environments.

[1] Shuang B, Mehta A, Marshall K, Zhang T, Sahinidis NV. (2019). Explore the Potential of Machine Learning in Building Reaction Models for Chemical Industry. In AIChE Annual Meeting. Pittsburgh, PA, United States.

[2] Miyashiro R, Takano Y. (2014). Subset Selection by Mallows’ Cp: A Mixed Integer Programming Approach.