(83c) Empirical Models for Analyzing BIG Data. What's the Difference? | AIChE

MacGregor, J. - Presenter, ProSensus Inc.
Many issues around “BIG Data” (e.g., collection, warehousing, integration, cloud services) are infrastructure issues that must be resolved before the data can be used to extract actionable information. If the data analysis step is not performed well, most of the infrastructure effort will have been wasted. To analyze historical data, one needs models, usually empirical ones such as regression, data mining (deep learning neural networks, decision trees, etc.) or latent variable models. My PhD supervisor in Statistics (G.E.P. Box) often said, “All models are wrong, but some are useful”. The problem is that most people lump all empirical models into one category, “empirical models”, as though these models were interchangeable irrespective of the nature of the data or the objectives of the problem.

Whether a model is useful, however, depends upon three factors:

  1. The objectives of the model
  2. The nature of the data used for the modeling
  3. The regression method used to build the model

In terms of objectives, there are two major classes of models: those intended for passive use and those intended for active use. Models for passive use are meant only to observe the process in the future. Such passive applications include classification, inferentials or soft sensors, and process monitoring (MSPC). For passive uses one does not need or even want causal models; one wants only to model the normal variations common to the operating process. Historical data are ideal for building such models. Models for active use are meant to alter the process. Such active applications include using the models to optimize or control the process, to trouble-shoot process problems, or to gain causal information from the data. For active use one needs causal models. Causality implies that, for any active change in the adjustable or manipulated variables of the process, the model will reliably predict the resulting change in the output of interest.

The problem is that most of the data available in industry are historical operating data, and such data almost never contain causal information on individual variables. This poses a major problem if we want causal models. The question then becomes: what analysis methods are useful for obtaining active models from historical operating data?

Machine Learning (ML) methods are currently the rage in “BIG Data” communities. These include deep learning neural networks and massive decision tree ensembles. These ML approaches are improvements on the older “shallow” neural networks (a few layers connecting all variables to all nodes) and single large decision trees, both of which led to overfitting of the data and to large variances in the predictions. The newer ML approaches are aimed at overcoming some of these deficiencies. Deep learning NNs use many simplified layers, regularization and averaging to reduce the effective number of parameters and the overfitting. Newer decision tree methods build many trees, each based on a smaller set of randomly selected variables, and then average or vote on the results to reduce the variance and bias of the results.
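
The variance-reduction effect of averaging many models can be illustrated with a small simulation. The sketch below is not a random forest: as stand-in assumptions it uses a 1-nearest-neighbour predictor as the high-variance base model, bootstrap resampling in place of random variable selection, and synthetic data (a noisy sine curve). All names and settings are hypothetical. It compares how much the prediction at one point changes from one training set to the next, for a single model and for an average of 50 models.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_nn_predict(x_train, y_train, x0):
    """High-variance base model: return the y of the training point nearest to x0."""
    return y_train[np.argmin(np.abs(x_train - x0))]

def bagged_predict(x_train, y_train, x0, n_models, rng):
    """Average the base model over bootstrap resamples of the training set."""
    n = len(x_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        preds.append(one_nn_predict(x_train[idx], y_train[idx], x0))
    return float(np.mean(preds))

x0 = 1.0  # point at which predictions are compared
single, bagged = [], []
for _ in range(300):  # 300 independent synthetic training sets
    x = rng.uniform(0.0, 2.0 * np.pi, size=40)
    y = np.sin(x) + rng.normal(0.0, 0.5, size=40)
    single.append(one_nn_predict(x, y, x0))
    bagged.append(bagged_predict(x, y, x0, 50, rng))

# Spread of the prediction across training sets: single model vs. averaged models
var_single = float(np.var(single))
var_bagged = float(np.var(bagged))
print(f"variance, single model: {var_single:.3f}")
print(f"variance, averaged models: {var_bagged:.3f}")
```

The averaged predictor should show the smaller variance. The sketch says nothing about interpretability or causality, which is the concern of the paragraphs that follow.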

These newer ML methods can be very good for passive uses. But they cannot be used for extracting interpretable or causal models from historical data for active use. With historical data, an infinite number of models can arise from any of these machine learning methods, all of which might provide good predictions of the outputs, but none of which is unique or causal. This does not allow for meaningful interpretation, even more so if the results come from averaging or voting on many (e.g., often 1000 or more) models.
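
A toy illustration of this non-uniqueness, under assumptions that are not in the original: suppose that in the historical record a second variable always tracked the first exactly (x2 = 2·x1) and that the output happened to equal 3·x1. The three hypothetical linear models below all reproduce the historical outputs exactly, yet they disagree completely about what happens if x1 alone is raised by one unit.

```python
# Hypothetical historical record: x2 always moved with x1 (x2 = 2 * x1),
# and the observed output equalled 3 * x1. These values are illustrative only.
history = [(x1, 2.0 * x1) for x1 in [1.0, 2.0, 3.0, 4.0, 5.0]]
y_hist = [3.0 * x1 for x1, _ in history]

# Three candidate models with very different coefficients on (x1, x2)
models = {
    "A": lambda x1, x2: 3.0 * x1 + 0.0 * x2,
    "B": lambda x1, x2: 1.0 * x1 + 1.0 * x2,
    "C": lambda x1, x2: -1.0 * x1 + 2.0 * x2,
}

# Passive use: worst-case prediction error of each model on the historical data
fit_errors = {
    name: max(abs(m(x1, x2) - y) for (x1, x2), y in zip(history, y_hist))
    for name, m in models.items()
}

# Active use: predicted effect of raising x1 by one unit while holding x2 fixed,
# a move that never occurred in the historical record
x1_0, x2_0 = history[2]
predicted_effects = {
    name: m(x1_0 + 1.0, x2_0) - m(x1_0, x2_0) for name, m in models.items()
}

print(fit_errors)
print(predicted_effects)
```

All three models are equally good for passive prediction as long as x2 keeps tracking x1, but the historical data give no basis for choosing among their predictions of the effect of moving x1 by itself.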

Nevertheless, these ML models have proven to be very powerful in passive applications, e.g., deep learning NNs for image analysis and Random Forests for medical diagnosis, both of which are passive applications (where there is no interest in altering the image or the patient).

Latent Variable models such as PLS (Partial Least Squares or Projection to Latent Structures) were developed specifically to handle “BIG Data” where the real number of things affecting the process is much smaller than the number of measured variables. Typically, the number of latent variables needed to extract usable information from hundreds of process variables is on the order of 3 to 10, implying that the true number of degrees of freedom affecting the process is often quite small.
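
The idea that a handful of latent variables can account for hundreds of measured variables can be sketched with synthetic data. The example below is an assumption-laden illustration, not an industrial data set: 100 measured variables are generated from 3 hidden driving forces plus a small amount of noise, and a principal component decomposition (via the singular value decomposition, used here instead of PLS for brevity) counts how many components are needed to capture 99% of the variance. The 99% threshold and all sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_vars, n_latent = 500, 100, 3  # hypothetical sizes

# Three hidden driving forces generate all 100 measured variables
scores = rng.normal(size=(n_obs, n_latent))
loadings = rng.normal(size=(n_latent, n_vars))
X = scores @ loadings + 0.01 * rng.normal(size=(n_obs, n_vars))

# Principal component analysis of the mean-centered data via the SVD
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative variance reaches 99%
n_needed = int(np.searchsorted(cumulative, 0.99) + 1)
print(n_needed)
```

With these settings the count recovers the three driving forces, even though 100 variables were measured.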

These latent variable models, such as PLS or PCR (Principal Component Regression), are a total break from classical statistical regression or machine learning models in that they assume that the input or regressor (X) space and the output (Y) space are not of full statistical rank, and so they provide models for both the X and Y spaces rather than just the Y space. Without simultaneous models for both the X and Y spaces, there can be no model uniqueness nor any model interpretability or causality when using historical data. LV models do provide causality, but only in the reduced-dimension space of the latent variables, as this is the only space within which the historical process has varied. By moving the latent variables one can reliably predict the outputs (Y), implying causality for the latent variables. However, to move the LVs one cannot just adjust individual x variables; rather, one must adjust the combinations of the X variables that define the LVs, as specified by the X space model.
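
A minimal sketch of what “models for both the X and Y spaces” means in practice, assuming a single response, synthetic data driven by three latent variables, and the standard NIPALS algorithm for PLS. This is an illustration only, not ProSensus's implementation, and all names and sizes are hypothetical. The fitted model gives scores T = X R, an X-space model X ≈ T Pᵀ and a Y-space model y ≈ T q. A move of one unit along the first latent variable then corresponds to a specific combination of changes in all x variables (the first column of P), whereas changing a single x variable alone falls largely outside the X-space model.

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """NIPALS PLS for one response; X and y must already be mean-centered."""
    E, f = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w /= np.linalg.norm(w)         # weight vector for this component
        t = E @ w                      # score vector
        p = E.T @ t / (t @ t)          # X-space loading
        q_a = (f @ t) / (t @ t)        # Y-space loading
        E = E - np.outer(t, p)         # deflate X
        f = f - q_a * t                # deflate y
        W.append(w); P.append(p); q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    R = W @ np.linalg.inv(P.T @ W)     # maps original X directly to scores: T = X R
    return P, q, R

# Synthetic "historical" data: 20 measured variables driven by 3 latent variables
rng = np.random.default_rng(2)
n_obs, n_vars, n_latent = 300, 20, 3
T_true = rng.normal(size=(n_obs, n_latent))
X = T_true @ rng.normal(size=(n_latent, n_vars)) + 0.01 * rng.normal(size=(n_obs, n_vars))
y = T_true @ np.array([1.0, -0.5, 0.25]) + 0.05 * rng.normal(size=n_obs)
Xc, yc = X - X.mean(axis=0), y - y.mean()

P, q, R = pls1_nipals(Xc, yc, n_components=3)
T = Xc @ R
x_fit = 1.0 - np.sum((Xc - T @ P.T) ** 2) / np.sum(Xc**2)  # fraction of X explained
y_fit = 1.0 - np.sum((yc - T @ q) ** 2) / np.sum(yc**2)    # fraction of y explained

# Move one unit along latent variable 1: the required change in ALL x variables
dt = np.array([1.0, 0.0, 0.0])
dx = P @ dt               # combination of x changes defined by the X-space model
dy = float(q @ dt)        # predicted change in y
dt_back = R.T @ dx        # projecting dx back onto the model recovers dt

# Changing only the first x variable: the part not representable by the X-space model
dx_single = np.zeros(n_vars)
dx_single[0] = 1.0
off_model = float(np.linalg.norm(dx_single - P @ (R.T @ dx_single)))

print(f"X explained: {x_fit:.3f}, y explained: {y_fit:.3f}")
print(f"predicted dy for a unit move along LV 1: {dy:.3f}")
print(f"off-model distance for a single-variable move: {off_model:.3f}")
```

The predicted change `dy` is supported by the data because the move `dx` lies in directions the historical process actually explored. The single-variable move has a large off-model component, so a prediction for it would be an extrapolation outside the space in which the model was identified.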

ProSensus has used this uniqueness and causality provided by LV models for many years to trouble-shoot and understand processes and to optimize both processes and products based on actionable information extracted from “BIG” historical data.