(241e) Interactive Software for Teaching Multivariable Data Analytics | AIChE

Authors 

Braatz, R. D. - Presenter, Massachusetts Institute of Technology
Schaeffer, J., TU Darmstadt
Latent variable methods such as partial least squares and principal component analysis are among the most widely applied data analytics tools for chemical and biological processes. These multivariable statistical methods are challenging to teach in a way that gives students enough conceptual understanding to apply them consistently and effectively in practice, especially because many chemical engineering programs do not require their students to take courses in linear algebra or applied statistics. As a result, many practicing chemical engineers use alternative methods instead, such as correlating a single peak absorbance to each concentration when calibrating a spectral measurement, which can produce much less accurate models. Alternatively, latent variable methods are often used in a black-box manner, without enough understanding to recognize situations in which the methods should not be applied, or to know how to revise the way that data are fed to the method to work around an imperfection in the data such as sensor bias.
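The calibration contrast described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the software described in this abstract: the synthetic two-component spectra with overlapping Gaussian bands, the noise level, the function names (make_spectra, pls1_fit, compare_calibrations), and the use of a textbook NIPALS-style PLS1 algorithm with two latent variables.

```python
import numpy as np


def make_spectra(n, rng, noise=0.01, n_channels=100):
    """Synthetic mixture spectra (illustrative assumption): two overlapping bands."""
    ch = np.arange(n_channels)
    analyte = np.exp(-0.5 * ((ch - 40) / 8.0) ** 2)      # band of the analyte
    interferent = np.exp(-0.5 * ((ch - 50) / 8.0) ** 2)  # overlapping band
    conc = rng.uniform(0.0, 1.0, size=(n, 2))            # independent concentrations
    X = conc @ np.vstack([analyte, interferent])
    X = X + noise * rng.standard_normal((n, n_channels))
    return X, conc[:, 0]                                 # calibrate for the analyte only


def pls1_fit(X, y, n_components):
    """Textbook PLS1 (NIPALS with deflation); returns coefficients and intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr                  # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = Xr @ w                     # scores
        tt = t @ t
        p = Xr.T @ t / tt              # X loadings
        qk = yr @ t / tt               # y loading
        Xr = Xr - np.outer(t, p)       # deflate X
        yr = yr - qk * t               # deflate y
        W.append(w)
        P.append(p)
        q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, y_mean - x_mean @ beta


def compare_calibrations(seed=0):
    """Test-set RMSE of a single-peak calibration and of a two-component PLS model."""
    rng = np.random.default_rng(seed)
    X_train, y_train = make_spectra(40, rng)
    X_test, y_test = make_spectra(200, rng)

    # Univariate calibration: straight line through the absorbance at the analyte peak.
    slope, intercept = np.polyfit(X_train[:, 40], y_train, 1)
    pred_peak = slope * X_test[:, 40] + intercept
    rmse_peak = float(np.sqrt(np.mean((pred_peak - y_test) ** 2)))

    # Multivariate calibration: PLS1 using the whole spectrum.
    beta, b0 = pls1_fit(X_train, y_train, n_components=2)
    pred_pls = X_test @ beta + b0
    rmse_pls = float(np.sqrt(np.mean((pred_pls - y_test) ** 2)))
    return rmse_peak, rmse_pls


if __name__ == "__main__":
    rmse_peak, rmse_pls = compare_calibrations()
    print(f"single-peak RMSE: {rmse_peak:.4f}")
    print(f"PLS (2 LVs) RMSE: {rmse_pls:.4f}")
```

In this constructed case the single-peak model cannot separate the analyte from the overlapping interferent, whereas the full-spectrum model can; how large the gap is on real data depends on the degree of overlap and on the noise.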

The mathematical details of latent variable methods can be challenging to understand, especially for students who have not had training in linear algebra or applied statistics. Students therefore tend to treat each method as a black box. After all, many software packages are available for applying latent variable methods, and it might seem that a deeper understanding is not needed for problem-solving. Moreover, multiple methods are often applied to the problem at hand, and the choice of algorithm is made solely on the basis of preliminary results. This approach makes it difficult to explain and interpret the models that are created by these methods, in particular to relate the models to the chemistry or biology occurring in the process and so reconcile the data analytics results with domain knowledge. It is the synthesis of domain knowledge with data analytics that is the added value of a well-trained chemical engineer, and this synthesis is likely to result in the best chemical engineering solutions for the particular problem at hand. Furthermore, preliminary black-box results can lead to the choice of overly complicated models that overfit the data. A better understanding of, and intuition for, latent variable methods is needed to avoid overfitting for some types of biased data and ultimately to ensure model interpretability, leading to higher value, acceptance, and applicability.

This presentation describes software and examples that were developed to train students to achieve a deep understanding of latent variable methods [1],[2] and the related machine learning methods of lasso [3] and elastic net [4] (for the remainder of this abstract, these methods are collectively referred to as latent variable methods, although some of them are more closely associated with the machine learning community). The graphical user interface was designed for the explicit purpose of teaching undergraduate and graduate students, which distinguishes it from the graphical user interfaces in existing chemometrics software packages, which focus on just applying a method to a dataset. The software takes the perspective of the optimization being solved, so that students can gain an understanding of the relationship between the latent variable method that is selected and the results that are produced.
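For orientation, the optimization problems usually associated with these methods can be written in a common notation. The abstract does not state which formulations the software uses, so the following are the standard textbook forms and the notation is an assumption: X is the mean-centered data matrix, y the mean-centered response, w a weight (loading) vector, beta the regression coefficient vector, and lambda, lambda_1, lambda_2 nonnegative tuning parameters.

```latex
\begin{align*}
\text{PCA (first loading):} \quad
  & \max_{\|w\|_2 = 1} \; w^\top X^\top X \, w \\
\text{PLS, single response (first weight):} \quad
  & \max_{\|w\|_2 = 1} \; \left( w^\top X^\top y \right)^2 \\
\text{Lasso:} \quad
  & \min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \\
\text{Elastic net:} \quad
  & \min_{\beta} \; \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2
\end{align*}
```

Seen this way, PCA looks for directions of maximum variance in X alone, PLS looks for directions of maximum covariance between X and y, and lasso and elastic net fit y directly while penalizing the size of the coefficients.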

This tool, referred to as Latent Variable Demonstrator (LAVADE), compares a wide range of latent variable regression techniques with traditional regression techniques on carefully designed examples. The examples are designed to be easy to understand, and various options for customizing the problem are available so that students can learn exactly how the different algorithms approach model construction. Perturbing the signal step by step with increasing noise fosters an understanding of how the different methods deal with noise. Furthermore, the tool allows the student to play and compete with the algorithms, making it exciting to gather the knowledge and intuition needed to explain the algorithms’ behavior on real-world problems.
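The idea of perturbing a signal with increasing noise and watching how different methods respond can be sketched outside the tool as follows. This is not LAVADE code: the synthetic single-band data, the noise levels, the function names, and the choice of methods (minimum-norm ordinary least squares as the traditional technique, one-component principal component regression as the latent variable technique) are all assumptions made for illustration.

```python
import numpy as np


def make_data(n, noise, rng, n_channels=60):
    """Synthetic signals (illustrative assumption): one band scaled by concentration."""
    channels = np.arange(n_channels)
    band = np.exp(-0.5 * ((channels - 30) / 8.0) ** 2)
    conc = rng.uniform(0.0, 1.0, size=n)
    X = np.outer(conc, band) + noise * rng.standard_normal((n, n_channels))
    return X, conc


def fit_ols(X, y):
    """Ordinary least squares; lstsq gives the minimum-norm solution when n < p."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    beta, *_ = np.linalg.lstsq(X - x_mean, y - y_mean, rcond=None)
    return beta, y_mean - x_mean @ beta


def fit_pcr(X, y, k=1):
    """Principal component regression on the first k principal components."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    U, S, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    beta = Vt[:k].T @ ((U[:, :k].T @ (y - y_mean)) / S[:k])
    return beta, y_mean - x_mean @ beta


def rmse(model, X, y):
    """Root-mean-square prediction error of a (coefficients, intercept) model."""
    beta, intercept = model
    return float(np.sqrt(np.mean((X @ beta + intercept - y) ** 2)))


def noise_sweep(noise_levels=(0.01, 0.05, 0.2), seed=0):
    """Return (noise, test RMSE of OLS, test RMSE of PCR) for each noise level."""
    rng = np.random.default_rng(seed)
    rows = []
    for noise in noise_levels:
        X_train, y_train = make_data(30, noise, rng)   # fewer samples than channels
        X_test, y_test = make_data(200, noise, rng)
        rows.append((noise,
                     rmse(fit_ols(X_train, y_train), X_test, y_test),
                     rmse(fit_pcr(X_train, y_train), X_test, y_test)))
    return rows


if __name__ == "__main__":
    print("noise   OLS RMSE   PCR RMSE")
    for noise, e_ols, e_pcr in noise_sweep():
        print(f"{noise:5.2f}   {e_ols:8.4f}   {e_pcr:8.4f}")
```

Changing the noise levels, the number of training samples, or the number of retained components and rerunning the sweep is a small-scale version of the step-by-step experimentation that the paragraph above describes.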

References:

  1. Leo H. Chiang, Evan L. Russell, and Richard D. Braatz. Fault Detection and Diagnosis in Industrial Systems. London, UK: Springer-Verlag, 2000.

  2. K.V. Mardia. Multivariate Analysis. London, UK: Academic Press, 2003.

  3. Robert Tibshirani. “Regression Shrinkage and Selection Via the Lasso.” Journal of the Royal Statistical Society: Series B (Methodological), 58(1), pp. 267–288, 1996.

  4. Hui Zou and Trevor Hastie. “Regularization and Variable Selection Via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), pp. 301–320, 2005.

