Big Data: Success Stories in the Process Industries | AIChE

Special Section

Big data holds much potential for optimizing and improving processes. See how it has already been used in a range of industries, from pharmaceuticals to pulp and paper.

Big data in the process industries has many of the characteristics represented by the four Vs — volume, variety, veracity, and velocity. However, process data can be distinguished from big data in other industries by the complexity of the questions we are trying to answer with process data. Not only do we want to find and interpret patterns in the data and use them for predictive purposes, but we also want to extract meaningful relationships that can be used to improve and optimize a process.

Process data are also often characterized by the presence of large numbers of variables from different sources, something that is generally much more difficult to handle than just large numbers of observations. Because of the multisource nature of process data, engineers conducting a process investigation must work closely with the IT department that provides the necessary infrastructure to put these data sets together in a contextually correct way.

This article presents several success stories from different industries where big data has been used to answer complex questions. Because most of these studies involve the use of latent variable (LV) methods such as principal component analysis (PCA) (1) and projection to latent structures (PLS) (2, 3), the article first provides a brief overview of those methods and explains the reasons such methods are particularly suitable for big data analysis.

Latent variable methods

Historical process data generally consist of measurements of many highly correlated variables (often hundreds to thousands), but the true statistical rank of the process, i.e., the number of underlying significant dimensions in which the process is actually moving, is often very small (about two to ten). This situation arises because only a few dominant events are driving the process under normal operations (e.g., raw material variations, environmental effects). In addition, more sophisticated online analyzers such as spectrometers and imaging systems are being used to generate large numbers of highly correlated measurements on each sample, which also require lower-rank models.
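The low statistical rank described above can be illustrated with a small synthetic sketch (the data, sizes, and noise level below are hypothetical, not from any study in this article): fifty highly correlated measured variables that are in fact driven by only three underlying events, whose singular values reveal the true dimensionality.

```python
import numpy as np

# Hypothetical illustration: 500 observations of 50 process variables that
# are actually driven by only 3 underlying events (latent factors) plus noise.
rng = np.random.default_rng(0)
n_obs, n_vars, n_factors = 500, 50, 3

T = rng.normal(size=(n_obs, n_factors))              # latent driving forces
P = rng.normal(size=(n_factors, n_vars))             # how each variable responds to them
X = T @ P + 0.05 * rng.normal(size=(n_obs, n_vars))  # measured data = signal + noise

# The singular values of the mean-centered data expose the true statistical
# rank: only about 3 are large, despite the 50 measured variables.
s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
explained = np.cumsum(s**2) / np.sum(s**2)
n_significant = int(np.searchsorted(explained, 0.99)) + 1
print(n_significant)  # far fewer than 50
```

The same effect appears with spectrometers and imaging systems: thousands of wavelengths or pixels, but only a handful of independent underlying dimensions.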

Latent variable methods are uniquely suited for the analysis and interpretation of such data because they are based on the critical assumption that the data sets are of low statistical rank. They provide low-dimensional latent variable models that capture the lower-rank spaces of the process variable (X) and the response (Y) data without over-fitting the data. This low-dimensional space is defined by a small number of statistically significant latent variables (t1, t2, …), which are linear combinations of the measured variables. Such variables can be used to construct simple score and loading plots, which provide a way to visualize and interpret the data.

The scores can be thought of as scaled weighted averages of the original variables, using the loadings as the weights for calculating the weighted averages. A score plot is a graph of the data in the latent variable space. The loadings are the coefficients that reveal the groups of original variables that belong to the same latent variable, with one loading vector (W*) for each latent variable. A loading plot provides a graphical representation of the clustering of variables, revealing the identified correlations among them.
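As a minimal sketch of this relationship (using synthetic, illustrative data rather than anything from the article), the scores can be computed directly as weighted sums of the mean-centered variables, with the loading vectors supplying the weights:

```python
import numpy as np

# Illustrative sketch: obtain PCA loadings and scores by singular value
# decomposition, and note that each score is a weighted combination of the
# mean-centered variables, with the loadings as the weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 12))  # rank-4 data, 12 variables

Xc = X - X.mean(axis=0)                  # mean-center before PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

P = Vt[:2].T                             # loadings for the first two components
T = Xc @ P                               # scores t1, t2: weighted sums of variables

# A score plot graphs T[:, 0] vs. T[:, 1] (observations in latent space);
# a loading plot graphs P[:, 0] vs. P[:, 1] (clustering of variables).
assert np.allclose(T, U[:, :2] * s[:2])  # same scores from the SVD factors
```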

Latent variable models are unique in that they simultaneously model the low-dimensional X and Y spaces, whereas classical regression methods assume independent variation in all X and Y variables (i.e., that the data are full rank). Latent variable models show the relationships between combinations of variables and changes in operating conditions — thereby allowing us to gain insight and optimize processes based on such historical data.

The remainder of the article presents several industrial applications of big data for:

  • the analysis and interpretation of historical data and troubleshooting process problems
  • optimizing processes and product performance
  • monitoring and controlling processes
  • integrating data from multivariate online analyzers and imaging sensors.

Learning from process data

A data set containing about 200,000 measurements was collected from a batch process for drying an agrochemical material — the final step in the manufacturing process. The unit is used to evaporate and collect the solvent contained in the initial charge and to dry the product to a target residual solvent level.

The objective was to determine the operating conditions responsible for the overall low yields when off-specification product is rejected. The problem is highly complex because it requires the analysis of 11 initial raw material conditions, 10 time trajectories of process variables (trends in the evolution of process variables), and the impact of the process variables on 11 physical properties of the final product.

The available data were arranged in three blocks:

  • the time trajectories measured through the batch, which were characterized by milestone events...
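A multiblock arrangement of this kind might be organized as follows (a sketch only — the array names and the number of batches and time points are hypothetical assumptions; the variable counts match those stated above). The three-dimensional trajectory block is typically "unfolded" batch-wise so that latent variable methods can operate on a two-dimensional matrix:

```python
import numpy as np

# Hypothetical sketch of a three-block batch data set. Shapes for the number
# of batches and time points are assumed for illustration; the 11 initial
# conditions, 10 trajectory variables, and 11 final properties follow the text.
n_batches, n_times = 80, 120

Z = np.zeros((n_batches, 11))            # initial raw material conditions
X = np.zeros((n_batches, n_times, 10))   # time trajectories of process variables
Y = np.zeros((n_batches, 11))            # physical properties of the final product

# Batch-wise unfolding: one row per batch, with each variable's entire
# trajectory laid out side by side, ready for PCA/PLS modeling.
X_unfolded = X.reshape(n_batches, n_times * 10)
print(X_unfolded.shape)  # (80, 1200)
```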
