(64b) Challenges and Benefits of Aligning and Reconciling Process Data for Seasonal Processing Industries

Authors: 
Depree, N., University of Auckland
Young, B. R., University of Auckland
Wilson, D. I., AUT University
Boiarkina, I., University of Auckland
Prince-Pike, A., University of Auckland
Croy, R., Fonterra Co-operative Group Limited

Industrial chemical manufacturing plants produce both chemical products and data, the latter in increasingly large amounts. While the former is obviously the raison d’être of the plant, it is becoming increasingly evident that the data has intrinsic worth, yet is often undervalued. The data originates from many sources: pressure, flow, and temperature transducers throughout the unit operations; records produced by analytical chemists in adjacent laboratories on day-old product; external environmental and/or financial conditions; and perhaps feedback from customers, partially filtered by the sales and marketing division. In many instances, all this heterogeneous data (at times of dubious quality) is stored in separate and incompatible databases behind imposing barriers, which makes it hard to draw any meaningful conclusions from this wealth of data. And this is just for one plant; what happens when one wants to compare across multiple sites?

Over twenty years ago, the advantages of fully exploiting captured data were recognised, and mathematically optimal strategies for producing a reconciled and consistent data set were proposed by a number of authors. Typically these strategies were cast as constrained optimisation problems, where the intent was to produce a set of adjusted process measurements that satisfied known mass and energy balance constraints. While academic interest seems to have waned somewhat, such approaches are still relevant, and feasible even with the large data sets routinely generated today. However, simply constructing a reconciled data set is only the first step in what is potentially possible.
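For the linear case, this constrained weighted least-squares problem has a well-known closed-form solution. The sketch below reconciles flows around a single splitter node; the flowsheet, measurements, and variances are illustrative only, not taken from the plants discussed here:

```python
import numpy as np

# Linear data reconciliation: adjust raw measurements x as little as
# possible (weighted by measurement variance) so that the mass balance
# A @ x_hat = 0 holds exactly.
# Illustrative flowsheet: stream 1 splits into streams 2 and 3.
A = np.array([[1.0, -1.0, -1.0]])        # one balance node

x = np.array([101.0, 64.0, 39.0])        # raw flows (imbalance = -2)
sigma = np.diag([4.0, 1.0, 1.0])         # measurement variances

# Closed form: x_hat = x - Sigma A^T (A Sigma A^T)^{-1} A x
adjust = sigma @ A.T @ np.linalg.solve(A @ sigma @ A.T, A @ x)
x_hat = x - adjust

print(x_hat)       # reconciled flows; the noisier stream 1 moves the most
print(A @ x_hat)   # balance residual, numerically zero
```

Note how the adjustment is apportioned in proportion to each sensor's variance: the least-trusted measurement absorbs most of the imbalance.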

This work describes the construction of a reconciled and aligned database spanning three separate large-scale milk powder plants in New Zealand. The production of milk powder has some special challenges compared to, say, bulk chemicals. First, the raw material is a biological product of cows grazing pasture in different geographical locations, so there are many uncontrolled, and strongly correlated, seasonal variables. Second, milk production in New Zealand is largely seasonal: a plant typically operates for nine months of the year, with the annual three-month shutdown used for sometimes significant plant upgrades or changes. Consequently, data captured in one season can be very different from the next, due to seemingly small plant changes. Finally, quantification of the end-properties of the milk powder is subtle, challenging to measure repeatably, and can be hard to relate to production conditions due to the complex multivariate nature of the underlying physico-chemical relations.

The motivation behind the creation of this database is to investigate the potential of ‘Real Time Quality’ in the production of Instant Whole Milk Powder (IWMP). Across the dairy industry, the focus of producers is increasingly shifting from maximising production to maximising quality, towards higher-value milk powders and premium products, which necessarily have higher requirements in terms of performance and composition. This covers a very broad range of attributes, including dozens of physical, functional, sensory, and microbiological measurements, not all of which are explicitly controlled at the time of manufacture. Functional properties, such as dissolution behaviour, taste, or texture, are challenging to control and quantify accurately given their inherently qualitative character, and they are often not tested regularly. The determining physical causes are not always well known, and plant operators may rely on rules of thumb, or may have no chance to affect functional property outcomes if the test results are not timely. It is therefore advantageous to perform quality control in real time, to prevent many tonnes of off-specification powder being produced before the infrequent and delayed offline measurements detect a problem.

It was hypothesised that multivariate regression on a very broad and deep dataset, containing many process and quality variables from several different powder plants producing the same premium product, across several years of production seasons, may provide sufficiently varied input data to allow prediction of powder functional properties, suitable for real-time decision making. This avoids the problem of insufficient excitation in the raw data from plants operating at steady state, which precludes meaningful models. It first required construction of the dataset itself, a vast undertaking to combine and align many sources of data spread over time, geography, and changes in plant design and operating methods.
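The specific regression method is not detailed here; as a minimal sketch of the prediction task, the example below fits an ordinary least-squares model from many correlated process variables to a few functional properties, on synthetic data generated from shared latent factors (all dimensions and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real dataset: 500 hourly samples,
# 40 process measurements (X) and 3 functional properties (Z),
# both driven by a handful of shared latent factors.
n_samples, n_x, n_latent = 500, 40, 3
latent = rng.normal(size=(n_samples, n_latent))
X = latent @ rng.normal(size=(n_latent, n_x)) + 0.1 * rng.normal(size=(n_samples, n_x))
Z = latent @ rng.normal(size=(n_latent, 3)) + 0.1 * rng.normal(size=(n_samples, 3))

# Hold out the last fifth of the samples, mimicking a later part of a season.
X_tr, X_te = X[:400], X[400:]
Z_tr, Z_te = Z[:400], Z[400:]

# Ordinary least squares; an appended column of ones absorbs the intercept.
A = np.hstack([X_tr, np.ones((400, 1))])
coef, *_ = np.linalg.lstsq(A, Z_tr, rcond=None)

Z_hat = np.hstack([X_te, np.ones((100, 1))]) @ coef
ss_res = ((Z_te - Z_hat) ** 2).sum()
ss_tot = ((Z_te - Z_te.mean(axis=0)) ** 2).sum()
r2 = 1 - ss_res / ss_tot
print(f"held-out R^2: {r2:.2f}")
```

For real plant data, with its strong collinearity and noise, a latent-variable method such as partial least squares (or a regularised regression) would normally be preferred over plain least squares; the sketch only illustrates the X-to-Z mapping being sought.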

This approach distinguishes between three types of measured data: X data, which comprise the standard process measurements such as temperature, pressure, and flow; Y data, which comprise at-line hourly measurements of in-process powder physical properties; and Z data, which are the final powder functional properties. The X data are typical measurements that exist in any process plant, while the Y data are key physical powder measurements, such as fat, protein, and moisture content. The Z data are typically prescribed by customers or regulatory bodies, and may be challenging to change or improve; the Y data, however, may be expanded with newly introduced measurements, such as bulk density or particle size distribution, which it is hypothesised will improve prediction of functional properties by bridging the gap between measured process data and final powder properties.
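Because the three groups arrive at different rates (minute-level X, hourly Y, per-sample Z), aligning them onto a common timeline is a prerequisite for any modelling. A hypothetical miniature, using a nearest-previous-timestamp join as one possible alignment rule (all column names and values are invented for illustration):

```python
import pandas as pd

# X: minute-level process measurements.
x = pd.DataFrame({
    "time": pd.date_range("2015-09-01 00:00", periods=180, freq="min"),
    "evap_temp_C": 68.0,
    "dryer_inlet_C": 200.0,
})

# Y: hourly at-line powder measurements.
y = pd.DataFrame({
    "time": pd.date_range("2015-09-01 00:00", periods=3, freq="60min"),
    "moisture_pct": [2.8, 2.9, 2.7],
})

# Z: functional-property samples taken from packed bags.
z = pd.DataFrame({
    "time": pd.date_range("2015-09-01 00:30", periods=2, freq="90min"),
    "dissolution_score": [7.1, 6.8],
})

# Align each Y, then each Z, record with the most recent record at or
# before its timestamp (an "as-of" join).
xy = pd.merge_asof(y, x, on="time")
xyz = pd.merge_asof(z, xy, on="time")
print(xyz)
```

In practice the join rule must encode each plant's actual sampling procedure; the as-of join is simply the building block.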

These three data groups are stored in different databases, not always with clear methods to cross-reference them. A further complication arises from the powder transport chain: the actual production time of any Z-data quality sample taken from packed powder bags cannot be confidently known, owing to holdup and mixing in the large blending bins and silos.
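One simple way to reason about this uncertainty, assuming the blending silo behaves roughly as a well-mixed tank, is to treat a packed bag as an exponentially weighted blend of past production. The residence time and window below are illustrative assumptions, not plant figures:

```python
import numpy as np

# Assume the blending silo acts like a single well-mixed tank (CSTR), so a
# bag packed now contains powder produced over an exponentially weighted
# window of past hours. Illustrative numbers only.
holdup_h = 6.0                         # assumed mean residence time (hours)
hours_back = np.arange(0.0, 48.0, 1.0)

# Discretised residence-time distribution of a well-mixed tank.
w = np.exp(-hours_back / holdup_h)
w /= w.sum()

mean_age = (w * hours_back).sum()
# Smallest look-back window covering 90% of the powder's origin.
cum = np.cumsum(w)
window_90 = hours_back[np.searchsorted(cum, 0.9)]
print(f"mean production age: {mean_age:.1f} h, 90% window: {window_90:.0f} h")
```

Even this crude model makes the alignment problem concrete: a single bag sample speaks for many hours of production, so Z data can only ever be matched to a window of X and Y data, never to an instant.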

Construction of the database required close examination of the specific operating and sampling processes at each plant, and of their associated data storage methods. This uncovered a large number of special cases in the data streams, each of which must be detected and cleaned up individually; this requires time-consuming manual programming and is not easily generalised to other plants. Such idiosyncrasies are, however, a fundamental feature of many types of industrial data, and cleaning them up is a key requirement for creating a useful dataset, one that would greatly benefit from the development of advanced or smart data processing methods. It is also, in our experience, a key reason why properly aligned data sets are so rarely constructed.

Missing data is a further key challenge to surmount, as routine plant cleaning cycles leave a significant fraction of plant data out of range at all times. For example, in a plant with four parallel milk evaporator trains, one or more trains are always out of service for cleaning, and the temperatures recorded for them are either ambient or those of the cleaning fluids, and must be removed from the data set when comparing process conditions. Missing data can be imputed, which may distort the behaviours observed, or alternatively any sample (row) containing missing data can be dropped. However, dropping entire rows at times causes severe data reduction, up to 100% of the data in plants with parallel unit operations where one unit is always offline, and such an approach is clearly untenable. Consequently, other methods have been proposed, such as creating composite variables from parallel unit operations, with some success.
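The composite-variable idea can be sketched on the evaporator example: mask out-of-service readings instead of dropping rows, then average over whichever trains are running. The temperatures and threshold below are illustrative assumptions:

```python
import pandas as pd

# Illustrative readings from four parallel evaporator trains; a train in a
# cleaning cycle reads near ambient (~20 C) instead of its ~68 C operating
# point. In every row here at least one train is cleaning, so dropping rows
# with any missing value would discard 100% of the data.
df = pd.DataFrame({
    "evap1_C": [68.2, 68.0, 20.5, 20.3],
    "evap2_C": [67.9, 21.0, 68.1, 68.3],
    "evap3_C": [68.4, 68.2, 68.0, 21.2],
    "evap4_C": [20.8, 68.1, 67.8, 68.0],
})

# Mask out-of-service readings (assumed threshold: below 50 C = cleaning).
in_service = df.where(df > 50.0)

# Composite variable: mean temperature over the trains actually running.
df["evap_composite_C"] = in_service.mean(axis=1, skipna=True)
print(df["evap_composite_C"].round(2).tolist())
```

The composite column is complete for every row, so no samples are lost, at the cost of no longer distinguishing the individual trains.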

Carefully framing the questions to be asked of the data is a key factor in preventing unnecessary data reduction, by informing the best data arrangement or imputation. This planning was also vital in selecting which parts of the data to use for predictive modelling, since changes in plant equipment or operating methods cause structural changes in the data set and change or break underlying relationships. For example, the full data set can be used to find similarities and differences between the three plants producing the same high-value product; however, there are significant differences in plant design between them, and they do not share the same relationships between process conditions and product quality. Product quality is therefore much better investigated using data from one plant at a time, producing models that can inform process conditions particular to that plant.

While significant work has been undertaken on the prediction of functional properties, there is also great value simply in acquiring and organising this dataset, which had not previously been done owing to the difficulty and effort required. Visualisation of this plant data using novel methods or arrangements can elucidate many behaviours, differences, and similarities between the different plants, products, and times of the season. Data visualisation is a rich subject of great value which, in our experience, is rarely well explored or exploited by plant process control engineers. To generate information from rich visualisation, however, the appropriate dataset must first be collected and aligned, which may be a prohibitive undertaking without careful planning, design, and intelligent processing methods.

The paper will illustrate uses of this data set, showing visualisation strategies for time-varying multivariate models, how plant engineers can quickly assess the current plant status against past runs, and progress in finding predictive models for use in Real Time Quality production.
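A minimal numerical sketch of comparing a current run against past seasons, assuming (purely for illustration) a per-day percentile envelope built from earlier seasons of a single variable, with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative: daily readings of one powder property over 10 past seasons
# (270 operating days each), plus the current run, which drifts high after
# a mid-season plant change at day 180.
days = np.arange(270)
seasonal = 450 + 5 * np.sin(days / 43.0)           # underlying seasonal shape
past = seasonal + rng.normal(0, 3, size=(10, 270))
current = seasonal + rng.normal(0, 3, size=270)
current[180:] += 12.0                              # step change after a tweak

# Per-day 5th-95th percentile envelope of the past seasons.
lo, hi = np.percentile(past, [5, 95], axis=0)
outside = (current < lo) | (current > hi)
print(f"{outside[180:].mean():.0%} of post-change days fall outside the envelope")
```

Plotted as a band with the current run overlaid, the same envelope gives engineers an at-a-glance view of whether today's operation resembles past seasons, which is exactly the kind of comparison the aligned dataset makes possible.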