(38c) Analysing Big Data in Dairy Processing, By Throwing Most of It Away

Depree, N., University of Auckland
Young, B. R., University of Auckland
Prince-Pike, A., University of Auckland
Wilson, D. I., AUT University
Dairy processors in New Zealand seek to reduce out-of-specification production of value-added milk powders. While the process is tightly controlled and such events are rare, detection of some faults takes up to three days post-production, leading to significant downgrades given the very high production rates. Grading tests indicate whether powders are fit for release, but they are pass/fail judgements rather than measurements well suited to data analysis. It is therefore hard to identify the plant operating conditions that lead to failure, and furthermore, the underlying physical mechanisms are not always well understood.

The examination of multiple industrial plants spread over geographically separate sites, all producing similar products, requires careful management of significant amounts of data. Ideally this “Big Data” approach would underpin models describing the process, which could be used to predict key functional properties of the milk powder and deliver early warning signals before product slips out of specification. Extensive work was done on the very difficult task of collecting and aligning several years of process and quality data from plants around New Zealand, representing a range of designs, ages, geographical locations, process control schemes, and data storage types.

However, as recent publications have shown, and this study reinforced, the creation of the dataset is typically a much larger job than the actual modelling and analysis, which at times can seem as trivial as sending the data to a regression function. It is becoming increasingly evident that advanced technical knowledge is not simply waiting to be unlocked by the “Big Data Revolution”. Our early approaches led to a range of models that suffered from poor predictability and were of little practical use. The poor performance appeared to stem from a combination of factors: measurements needed to discern the key physical phenomena were not available (or possible), and there was not enough data in the main regions of interest, owing to the rarity of failure.

Whilst adding new instruments is difficult and expensive, substantial improvements in predictability were made not by collecting even more data, but by “throwing much of it away”. Changes in plant operations, equipment, and products mean that only “small data” models can be applied to subsets of the data. It was necessary to go back to first principles and apply detailed operational knowledge to find ways to improve the time alignment of process data and quality samples. Statistical resampling of data sets was applied to give more weight to rare occurrences, and categorical modelling was used to find these occurrences.
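The resampling idea can be illustrated with a minimal sketch: rare out-of-spec batches are resampled with replacement up to parity with in-spec batches before a categorical model is fitted, so the classifier is not swamped by the common case. The label name, class balance, and record structure here are assumptions for illustration only.

```python
import random

def oversample_minority(records, label_key="in_spec", seed=0):
    """Balance a binary-labelled data set by oversampling the minority class.

    records: list of dicts, each carrying a boolean label under `label_key`.
    Returns a new list in which the minority class is resampled with
    replacement until it matches the majority class size.
    """
    rng = random.Random(seed)
    pos = [r for r in records if r[label_key]]
    neg = [r for r in records if not r[label_key]]
    minority, majority = sorted((pos, neg), key=len)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra
```

Such oversampling only re-weights the few failure examples that exist; it cannot manufacture information about operating regions that were never observed, which is the deeper limitation discussed below.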

This approach, and the need for fundamental engineering information and careful examination of the physical process, was reinforced by the investigation of a different big data set from cream cheese production. The data was analysed to give an overview of the behaviour of the plant, but progress in modelling and prediction required significant work and low-level calculations to dissect the data. Rather than multivariate data analysis alone, this required genuine engineering knowledge: how the plant is automatically controlled, how the operators physically conduct manual tasks, and the vagaries of how the different instruments measure and are recorded.

This paper will describe two examples of how “Big Data” - that is, the collection and preparation of the data - is not a solution to our problems, but rather a tool that still requires careful investigation and engineering knowledge to solve engineering problems. Unfortunately, even the biggest data sets often do not contain enough information, either because key quantities are not measured, or because there is not enough data in the regions of real interest. In particular, models intended to predict certain failures need substantial data that exhibits these failures. We describe this problem and how we were able to address it.