(38c) Analysing Big Data in Dairy Processing, By Throwing Most of It Away
- Conference: AIChE Spring Meeting and Global Congress on Process Safety
- Year: 2018
- Proceeding: 2018 Spring Meeting and 14th Global Congress on Process Safety
- Group: Industry 4.0 Topical Conference
- Time: Monday, April 23, 2018 - 4:30pm-5:00pm
The examination of multiple industrial plants spread over geographically separate sites, all producing similar products, requires careful management of significant amounts of data. Ideally this "Big Data" approach would underpin models describing the process, which could be used to predict key functional properties of the milk powder and deliver early warning signals before product slipped out of specification. Extensive work was done on the very difficult task of collecting and aligning several years of process and quality data from plants around New Zealand, representing a range of designs, ages, geographical locations, process control schemes, and data storage types.
However, as recent publications have shown, and this study reinforced, the creation of the dataset is typically a much larger job than the actual modelling and analysis, which at times can seem as trivial as sending the data to a regression function. It is becoming increasingly evident that advanced technical knowledge is not simply waiting to be unlocked by the "Big Data Revolution". Our early approaches led to a range of models that suffered from poor predictive power and were of little practical use. This appeared to stem from a combination of factors: measurements needed to discern the key physical phenomena were not available (or not possible), and there was not enough data in the main regions of interest, owing to the rarity of failure.
Whilst adding new instruments is difficult and expensive, substantial improvements in predictive power were made, not by collecting even more data, but by "throwing much of it away". Changes in plant operations, equipment, and products mean that only "small data" models can be applied to subsets of the data. It was necessary to go back to first principles and apply detailed operational knowledge to find ways to improve the time alignment of process data and quality samples. Statistical resampling of the data sets was applied to give more weight to rare occurrences, along with categorical modelling to identify these occurrences.
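The resampling idea can be illustrated with a minimal sketch. The variable names, the synthetic process values, and the oversampling ratio below are all hypothetical, chosen only to show how rare out-of-specification records might be resampled with replacement so that an ordinary regression fit gives them more weight; they are not the actual data or method of this study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1000 in-spec runs and only 20 rare out-of-spec runs.
# X is a single process variable (e.g. a dryer temperature), y a quality
# measure (e.g. powder moisture). All values are illustrative.
X_common = rng.normal(70.0, 2.0, size=(1000, 1))
y_common = rng.normal(3.0, 0.1, size=1000)
X_rare = rng.normal(78.0, 2.0, size=(20, 1))
y_rare = rng.normal(3.8, 0.1, size=20)

# Resample the rare records with replacement so the subsequent fit
# weights the region of interest more heavily.
idx = rng.choice(len(X_rare), size=200, replace=True)
X = np.vstack([X_common, X_rare[idx]])
y = np.concatenate([y_common, y_rare[idx]])

# Ordinary least squares on the rebalanced set (design matrix with intercept).
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
```

Without the oversampling step, the 20 rare points contribute almost nothing to the least-squares objective; after rebalancing, the fitted slope reflects the out-of-spec region that the model is actually meant to predict.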
This approach, and the need for fundamental engineering information and careful examination of the physical process, was reinforced by the investigation of a different big data set from cream cheese production. The data was analysed to give an overview of the behaviour of the plant, but progress in modelling and prediction required significant work and low-level calculations to dissect the data. Rather than multivariate data analysis, this required actual engineering knowledge to understand how the plant is automatically controlled, how the operators physically conducted manual tasks, and the vagaries of how the different instruments measure and are recorded.
This paper will describe two examples of how "Big Data" - that is, the collection and preparation of the data - is not a solution to our problems but rather a tool that still requires careful investigation and engineering knowledge to understand engineering problems. Unfortunately, even the biggest data often does not contain enough information, either because key information is not measured, or because there is not enough data in the regions we are really interested in. In particular, models that are intended to predict certain failures need substantial data that exhibits those failures. We describe this problem and how we were able to address it.