(87c) Text Data Feature Extraction Via NLP Embeddings Methods: Robustness and Power Assessment | AIChE

Authors 

Strelet, E., University of Coimbra
Wang, Z., Dow Inc.
Peng, Y., The Dow Chemical Co
Rendall, R., University of Coimbra
Chin, S. T., The Dow Chemical Company
Reis, M., University of Coimbra
A wide variety of sensors and measurement instruments is now available in the Chemical Processing Industries (CPIs). With this broad spectrum of sensor technology, crucial process parameters can be measured or inferred for monitoring and control purposes [1]–[3]. However, coverage of the relevant process information remains limited. Even with the variety of instrumentation available, each sensing instrument is physically constrained to a sample, to a given section or area of the process, or to a reduced set of physical quantities. In addition, the existing instrumentation is sometimes insufficient to measure or estimate new parameters of interest or to detect certain abnormal phenomena. For example, leaks, corrosion, insulation degradation and unplanned events usually cannot be measured with existing sensor technology.

Although the diversity of measurement instrumentation is increasing, sensors are not the only data source in CPI databases. Text data from reports, alarms, process tags, etc. are potentially interesting and diverse sources of information. These data can capture relevant aspects that sensors cannot. Proper handling of process text data can therefore bring more information for process diagnosis, monitoring and control.

With the recent advances in Natural Language Processing (NLP) [4], new methods are available that extract features from text data beyond simple frequency counting. The semantics, i.e., the meaning of the text, can also be encoded as a structured numerical feature that can be used for process analysis. However, NLP models remain difficult to interpret and are essentially used as black boxes. Additionally, the power and robustness of this kind of model have not yet been explored in the CPI context. Therefore, we explore several NLP models for the text embedding task, in the scope of a real process, in order to perform an exploratory analysis of the information content and the potential associated value for process tuning [5]. Dimension reduction [6] and clustering [7] methods were used to assess the embedding methods and derive several robustness and power metrics.
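To make the limitation of simple frequency counting concrete, the following minimal sketch builds bag-of-words count vectors and compares them with cosine similarity. It is an illustration only, not the method evaluated in this work: the log entries are hypothetical, and the helper names `bag_of_words` and `cosine_similarity` are assumptions introduced here.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Map a text to a sparse vector of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical operator log entries, invented for illustration only.
log_a = "Pump seal leaking near reactor inlet"
log_b = "Fluid escaping from gasket at vessel feed"
log_c = "Pump seal replaced near reactor inlet"

sim_ab = cosine_similarity(bag_of_words(log_a), bag_of_words(log_b))
sim_ac = cosine_similarity(bag_of_words(log_a), bag_of_words(log_c))
print(f"a vs b (similar event, different words): {sim_ab:.2f}")
print(f"a vs c (different event, shared words): {sim_ac:.2f}")
```

In this toy case the two entries that describe a similar event share no words and score zero, while the entry describing a different event scores about 0.83 because it reuses most of the vocabulary. Semantic embeddings are intended to reduce this kind of mismatch, although how well they do so on process text is the question the study examines.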


References

[1] C. H. Goh, "Representing and reasoning about semantic conflicts in heterogeneous information systems," Thesis, Massachusetts Institute of Technology, 1997. Accessed: Oct. 23, 2019. [Online]. Available: https://dspace.mit.edu/handle/1721.1/10713

[2] V. Sheokand and V. Singh, "Modeling Data Heterogeneity Using Big DataSpace Architecture," in Advanced Computing and Communication Technologies, vol. 452, R. K. Choudhary, J. K. Mandal, N. Auluck, and H. A. Nagarajaram, Eds. Singapore: Springer Singapore, 2016, pp. 259–268.

[3] M. S. Reis, R. D. Braatz, and L. H. Chiang, "Big Data - Challenges and Future Research Directions," Chemical Engineering Progress, no. Special Issue on Big Data (March), pp. 46–50, 2016.

[4] D. Antons, E. Grünwald, P. Cichy, and T. O. Salge, "The application of text mining methods in innovation research: current state, evolution patterns, and development priorities," R & D Management, vol. 50, no. 3, pp. 329–351, Jun. 2020, doi: 10.1111/radm.12408.

[5] K. Lu, A. Grover, P. Abbeel, and I. Mordatch, "Pretrained Transformers as Universal Computation Engines." arXiv, Jun. 30, 2021. Accessed: Sep. 1, 2022. [Online]. Available: http://arxiv.org/abs/2103.05247

[6] L. McInnes, J. Healy, and J. Melville, "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction," arXiv:1802.03426 [cs, stat], 2018. Accessed: Oct. 12, 2020. [Online]. Available: http://arxiv.org/abs/1802.03426

[7] L. McInnes and J. Healy, "Accelerated Hierarchical Density Clustering," in 2017 IEEE International Conference on Data Mining Workshops (ICDMW), Nov. 2017, pp. 33–42. doi: 10.1109/ICDMW.2017.12.