
(229c) Sustain-GPT: A Large Language Model for Creating a Structured Database from Unstructured Text Resources to Develop Life Cycle Inventories

Authors 

Bakshi, B., Ohio State University
Technological advancement, aimed at improving the quality of life for humankind and providing equitable access to resources for all, often has an adverse impact on nature. The systematic accounting of such impacts in the design and deployment of technologies is, however, a recent development [1]. Life-cycle assessment (LCA) is the methodology most commonly used to assess the overall environmental impact of a technology or product across all steps in its value chain. LCA thus serves as a tool to compare the impacts of an innovation against conventional alternatives, as well as among multiple innovations [2]. One of the most crucial and tedious steps in LCA is collecting datasets for subsequent analysis [3]. These datasets combine background and foreground process data, and thus require both domain knowledge of the process being studied and access to life-cycle inventories (LCI). LCIs vary in the granularity of the data they include, depending on how common the process is, as well as in their spatial coverage. Sources of inventories include ecoinvent, USLCI, and GaBi. Additionally, the scientific literature on LCA studies is a rich source of inventory data that may or may not be covered by background datasets. The provenance of these inventories, and hence the quality of the data, varies across articles, which introduces uncertainty.
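To make the data-collection step concrete, the sketch below shows one way a single foreground LCI record could be represented in code. The field names and the methanol numbers are illustrative assumptions only; they are not taken from any of the databases or articles mentioned above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Flow:
    """A single input or output flow of a unit process."""
    name: str          # e.g. "natural gas"
    amount: float      # quantity per functional unit
    unit: str          # e.g. "MJ", "kg"
    source: str = ""   # publication or database the value was taken from

@dataclass
class LCIEntry:
    """A minimal foreground life-cycle inventory record for one process."""
    process: str
    functional_unit: str
    inputs: List[Flow] = field(default_factory=list)
    outputs: List[Flow] = field(default_factory=list)

# Illustrative entry for methanol synthesis per 1 kg of methanol;
# all numbers below are placeholders, not values from ecoinvent, USLCI, or GaBi.
methanol = LCIEntry(
    process="methanol synthesis",
    functional_unit="1 kg methanol",
    inputs=[
        Flow("natural gas", 35.0, "MJ", source="placeholder"),
        Flow("process water", 1.5, "kg", source="placeholder"),
    ],
    outputs=[
        Flow("methanol", 1.0, "kg"),
        Flow("carbon dioxide", 0.6, "kg", source="placeholder"),
    ],
)
```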

These unstructured datasets constitute ~80% of all available life-cycle inventory information and exist in the form of text, tables, and images. Systematic mining can potentially exploit the size of these datasets and the knowledge hidden in them. Natural language processing (NLP), a subfield of machine learning (ML), is a suitable candidate for this task. The breakthrough in NLP was the introduction of the transformer architecture in 2017 [4]. This was followed by two foundational models: Bidirectional Encoder Representations from Transformers (BERT) [5] and the Generative Pre-trained Transformer (GPT) [6]. BERT, pre-trained on a large corpus with parameter counts ranging from 110M to 340M, can be further fine-tuned on a custom dataset using a transfer-learning approach. Since then, BERT has been adapted to specialized scientific domains, yielding SciBERT [7], BioBERT [8], PatentBERT [9], and Recycle-BERT [10]. The GPT family has likewise been applied in various settings, such as BioGPT [11], GPT-GNN [12], and successive GPT versions [13], [14]. A widely used application built on the GPT family is ChatGPT, which performs question-answering and text-generation tasks. Transformer-based models are called large language models (LLMs) when their parameter counts reach the billions. In 2023, Meta released Llama-2, an open family of LLMs with parameter counts ranging from 7B to 70B [15].
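As a generic illustration of the transfer-learning pattern described above (not the pipeline used in this work), the following sketch fine-tunes a pre-trained BERT checkpoint on a tiny text-classification set using the Hugging Face transformers and datasets libraries. The example texts, labels, and training settings are assumptions introduced only to show the pattern.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import Dataset

# Toy labeled data: 1 = process-relevant sentence, 0 = irrelevant (illustrative only).
texts = ["Methanol is produced from syngas.", "The weather was pleasant today."]
labels = [1, 0]

# Load a pre-trained BERT and attach a fresh classification head (transfer learning).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

# Tokenize the custom corpus.
dataset = Dataset.from_dict({"text": texts, "label": labels})
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True,
                                          padding="max_length", max_length=64))

# Fine-tune on the custom dataset.
args = TrainingArguments(output_dir="bert-finetuned", num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, train_dataset=dataset).train()
```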

In this work, we leverage the capabilities of the Llama-2 model and propose Sustain-GPT, a large language model pre-trained and fine-tuned on research articles. It helps us retrieve quantitative and qualitative information from chemical-process literature. The extracted knowledge can be used in various problem-solving approaches by creating databases, such as custom LCI databases and reaction databases for chemical processes, with uncertainty estimates for the reported values. As a preliminary result, we showcase Sustain-GPT applied to creating an LCI subset for the methanol synthesis process. We use full-text research articles and abstracts extracted from the Elsevier database, via an API key and institutional token, with the specific keyword "life cycle assessment for methanol synthesis." The total number of full-text articles downloaded is ~5k. Figure 1(a) shows the trend in the number of relevant publications over the last 24 years. We pre-trained the Llama-2 model on our custom corpus, the full text of these research articles, and then fine-tuned the custom pre-trained model for a question-and-answer task. The model was also encapsulated in a chatbot framework using the open-source Python library Gradio. To fine-tune the model, we generated ~500 question-and-answer pairs related to methanol LCA. The learning curve is shown in Figure 1(b); the loss decreases with training steps, and the proximity of the curves indicates that the model fits the data well without overfitting. We tested the model by posing a set of questions to Sustain-GPT and retrieving quantitative information needed to build an LCI for a functional unit of 1 kg of methanol. The responses are compared with a standard LCI database and tabulated in Table 1. The results are close to the standard data, making the preliminary results promising.
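The sketch below illustrates the chatbot-encapsulation step with Gradio: a locally stored, fine-tuned causal language model is loaded with the Hugging Face transformers library and served through a chat interface that can be queried for LCI values. The checkpoint name "sustain-gpt-llama2-7b" and the generation settings are hypothetical; the abstract does not specify them.

```python
import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "sustain-gpt-llama2-7b"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float16,
                                             device_map="auto")

def answer(question: str, history: list) -> str:
    """Generate an answer to an LCI-related question from the fine-tuned model."""
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Strip the prompt tokens and return only the generated continuation.
    return tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

# ChatInterface provides the question-and-answer chatbot front end described above.
gr.ChatInterface(answer).launch()
```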

Sustain-GPT's scope covers quantitative knowledge extraction for the industrial processes used to synthesize various chemicals. The variables of interest include product yield, process costs, and reagent use, which populate the input and output flows of an inventory. To this end, we will retrain the proposed model on reliable literature for various chemical synthesis processes. The overarching LCI data-retrieval system will consist of super prompts. A super prompt is essentially a combination of sub-prompts that progressively populate the various fields of an LCI, so that creating an LCI requires only a single super prompt from the user (see the sketch below). Additionally, this will allow the compilation of foreground LCIs for hitherto uninvestigated processes. Finally, sourcing data from multiple publications will yield a range of values for each variable and a sense of the associated uncertainty, giving users the freedom to populate the input and output flows for their LCA study.
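As a sketch of the planned super-prompt mechanism, the code below expands a single user request into field-specific sub-prompts, queries the model once per field, and summarizes the values returned from multiple source publications as a range plus a median. The field list, prompt wording, and the ask_model callable are assumptions introduced for illustration; they are not defined in the abstract.

```python
import statistics
from typing import Callable, Dict, List

# Hypothetical LCI fields a super prompt would populate for a chemical process.
LCI_FIELDS = ["natural gas input (MJ)", "electricity input (kWh)",
              "process water input (kg)", "carbon dioxide emissions (kg)"]

def build_sub_prompts(process: str, functional_unit: str) -> Dict[str, str]:
    """Expand one super prompt into a sub-prompt per LCI field."""
    return {f: (f"For {process}, per {functional_unit}, report the {f} "
                f"as a single number.") for f in LCI_FIELDS}

def compile_lci(process: str, functional_unit: str,
                ask_model: Callable[[str], List[float]]) -> Dict[str, dict]:
    """Query the model for each field and summarize the spread of values
    gathered from multiple publications."""
    inventory = {}
    for field_name, prompt in build_sub_prompts(process, functional_unit).items():
        values = ask_model(prompt)  # one value per source publication
        inventory[field_name] = {"min": min(values), "max": max(values),
                                 "median": statistics.median(values)}
    return inventory

# Toy usage with a stand-in for the fine-tuned model; the numbers are placeholders.
if __name__ == "__main__":
    stand_in = lambda prompt: [0.9, 1.0, 1.2]
    print(compile_lci("methanol synthesis", "1 kg methanol", stand_in))
```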

References

[1] R. G. Hunt and W. E. Franklin, “LCA - How it Came about - Personal Reflections on the Origin and the Development of LCA in the USA,” International Journal of Life Cycle Assessment, vol. 1, no. 1, pp. 4–7, 1996, doi: 10.1007/BF02978624.

[2] R. Heijungs et al., “Environmental Life Cycle Assessment of Products: Guide,” Technical Report, Centre of Environmental Science (CML), Leiden University, 1992.

[3] R. Heijungs and S. Suh, The Computational Structure of Life Cycle Assessment. Dordrecht: Kluwer Academic Publishers, 2002.

[4] A. Vaswani et al., “Attention Is All You Need,” in Advances in Neural Information Processing Systems, vol. 30, 2017.

[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of NAACL-HLT 2019, pp. 4171–4186, 2019.

[6] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving Language Understanding by Generative Pre-Training,” OpenAI, Technical Report, 2018.

[7] I. Beltagy, K. Lo, and A. Cohan, “SciBERT: A Pretrained Language Model for Scientific Text,” EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, pp. 3615–3620, Mar. 2019, doi: 10.18653/v1/d19-1371.

[8] J. Lee et al., “BioBERT: a pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, vol. 36, no. 4, pp. 1234–1240, Feb. 2020, doi: 10.1093/bioinformatics/btz682.

[9] J.-S. Lee and J. Hsiang, “PatentBERT: Patent Classification with Fine-Tuning a pre-trained BERT Model,” May 2019, Accessed: Apr. 04, 2024. [Online]. Available: http://arxiv.org/abs/1906.02124

[10] A. Kumar, B. R. Bakshi, M. Ramteke, and H. Kodamana, “Recycle-BERT: Extracting Knowledge about Plastic Waste Recycling by Natural Language Processing,” ACS Sustain Chem Eng, vol. 11, no. 32, pp. 12123–12134, Aug. 2023, doi: 10.1021/acssuschemeng.3c03162.

[11] R. Luo et al., “BioGPT: generative pre-trained transformer for biomedical text generation and mining,” Brief Bioinform, vol. 23, no. 6, pp. 1–11, Nov. 2022, doi: 10.1093/BIB/BBAC409.

[12] Z. Hu, Y. Dong, K. Wang, K. W. Chang, and Y. Sun, “GPT-GNN: Generative Pre-Training of Graph Neural Networks,” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1857–1867, Aug. 2020, doi: 10.1145/3394486.3403237.

[13] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language Models are Unsupervised Multitask Learners,” OpenAI, Technical Report, 2019.

[14] T. B. Brown et al., “Language Models are Few-Shot Learners,” in Advances in Neural Information Processing Systems, vol. 33, 2020. [Online]. Available: https://arxiv.org/abs/2005.14165v4

[15] H. Touvron et al., “Llama 2: Open Foundation and Fine-Tuned Chat Models,” Jul. 2023, Accessed: Apr. 04, 2024. [Online]. Available: https://arxiv.org/abs/2307.09288v2

[16] P. G. Levi and J. M. Cullen, “Mapping Global Flows of Chemicals: From Fossil Fuel Feedstocks to Chemical Products,” Environmental Science & Technology, vol. 52, no. 4, pp. 1725–1734, 2018.