(49a) Toward Rapid Chemical Process-Based Life Cycle Inventory Data Generation: From Modeling Frameworks to Simulation to Machine Learning | AIChE



Parvatker, A. - Presenter, Northeastern University
Eckelman, M. - Presenter, Northeastern University
Commercial LCI databases account for less than 1% of the more than 80,000 chemicals available in the US market alone. This large gap signals the need to estimate or directly collect LCI data for the large majority of chemicals that are currently not represented. This abstract covers findings from multiple projects with the overall aim of providing LCA practitioners with data sets and tools that will enable accurate estimation of chemical LCI data based on chemical engineering design principles. The work includes detailed comparison of existing estimation methods; application of those methods to generate a large, harmonized LCI data set for organic chemicals; and application of statistical and machine learning methods to build a model for LCI data generation.

In the absence of actual data from production plants or industrial process information, material inputs for the LCI are usually estimated from balanced stoichiometric reactions. These also provide information on process outputs in the form of co-products and emissions. Additional process-related details such as reaction temperature, separation operations, and properties of the chemicals involved can and should be used for estimating process energy use. Chemical process data can often be found in the form of laboratory-scale experiments in the patent literature. Scale-up considerations when using such data have been explored by Piccinno et al., while Smith et al. presented a methodology combining process simulation and synthetic organic chemical manufacturing industry (SOCMI) emission factors for generating detailed chemical LCIs. Others have proposed methodologies for specific sectors such as bio-based chemicals, fine and specialty chemicals, and pharmaceuticals to address the scarcity of LCIs in these sectors. Other methods, including proxy LCIs, similarity-based approaches, and molecular structure-based models (MSM), capitalize on the chemical LCIs already present in databases to estimate missing LCI data for other chemicals. While the proxy method assumes, based on expert judgment, that similar processes imply similar LCI inputs, the similarity-based approach presented by Hou et al. applied computational data analysis to accurately estimate missing LCI data when less than 5% of the data in a process is missing. Molecular structure-based models developed by Wernet et al. and Song et al. estimate life-cycle impact assessment results rather than inventory data.
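As a minimal sketch of the stoichiometric approach, the mass of each reactant required per kilogram of product follows directly from the balanced reaction and molar masses, scaled by yield. The ethylene-hydration example and the 95% yield below are illustrative assumptions, not values from the studies cited:

```python
def stoichiometric_inputs(reactants, product_mw, product_coeff=1.0, yield_frac=1.0):
    """kg of each reactant per kg of product, from a balanced reaction.

    reactants: {name: (stoichiometric coefficient, molar mass g/mol)}
    Inputs scale with 1/yield when conversion is incomplete.
    """
    return {
        name: (coeff * mw) / (product_coeff * product_mw) / yield_frac
        for name, (coeff, mw) in reactants.items()
    }

# illustrative: C2H4 + H2O -> C2H5OH (direct hydration, 95% yield assumed)
inputs = stoichiometric_inputs(
    {"ethylene": (1, 28.05), "water": (1, 18.02)},
    product_mw=46.07, yield_frac=0.95)
```

At 100% yield this reduces to the familiar mass ratio of stoichiometric coefficients times molar masses.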

All of these methodologies for building chemical LCIs rely on varying levels of data availability, and we evaluated their accuracy against actual plant data. Global warming results for "shortcut" methods such as stoichiometry and proxy LCIs were 35-50% lower than those of LCIs based on actual plant data, whereas process-based methods performed much better, differing from the plant LCI by only 1-10% in the greenhouse gas impact category. Energy data obtained from process simulation were closest to the actual plant data, which underscores the need for mechanistic models that capture the complexities of actual chemical plant operation.

Measuring energy consumption for each process in a chemical plant is not always possible: plant configurations are complex, with return flows, batch sequencing, and split process streams, and utility consumption is often available only at the plant level. Jiménez-González et al., in their seminal work, used traditional chemical process design principles and rules of thumb to develop a methodology for gate-to-gate LCI data for chemicals. This methodology was later used by Kim et al. to estimate gate-to-gate manufacturing energy data for 43 organic and 43 inorganic chemicals, and some of these results are still used in the ecoinvent database. Other disaggregated LCI datasets in the ecoinvent database use estimates based on design calculations, energy for similar processes, other literature sources, and, in rare cases, data from industry sources. Several other disaggregated chemical datasets use an average value for heat and electricity based on the energy consumed in manufacturing chemical products at the GENDORF Chemical Park. For a multi-product batch plant, a bottom-up approach is more appropriate, wherein process energy is calculated from a model of each unit operation and aggregated to determine total energy.
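The bottom-up approach can be sketched as summing a per-unit-operation energy model over the process train. The unit operations and duties below are hypothetical placeholders, not data from any cited study:

```python
def batch_plant_energy(unit_ops):
    """Aggregate per-unit-operation duties (MJ/batch) into plant totals.

    Positive duty = heating demand; negative duty = cooling demand.
    """
    heating = sum(u["heat_mj"] for u in unit_ops if u["heat_mj"] > 0)
    cooling = sum(-u["heat_mj"] for u in unit_ops if u["heat_mj"] < 0)
    return heating, cooling

# hypothetical batch process duties, MJ per batch
ops = [
    {"name": "reactor heat-up",               "heat_mj": 120.0},
    {"name": "exothermic reaction cooling",   "heat_mj": -80.0},
    {"name": "distillation reboiler",         "heat_mj": 260.0},
    {"name": "condenser",                     "heat_mj": -210.0},
]
heating_mj, cooling_mj = batch_plant_energy(ops)
```

In practice each duty would come from a unit-operation model (heat of reaction, sensible heat, reboiler/condenser balances) rather than a fixed number.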

In our work we demonstrate a streamlined, process simulation-based methodology for estimating energy consumption in chemical manufacturing when process data availability is limited. The methodology is applied to 151 different chemical processes using Aspen Plus to estimate their gate-to-gate process energy use, representing the largest such simulation-based LCI dataset to date. Further, pinch analysis is used for process heat integration instead of rules of thumb. Results broken out by utility type for each chemical process further enhance the information provided by the LCI data.
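For context, pinch analysis sets minimum hot- and cold-utility targets with the problem-table (cascade) algorithm. The sketch below uses two illustrative streams, not data from the 151 simulated processes:

```python
def min_utilities(hot, cold, dt_min=10.0):
    """Problem-table (cascade) algorithm for minimum utility targets.

    hot, cold: lists of (supply_T, target_T, CP) in degC and kW/K.
    Returns (Q_hot_min, Q_cold_min) in kW.
    """
    # shift temperatures by dt_min/2 toward each other; sign of CP
    # marks heat source (+, hot) vs heat sink (-, cold)
    shifted = [(ts - dt_min / 2, tt - dt_min / 2, cp) for ts, tt, cp in hot]
    shifted += [(ts + dt_min / 2, tt + dt_min / 2, -cp) for ts, tt, cp in cold]

    bounds = sorted({t for ts, tt, _ in shifted for t in (ts, tt)}, reverse=True)
    cascade, running = [0.0], 0.0
    for hi, lo in zip(bounds, bounds[1:]):
        # net CP of all streams spanning this shifted-temperature interval
        net_cp = sum(cp for ts, tt, cp in shifted
                     if min(ts, tt) <= lo and max(ts, tt) >= hi)
        running += net_cp * (hi - lo)  # surplus (+) or deficit (-)
        cascade.append(running)

    q_hot = -min(cascade)          # smallest hot-utility input that keeps cascade >= 0
    q_cold = q_hot + cascade[-1]   # overall energy balance gives cold utility
    return q_hot, q_cold

# illustrative: one hot stream (170->40 degC, 3 kW/K),
# one cold stream (60->150 degC, 4 kW/K), dt_min = 10 degC
q_h, q_c = min_utilities(hot=[(170, 40, 3.0)], cold=[(60, 150, 4.0)])
```

For these streams the targets come out to 60 kW hot and 90 kW cold utility, versus 390 kW and 360 kW with no heat integration at all.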

The total heating requirement for the chemicals assessed ranges from 0.1 to 24 MJ/kg with an average of 3.1 MJ/kg product, while the average cooling requirement before heat integration is 4.5 MJ/kg. More than half of the total energy requirement comes from the separation section. Steam is the most widely used hot utility in the chemical industry, which is reflected in the simulation results: it accounts for 70% of the heating requirements, while air and cooling water account for 40% of the cold utilities used. The results correlate closely with the subset of ecoinvent and other existing chemical LCI data that is based on process modeling, but not with existing data generated by simpler estimation methods. This dataset represents the largest harmonized set of chemical process energy data. In the final project, we used this dataset to build predictive models.

The use of statistical and machine learning methods to predict life-cycle inventory data has so far been limited by the unavailability of the chemical production process data needed for calibration/training, but is beginning to be explored. Classification and regression tree (CART) and random forest (RF) methods have been used to estimate emissions from (rather than inputs to) chemical processes; the symmetric mean absolute percentage errors of the CART and RF models estimating process emissions on out-of-sample data were 80% and 92%, respectively. Pereira et al. used classification and regression trees to predict the range of steam consumption in batch chemical processes using predictor variables such as reaction type, reaction time, and the presence of a distillation operation in the process. The study also presents models using probability density functions (PDFs) fitted on 250 data points from batch chemical processes, classified by reaction type. The accuracy of these PDF models, evaluated on an additional industrial dataset of 17 data points, was 40%, while for the classification trees it ranged from 35% to 80% across five models with varying process inputs as classification rules.
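For reference, the symmetric mean absolute percentage error used to score such models is commonly computed in its 0-200% form as:

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error (0-200% form):
    mean of 2*|p - a| / (|a| + |p|), expressed as a percentage."""
    return 100.0 * sum(
        2 * abs(p - a) / (abs(a) + abs(p))
        for a, p in zip(actual, predicted)) / len(actual)
```

Unlike plain MAPE, the denominator averages actual and predicted values, so over- and under-predictions are penalized symmetrically and the metric stays bounded.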

These prior studies address specific LCI data such as emissions and energy use and are limited by their application to particular unit operations such as distillation and chemical reaction. There is therefore a compelling need for computational models that predict total energy use in chemical manufacturing from process parameters and chemical properties with physical significance in the manufacturing process, rather than from molecular descriptors of the target molecule alone, particularly as 20 of the top 26 chemicals in energy consumption and GHG emissions are organic chemicals manufactured by continuous processes.

Using our simulation-based dataset, we selected a mix of predictor variables, including process parameters, thermodynamic properties of reactants and products, and reaction types, to develop multiple linear regression (MLR) and artificial neural network (ANN) models for predicting both heating and cooling. Examining model performance over ten different runs, the mean R2 values for T_heat and T_cool were 0.65 and 0.60, respectively, and the mean absolute percentage errors (MAPE) were 0.39 and 0.56, demonstrating that the selected models perform equally well when trained and tested on different subsets of the data. Some models achieved R2 values up to 0.82. Further investigation of the test data from cross-validation runs showed that one-third of the predicted test data for T_heat had a percentage error of less than 25%. Processes with high heating requirements, such as furan (23.7 MJ/kg) and methyl methacrylate (12.7 MJ/kg), driven by higher energy use in the separation section, were underestimated by 10 to 30%. Different chemical reaction types were also investigated and incorporated into the predictive modeling.
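A minimal sketch of the MLR workflow (ordinary least-squares fit, then R2 and MAPE scored on held-out data) is shown below. The predictors are synthetic stand-ins; the actual process and thermodynamic predictor set and the 151-process dataset are not reproduced here:

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares with an intercept column."""
    Xb = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef

def predict(coef, X):
    Xb = np.column_stack([np.ones(len(X)), X])
    return Xb @ coef

def r2_score(y, yhat):
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mape(y, yhat):
    return np.mean(np.abs((y - yhat) / y))

rng = np.random.default_rng(0)
# synthetic stand-ins for process/thermodynamic predictors and energy demand
X = rng.uniform(size=(150, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 5.0 + rng.normal(scale=0.3, size=150)

train, test = np.arange(100), np.arange(100, 150)
coef = fit_mlr(X[train], y[train])
yhat = predict(coef, X[test])
print(round(r2_score(y[test], yhat), 3), round(mape(y[test], yhat), 3))
```

In the actual study this evaluation is repeated over multiple cross-validation runs and the R2 and MAPE values are averaged across runs.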

Estimating total heating and cooling requirements from physically relevant predictors with an error of less than 50% for more than 70% of predictions is a significant improvement over current LCIs, which use a single value for all organic processes based on proxy or historical plant data. A major remaining challenge is to test the accuracy of these simulation-based models against industrial data on a wide range of chemicals, so that the models become truly representative of real-world conditions.

We are at a turning point in LCA where advanced data science and machine learning approaches are being introduced, allowing us to model complex processes and products with high resolution. These new approaches will greatly benefit from consistent, abundant LCI data that accurately represent chemical manufacturing practices. Our efforts resulted in a model that is an improvement on existing ones, but did not reveal a truly well-performing set of predictive parameters or modeling approach. As a community, we should discuss opportunities for new data streams, data fusion techniques, and the relative benefits of mechanistic versus statistical approaches.