(362e) A Comparative Evaluation of Machine Learning Algorithms in Predicting Syngas Fermentation Outcomes Using Limited Experimental Data | AIChE

Authors 

Wan, N., Washington University in St. Louis
Wen, Z., Iowa State University
Tang, Y., Washington University in St. Louis

Background and Motivation. Syngas fermentation produces C2 to C4 acids and alcohols from diverse gaseous substrates. Kinetic models have historically struggled to predict gas fermentations because the behavior is complex: it depends on gas-to-liquid mass transfer, cell biosynthesis capability, syngas composition and flow rate, product inhibition, and metabolic shifts between the acetogenic and solventogenic stages. Semi-empirical power-law models have been developed to address these shortcomings, but their equations tend to be stiff when simulating new conditions. Machine learning (ML) has emerged as a viable black-box method for discovering novel relationships in multivariate systems, and it can predict complex cellular processes without mechanistic equations that explicitly link input and output variables [1]. This study therefore applies and compares ML approaches to address three questions: How can limited fermentation data support quality ML predictions? Which ML algorithm is best for syngas fermentation predictions? How can our analysis guide future ML projects?

Methods. Time-course concentration data from Clostridium fermentations [2] were used to predict individual product production rates and time-course concentration curves. For each time point, the state of the fermentation (gas composition and extracellular metabolite concentrations) was paired with the production rates of acetate, ethanol, butyrate, and butanol. Via data augmentation, a database of 836 time points was constructed for supervised learning. This database was split into training and test data and used to train six ML algorithms: neural networks (NNs), support vector machines (SVMs), random forests (RFs), elastic nets (ENs), lasso (LA), and k-nearest neighbors (kNN). Additionally, the rate-predicting algorithms were used to generate time-course concentration data by starting from initial conditions and iteratively calculating the concentration of each product at the next time point.
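The iterative time-course generation described in the Methods can be sketched as a forward-Euler rollout. This is a minimal illustration, not the authors' implementation: the function names, the fixed time step `dt`, the constant gas composition, the clamping at zero, and the toy rate model are all assumptions made for the example.

```python
from typing import Callable, Dict, List

PRODUCTS = ("acetate", "ethanol", "butyrate", "butanol")

State = Dict[str, float]
RateModel = Callable[[State, State], State]


def rollout(rate_model: RateModel, initial: State, gas: State,
            dt: float, n_steps: int) -> List[State]:
    """Forward-Euler rollout: c[t + dt] = c[t] + rate(c[t], gas) * dt."""
    state = dict(initial)
    trajectory = [dict(state)]
    for _ in range(n_steps):
        rates = rate_model(state, gas)
        # Assumption: concentrations are clamped at zero so that a
        # negative predicted rate cannot yield a negative concentration.
        state = {p: max(0.0, state[p] + rates[p] * dt) for p in PRODUCTS}
        trajectory.append(dict(state))
    return trajectory


def toy_rate_model(state: State, gas: State) -> State:
    """Hypothetical stand-in for a trained rate regressor (not from the study)."""
    return {
        "acetate": 0.2 * gas["CO"],
        "ethanol": 0.1 * state["acetate"],
        "butyrate": 0.05 * gas["CO"],
        "butanol": 0.02 * gas["H2"],
    }


# Start from zero product concentrations and take two steps.
start = {p: 0.0 for p in PRODUCTS}
curve = rollout(toy_rate_model, start, {"CO": 1.0, "H2": 0.5}, dt=1.0, n_steps=2)
```

In practice `toy_rate_model` would be replaced by any of the trained regressors, with one prediction call per time step.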

Results. On unseen test data, predictions of acid productivity (acetate and butyrate) were more accurate than predictions of alcohol productivity (ethanol and butanol). Predictions for two-carbon products were more accurate than those for four-carbon products. A trend in our findings is that products requiring more enzymatic steps or more cofactors have less accurate predictions.

For test set rate predictions, RF performed best, with SVM a close second; both algorithms had average R2 values of ~0.35. EN and LA had moderate performance with average R2 values of ~0.30, while NN and kNN showed the worst average performance with R2 values of ~0.22. EN and LA are relatively simple algorithms with fewer fitted variables than the other ML methods. The fact that they outperformed kNN and NN indicates that kNN and NN were likely overfit. Despite NN’s overall poor performance, it offered the best predictions for ethanol production rate. This indicates that the performance of an ML model will not be uniform across syngas fermentation products, and therefore an ML algorithm should be selected only after testing multiple options.
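For readers who want to reproduce this kind of comparison, the sketch below computes a per-product R2 and its average. It assumes R2 denotes the usual coefficient of determination, 1 - SS_res / SS_tot; the abstract does not state the exact definition, and the function names are illustrative.

```python
from statistics import mean
from typing import Dict, Sequence, Tuple


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


def average_r2(true_by_product: Dict[str, Sequence[float]],
               pred_by_product: Dict[str, Sequence[float]]
               ) -> Tuple[Dict[str, float], float]:
    """Return the R2 of each product and the unweighted mean across products."""
    scores = {name: r2_score(true_by_product[name], pred_by_product[name])
              for name in true_by_product}
    return scores, mean(list(scores.values()))
```

A model that always predicts the mean of the test targets scores 0 under this definition, which gives a reference point for the reported values of ~0.22 to ~0.35.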

Interestingly, the time-course curves generated by the production rate models were more accurate than the rate predictions themselves. SVM, RF, EN, and LA were the most accurate, with test set R2 values of ~0.80. NN and kNN performed less well, with test R2 values around 0.5. A possible explanation for kNN is that the algorithm predicts each test point from the most similar points in the training data, so its predictions depend heavily on the training set. As a result, in this study kNN models tended to be less ‘generalizable’ than the other models. NN’s lower performance is likely because neural networks have many fitted parameters and can therefore overfit. Both issues could be resolved with a larger training data set, or by using a training set that more closely resembles the test set.
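The dependence of kNN on its training set can be seen in a minimal sketch of kNN regression. This is an illustration only: unweighted averaging, Euclidean distance, and the default `k` are assumptions, not settings reported in the study.

```python
import math
from typing import Sequence


def knn_predict(train_x: Sequence[Sequence[float]], train_y: Sequence[float],
                query: Sequence[float], k: int = 3) -> float:
    """Predict by averaging the targets of the k closest training points."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    nearest = [train_y[i] for i in order[:k]]
    return sum(nearest) / len(nearest)


# Toy one-feature data where the target equals the feature value.
train_x = [(0.0,), (1.0,), (2.0,), (3.0,)]
train_y = [0.0, 1.0, 2.0, 3.0]

# A query far outside the training range still returns an average of
# training targets, so the prediction cannot exceed max(train_y).
far_prediction = knn_predict(train_x, train_y, (10.0,), k=2)
```

Because every prediction is an average of stored training targets, test conditions that differ from the training data are mapped back into the range already seen.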

The trained random forest models were used to determine the relative weight of the gas components on the production rate of each of the four products. The feature importance of a gas for a product’s production rate was determined by averaging the impurity reduction achieved when the value of that gas was used to split the decision trees. The analysis shows that butyrate’s production rate depends heavily on the concentration of CO in the feed gas, and that butanol’s production rate depends mainly on the concentration of H2. This agrees with previous findings, since CO is both a carbon source and an energy source, while H2 offers strong reducing power. H2 is the most influential substrate for butanol production because butanol synthesis requires more reducing cofactors than the other products. These feature analyses show how machine learning methods can use limited experimental data to ‘relearn’ and ‘redesign’ biosynthesis patterns.
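The impurity-based importance described above can be sketched for regression trees, where impurity is commonly taken to be the variance of the targets at a node. The code below is a simplified illustration under that assumption: it scores individual splits and normalizes the per-feature totals, and it omits the weighting by node size and the averaging across trees that a full random forest implementation would apply. The function names and data are hypothetical.

```python
from collections import defaultdict
from statistics import pvariance
from typing import Dict, Iterable, Sequence, Tuple


def impurity_reduction(targets: Sequence[float],
                       feature_values: Sequence[float],
                       threshold: float) -> float:
    """Variance decrease from splitting a node at feature <= threshold."""
    left = [y for x, y in zip(feature_values, targets) if x <= threshold]
    right = [y for x, y in zip(feature_values, targets) if x > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(targets)
    return (pvariance(targets)
            - len(left) / n * pvariance(left)
            - len(right) / n * pvariance(right))


def normalized_importance(split_records: Iterable[Tuple[str, float]]
                          ) -> Dict[str, float]:
    """Sum impurity reductions per feature and scale them to sum to 1."""
    totals: Dict[str, float] = defaultdict(float)
    for name, reduction in split_records:
        totals[name] += reduction
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {name: 0.0 for name in totals}
    return {name: value / grand_total for name, value in totals.items()}


# Hypothetical rates that track CO but not H2.
rates = [0.0, 0.0, 10.0, 10.0]
co = [0.1, 0.2, 0.8, 0.9]
h2 = [0.5, 0.9, 0.4, 0.8]
importance = normalized_importance([
    ("CO", impurity_reduction(rates, co, 0.5)),
    ("H2", impurity_reduction(rates, h2, 0.6)),
])
```

In this toy data the CO split removes all of the variance while the H2 split removes none, so CO receives the full importance.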

Implications. Syngas fermentations are highly dynamic and nonlinear, which makes them ideal targets for ML-based Model Predictive Control (MPC). This study evaluated the ability of six ML algorithms to predict syngas fermentation production rates from limited fermentation tests. SVM and RF performed best, while kNN and NN performed worst. The simpler ML algorithms, EN and LA, are ‘safer’ options because they have fewer fitted variables. In this study, EN and LA likely outperformed NN because of the limited training set size rather than because linear methods are more applicable to syngas fermentation. Generally, ML methods were more accurate for acid production rates than for alcohol production rates, indicating that alcohol production depends on features not captured here (e.g., metabolic shifts or other intrinsic biological factors). Time-course predictions based on rate predictions were more accurate than direct rate predictions. Additionally, feature importance reaffirmed guidelines for how gas composition can be used to control product profiles. Future studies can build on this work by increasing the amount of syngas fermentation data, by including new features that capture cell regulation and stress responses to bioreactor conditions, or by applying ensemble machine learning approaches.

  1. Beltramo, T., Ranzan, C., Hinrichs, J., & Hitzmann, B. (2016). Artificial neural network prediction of the biogas flow rate optimized with an ant colony algorithm. Biosystems Engineering, 143, 68-78.
  2. Wan, N., Sathish, A., You, L., Tang, Y. J., & Wen, Z. (2017). Deciphering Clostridium metabolism and its responses to bioreactor mass transfer during syngas fermentation. Scientific Reports, 7, 10090.