# (246c) Simultaneous Canonical Polyadic Decomposition As a Data Fusion Algorithm to Develop Pseudo-Chemistry from Spectral Data

#### AIChE Annual Meeting

#### 2019

#### Topical Conference: Applications of Data Science to Molecules and Materials

#### Applications of Data Science in Catalysis and Reaction Engineering I

#### Tuesday, November 12, 2019 - 8:36am to 8:54am

Keywords: Data Mining, Tensor Decomposition, Data Fusion, Bayesian Networks, Reaction Pathway Generation

In-line spectral analyzers are widely used to obtain molecular-level information because they are fast, non-invasive, non-destructive, inexpensive, and require no sample preparation. The process data they produce are high-dimensional, multi-way, non-causal, rank-deficient, and contain missing values, which makes it challenging to use such data to develop causal models that hypothesize reaction pathways. This work uses spectral datasets from Fourier transform infrared (FTIR) and proton nuclear magnetic resonance (¹H NMR) spectroscopy of samples from the vis-breaking of Cold Lake bitumen. The absorbance data over wavenumbers (FTIR) and chemical shifts (¹H NMR) are collected for samples reacted at different temperatures and residence times in the vis-breaker. Each of the two spectral measurements, arranged along three modes (temperature, residence time, and wavenumber or chemical shift), forms a three-way tensor block of process data. The objective of this work is to develop a data fusion framework that preserves the trilinear structure of these tensor blocks, accounts for missing spectral measurements and for the high dimensionality of the absorbance data across wavenumbers and chemical shifts, and incorporates the complementary information about the sample from both techniques. This is accomplished through simultaneous canonical polyadic decomposition (CPD), in which the tensor blocks of spectral data are jointly factorized into independent factors in each mode while capturing intermodal interactions, resulting in a unique factorization that is free from rotational and intensity ambiguities [1].
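One way to write this joint model down is sketched below. The notation is introduced here for illustration and is not taken from the abstract: x^IR and x^NMR are entries of the FTIR and ¹H NMR tensors, i indexes temperature, j residence time, k wavenumber and l chemical shift, R is the number of factors, and w is a binary weight that is 1 for an observed entry and 0 for a missing one. The sketch assumes that the temperature factors (a) and residence-time factors (b) are shared by both blocks, which is one common way of coupling tensors; the abstract does not state which modes are coupled.

```latex
\begin{aligned}
x^{\mathrm{IR}}_{ijk} &\approx \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c^{\mathrm{IR}}_{kr},
\qquad
x^{\mathrm{NMR}}_{ijl} \approx \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c^{\mathrm{NMR}}_{lr}, \\[4pt]
\min_{A,\,B,\,C^{\mathrm{IR}},\,C^{\mathrm{NMR}}}\;
& \tfrac{1}{2} \sum_{i,j,k} w^{\mathrm{IR}}_{ijk}
  \Bigl( x^{\mathrm{IR}}_{ijk} - \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c^{\mathrm{IR}}_{kr} \Bigr)^{2}
  + \tfrac{1}{2} \sum_{i,j,l} w^{\mathrm{NMR}}_{ijl}
  \Bigl( x^{\mathrm{NMR}}_{ijl} - \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c^{\mathrm{NMR}}_{lr} \Bigr)^{2}
\end{aligned}
```

Under this assumption the shared factors carry the information common to both instruments, while each block keeps its own spectral signatures.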

The process data jointly mined by the data fusion algorithm are then used to develop inferential models for monitoring the complex vis-breaking process by constructing pseudo-reaction networks that hypothesize chemical pathways [2]. The reacting mixture is so difficult to characterize analytically that even enumerating all the major species taking part in the reactions is a significant challenge for conventional methods. The networks are built within a probabilistic framework in which the independent factors of a mode obtained from the simultaneous CPD are viewed as random variables with a multinomial distribution (the parameters of which have a Dirichlet distribution). Bayesian networks, which are probabilistic graphical models that encode directed acyclic causal structures among the nodes representing these random variables, are used to develop inferential hypotheses about chemical pathways from the factors obtained by data fusion of the spectral measurements. A directed edge is placed between two nodes if it maximizes the log likelihood, which is a function of the mutual information and entropy calculated from the probability distributions of the random variables designated as nodes (the factors obtained from CPD). This amounts to using the Bayesian information criterion (BIC) as a score: the log likelihood of the entire network (pairwise directed edges between nodes) penalized by the complexity of the network (number of edges between nodes). Score-based heuristic greedy search methods, which make locally optimal choices while checking whether a directed edge between a pair of nodes increases the penalized log likelihood, are used to obtain the Bayesian network, i.e., the directed acyclic graph (DAG) encoding causal relationships among the factors obtained from the simultaneous CPD. During data fusion, the independent factors in each mode are constrained to be non-negative so that the decomposition is physically meaningful and complies with the Beer-Lambert law for spectral data, which states that absorbance is directly proportional to the concentration of the absorbing components. Hence each independent factor from the data fusion algorithm can be physically interpreted as representing a class of chemical compounds (a pseudo-component). The trilinear tensor decomposition yields the concentrations of these pseudo-components in the temperature and residence time modes, while the third mode (wavenumbers or chemical shifts) contains the spectral signatures of the corresponding pseudo-components.
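The score-based search described in this paragraph can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`family_bic`, `creates_cycle`, `greedy_bic_search`) are hypothetical, the data are assumed to be already discretized into integer states `0 .. n_states-1` (the abstract does not say how the CPD factors are discretized), and the search only adds edges, whereas practical hill-climbing searches usually also try deletions and reversals. The penalty counts free parameters, which grows with the number of edges.

```python
import itertools
import numpy as np

def family_bic(data, child, parents, n_states):
    """BIC contribution of one node given a candidate parent set."""
    n = data.shape[0]
    # Encode each sample's joint parent configuration as a single integer.
    config = np.zeros(n, dtype=int)
    for p in parents:
        config = config * n_states + data[:, p]
    q = n_states ** len(parents)              # number of parent configurations
    counts = np.zeros((q, n_states))
    np.add.at(counts, (config, data[:, child]), 1)
    row_tot = counts.sum(axis=1, keepdims=True)
    # Maximum-likelihood log likelihood; empty cells contribute zero.
    with np.errstate(divide="ignore", invalid="ignore"):
        ll = np.where(counts > 0, counts * np.log(counts / row_tot), 0.0).sum()
    penalty = 0.5 * np.log(n) * q * (n_states - 1)
    return ll - penalty

def creates_cycle(parents, child, new_parent):
    """True if adding new_parent -> child would close a directed cycle."""
    # That happens exactly when child is already an ancestor of new_parent.
    stack, seen = [new_parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def greedy_bic_search(data, n_states):
    """Add, one at a time, the edge with the largest BIC gain until none helps."""
    d = data.shape[1]
    parents = {v: [] for v in range(d)}
    score = {v: family_bic(data, v, [], n_states) for v in range(d)}
    while True:
        best_gain, best_edge = 0.0, None
        for u, v in itertools.permutations(range(d), 2):
            if u in parents[v] or creates_cycle(parents, v, u):
                continue
            gain = family_bic(data, v, parents[v] + [u], n_states) - score[v]
            if gain > best_gain:
                best_gain, best_edge = gain, (u, v)
        if best_edge is None:
            return parents                    # local optimum of the total score
        u, v = best_edge
        parents[v].append(u)
        score[v] += best_gain

# Toy data: column 1 copies column 0, column 2 is unrelated to both.
toy = np.column_stack([np.tile([0, 1], 20),
                       np.tile([0, 1], 20),
                       np.tile([0, 0, 1, 1], 10)])
print(greedy_bic_search(toy, n_states=2))
```

On the toy data the search links the two identical columns and leaves the unrelated one isolated. The direction of that single edge is not identified by the score, which is a reminder that a learned DAG is a hypothesis about pathways rather than a proof of causation.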

The number of independent factors in the CPD is determined using the core consistency diagnostic [3]: PARAFAC models are fitted with candidate numbers of factors and then cast as Tucker models, whose core should be an identity hypercube (ones on the superdiagonal, zeros elsewhere) if the right number of components is used. This technique exploits the fact that tensor decompositions are higher-order extensions of the matrix singular value decomposition (SVD) and broadly fall into two categories: parallel factor decomposition (PARAFAC), which represents a tensor as a sum of rank-1 tensors, and Tucker decomposition, which is a higher-order SVD. The main difference between PARAFAC and Tucker is that the number of factors is the same in every mode in the former, making it a restricted version of the latter. When this diagnostic was applied to the tensor blocks of FTIR and ¹H NMR data in this work, the number of factors obtained was four. The spectral signatures of the four pseudo-components obtained from the data fusion algorithm (Fig. 1) were then used to develop Bayesian networks (Fig. 2).
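As a rough illustration of the core consistency diagnostic, the sketch below computes the least-squares Tucker core for a given set of PARAFAC loadings and compares it with the ideal superdiagonal core. The function name `core_consistency` is hypothetical, the loadings are assumed to have full column rank, and fitting the PARAFAC model itself is left out; here the loadings of an exactly trilinear synthetic tensor stand in for a fitted model.

```python
import numpy as np

def core_consistency(X, A, B, C):
    """Core consistency (in percent) of a three-way PARAFAC model of X."""
    R = A.shape[1]
    # Least-squares Tucker core for fixed loadings:
    # G = X x1 pinv(A) x2 pinv(B) x3 pinv(C).
    G = np.einsum("pi,qj,rk,ijk->pqr",
                  np.linalg.pinv(A), np.linalg.pinv(B), np.linalg.pinv(C), X)
    # Ideal PARAFAC core: ones on the superdiagonal, zeros elsewhere.
    T = np.zeros((R, R, R))
    idx = np.arange(R)
    T[idx, idx, idx] = 1.0
    return 100.0 * (1.0 - np.sum((G - T) ** 2) / R)

# Synthetic check: an exactly trilinear tensor with two components.
rng = np.random.default_rng(0)
A, B, C = rng.random((5, 2)), rng.random((4, 2)), rng.random((6, 2))
X = np.einsum("ir,jr,kr->ijk", A, B, C)
print(core_consistency(X, A, B, C))   # close to 100 for a valid trilinear model
```

In practice the value is tracked as the number of components increases, and the largest number that still gives a value near 100 is retained.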

*Figure 1. Spectral signatures of pseudo-components obtained from the data fusion algorithm*

*Figure 2. Bayesian networks developed from the spectral signatures of the pseudo-components (PC)*

The analysis of the spectral signatures reveals the following: pseudo-component 1 (PC₁) mainly consists of carbonyl groups and cycloalkanes; pseudo-component 2 (PC₂) consists of polyaromatics, alkoxy groups, phenols, and alkenes; pseudo-component 3 (PC₃) consists of aromatics, alkanes, and condensed products; and pseudo-component 4 (PC₄) consists of phenols, acyls, and condensed aromatics. It can therefore be hypothesized from the Bayesian network structure that the underlying chemical reaction pathways during the vis-breaking of bitumen tend toward more saturated end products through the free radical mechanism of hydrogen radical addition. However, the longer-chain aliphatics crack to give more condensed polyaromatic products, which are undesirable, even as the end products of thermal cracking have a more aliphatic nature (alkanes and olefins).

The novelty of this work lies in the implementation of a constrained simultaneous CPD of multiple tensors using an optimization approach [4] that solves for the decision variables (the factors) using the gradients of the objective function, which is formulated as the reconstruction error of the tensors from their multi-modal factors. This is an improvement over the canonical polyadic alternating least squares (CPALS) approach, which is typically used for the unconstrained decomposition of a single tensor block and is not guaranteed to converge to a stationary point, which limits its accuracy. In addition, the simultaneous CPD developed in this work handles missing data through a weighting matrix in the objective function of the optimization framework. The causal DAG among the factors obtained by jointly decomposing the two tensors (FTIR and ¹H NMR measurements) is developed using Bayesian networks and is representative of the underlying chemical pathways among the pseudo-components. By jointly mining spectral measurements within a constrained data fusion framework, this work makes the factors physically interpretable and provides a first pass at building causal inferential models that generate reaction network hypotheses from process data (spectral measurements).
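A gradient-based, weighted, non-negative joint decomposition of this kind can be sketched as follows. This is a minimal illustration, not the authors' algorithm: it assumes the first two modes (temperature and residence time) share factors `A` and `B` across tensors, uses a plain projected gradient step with a simple step-halving heuristic in place of the nonlinear conjugate gradient or quasi-Newton solvers typically used with [4], and all names (`cp`, `loss_and_grads`, `fit`) are hypothetical. Missing entries are marked by zeros in the weight arrays `Ws`.

```python
import numpy as np

def cp(A, B, C):
    """Rebuild a three-way tensor from its CP factor matrices."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def loss_and_grads(Xs, Ws, A, B, Cs):
    """Weighted joint reconstruction error and its gradients."""
    loss, gA, gB, gCs = 0.0, np.zeros_like(A), np.zeros_like(B), []
    for X, W, C in zip(Xs, Ws, Cs):
        # Residual on observed entries only (W is 1 if observed, 0 if missing).
        E = W * (cp(A, B, C) - np.nan_to_num(X))
        loss += 0.5 * np.sum(E ** 2)
        gA += np.einsum("ijk,jr,kr->ir", E, B, C)   # shared mode: gradients add up
        gB += np.einsum("ijk,ir,kr->jr", E, A, C)
        gCs.append(np.einsum("ijk,ir,jr->kr", E, A, B))
    return loss, gA, gB, gCs

def fit(Xs, Ws, rank, n_iter=200, seed=0):
    """Projected gradient descent with non-negative factors."""
    rng = np.random.default_rng(seed)
    I, J = Xs[0].shape[:2]
    A, B = rng.random((I, rank)), rng.random((J, rank))
    Cs = [rng.random((X.shape[2], rank)) for X in Xs]
    step, history = 1e-2, []
    loss, gA, gB, gCs = loss_and_grads(Xs, Ws, A, B, Cs)
    for _ in range(n_iter):
        history.append(loss)
        # Move against the gradient, then clip at zero to stay non-negative.
        A2 = np.maximum(A - step * gA, 0.0)
        B2 = np.maximum(B - step * gB, 0.0)
        Cs2 = [np.maximum(C - step * g, 0.0) for C, g in zip(Cs, gCs)]
        trial = loss_and_grads(Xs, Ws, A2, B2, Cs2)
        if trial[0] < loss:                 # accept only steps that reduce the loss
            A, B, Cs = A2, B2, Cs2
            loss, gA, gB, gCs = trial
            step *= 1.2
        else:
            step *= 0.5
    history.append(loss)
    return A, B, Cs, history

# Synthetic example: two tensors sharing A and B, 20% of the first one missing.
rng = np.random.default_rng(1)
A0, B0 = rng.random((4, 2)), rng.random((3, 2))
X1, X2 = cp(A0, B0, rng.random((6, 2))), cp(A0, B0, rng.random((5, 2)))
W1 = (rng.random(X1.shape) > 0.2).astype(float)
A, B, Cs, history = fit([X1, X2], [W1, np.ones_like(X2)], rank=2)
print(history[0], history[-1])
```

Because all factors are updated together from one objective, constraints and weights enter in a single place, which is the practical appeal of the all-at-once formulation over alternating updates.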

**References**

[1] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” *SIAM Rev.*, vol. 51, no. 3, pp. 455–500, 2009.

[2] D. T. Tefera, L. M. Yañez Jaramillo, R. Ranjan, C. Li, A. De Klerk, and V. Prasad, “A Bayesian learning approach to modeling pseudoreaction networks for complex reacting systems: Application to the mild visbreaking of bitumen,” *Ind. Eng. Chem. Res.*, vol. 56, no. 8, pp. 1961–1970, 2017.

[3] R. Bro and H. A. Kiers, “A new efficient method for determining the number of components in PARAFAC models,” *J. Chemometrics*, vol. 17, pp. 274–286, 2003.

[4] E. Acar, D. M. Dunlavy, and T. G. Kolda, “A scalable optimization approach for fitting canonical tensor decompositions,” *J. Chemometrics*, vol. 25, no. 2, pp. 67–86, 2011.
