(202l) GTM Semi-Supervised Approach for State Recognition in Dynamic Data | AIChE


Authors

Escobar, M. - Presenter, The University of Tokyo
Funatsu, K., The University of Tokyo
Kaneko, H., The University of Tokyo



GTM semi-supervised approach for state recognition in dynamic data

M. S. Escobar, H. Kaneko and K. Funatsu

Department of Chemical System Engineering
The University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-8656, Japan.

e-mail: {mescobar, hkaneko, funatsu}@chemsys.t.u-tokyo.ac.jp

In chemical plants, it is fundamental to understand the existing relationships between variables and between samples along a dynamic data series. When the number of variables is within our visible range (2D or 3D), analysis can be performed easily. When the total number of features involved is much higher, however, assessing how the samples and variables in the data are truly connected can be a complex task.

Among unsupervised methodologies, where only input data sets are used, Generative Topographic Mapping (GTM) is a widely used technique for visualizing sampled data with many input variables. It is a probabilistic non-linear approach in which a low-dimensional latent space grid, usually 2D, is represented as a manifold in the high-dimensional original data space. Since a manifold alone cannot represent the original data exactly, each point on the manifold is treated as carrying Gaussian-shaped noise. To model these transitions while keeping the computational load reasonable, the mapping is built from several radial basis functions (RBFs), and its parameters are determined via the so-called Expectation-Maximization (EM) algorithm (Bishop, Svensén, and Williams 1998). Once the map is trained, the likelihood of each sample belonging to each latent point can be determined. The mean or mode plots of this distribution give the desired visualization, where similar data eventually cluster together. Determining these clusters can sometimes be difficult, however, since GTM involves many parameters and the clusters may not be entirely separated. Available output data would help in such visualization, giving a hint of where similar data might be. Formally, GTM defines a non-linear mapping y(z, W) from an L-dimensional latent space to a D-dimensional data space X, which traces out a non-Euclidean manifold. Here z denotes the latent points and W is the matrix that weights the pre-set RBFs. The grid of points in latent space is thus mapped to the centers of the Gaussian components in the D-dimensional space.
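As a minimal sketch of this mapping step, the snippet below evaluates y(z, W) = W'phi(z) for a hypothetical 2-D latent grid mapped into a 3-D data space. The grid size, RBF centers, width, and the random W are all illustrative assumptions; in actual GTM, W (and the noise variance) would be tuned by the EM algorithm, not drawn at random.

```python
import numpy as np

def rbf_basis(z, centers, width):
    """Evaluate Gaussian radial basis functions phi(z) on latent points z."""
    # z: (N, L) latent grid, centers: (M, L) RBF centers
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

rng = np.random.default_rng(0)

# hypothetical 10x10 latent grid in 2-D (L = 2)
side = np.linspace(-1.0, 1.0, 10)
z = np.array([[a, b] for a in side for b in side])                # (100, 2)

# hypothetical 3x3 grid of RBF centers
cs = np.linspace(-1.0, 1.0, 3)
centers = np.array([[a, b] for a in cs for b in cs])              # (9, 2)

phi = rbf_basis(z, centers, width=1.0)                            # (100, 9)
W = rng.standard_normal((9, 3))   # weight matrix; fitted by EM in real GTM

# y(z, W): each latent grid point becomes a Gaussian center in data space
Y = phi @ W                                                       # (100, 3)
print(Y.shape)
```

Each row of Y is one latent grid point pushed onto the manifold; GTM then places an isotropic Gaussian around each such center to model the data.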

GTM assumes, however, that the data are independently and identically distributed (i.i.d.), which means that no sample is related in any way to the others. When clearly unrelated data are used, as in a census, this assumption is fair. For dynamic data, though, samples closer in time are evidently more related than those far before or after. Strategies have therefore been developed to overcome this limitation. One of them is GTM through time (GTM-TT), where state transitions over time are taken into account using Hidden Markov Models (HMMs) (Bishop, Hinton, and Strachan 1997). An HMM relies on states, which represent events or regions in a system, and observations, which are the values associated with those states. In GTM-TT, the hidden states are the latent points and the observations are the input data. Assuming temporal data, the transition probabilities from one hidden state (latent point) to another are optimized along with the usual GTM parameters, using the GTM mixture distribution as the emission probabilities that relate the latent space to the original data space. In the end, the state sequence is superposed on a single GTM map.
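The state-sequence idea can be illustrated with a small Viterbi decoding sketch: latent points act as HMM hidden states, and Gaussian densities around their images in data space act as emissions. The three 1-D "manifold centers", the noise level, and the sticky transition matrix below are toy assumptions for illustration, not the trained GTM-TT parameters.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely hidden-state path; log_B[t, k] = log emission density of x_t under state k."""
    T, K = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A     # (K, K): previous state -> current state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

rng = np.random.default_rng(1)

# toy setting: 3 latent "states" whose emissions are 1-D Gaussians around assumed centers
mu = np.array([[0.0], [3.0], [6.0]])
x = np.concatenate([rng.normal(m, 0.3, 5) for m in mu.ravel()])[:, None]   # (15, 1)

log_B = -0.5 * ((x - mu.T) ** 2) / 0.3 ** 2          # log-likelihood up to a constant
log_pi = np.log(np.full(3, 1.0 / 3.0))
log_A = np.log(np.full((3, 3), 0.05) + np.eye(3) * 0.85)   # sticky transitions

path = viterbi(log_pi, log_A, log_B)
print(path)
```

In GTM-TT the emission term comes from the full GTM mixture and the transition matrix is itself optimized by EM; the decoded path is what gets superposed on the map.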

When it comes to statistical data analysis, two different approaches can be used for data understanding: unsupervised learning, where only input variables are used for analysis, and supervised learning, where inputs and outputs (labels) work together. Gathering those labels, though, can be a rather troublesome matter. Output variables are usually measured offline, since they may represent properties that can only be measured in a laboratory with specific equipment. At other times, a good measurement is simply too time-consuming or too complex to implement. In the end, there are usually more input data available than output data.

To cope with this discrepancy, a hybrid approach called semi-supervised learning has been explored in recent years (Chapelle, Schölkopf, and Zien 2010), where both kinds of data are used simultaneously for an even better understanding of a system. Since it is a mixed methodology, it can serve both supervised learning applications, such as regression and classification, and unsupervised learning applications, such as clustering, pattern recognition, and anomaly detection.

This work proposes using GTM and its variations within a semi-supervised approach, where output data help with data clustering; by achieving that, anomaly detection and state recognition features can be explored.

Two main approaches are taken into consideration: improving cluster discrimination before and after the GTM-TT map is trained. More precisely, one viable approach is to feed GTM-TT not only the input data but also the available Y data, and check whether the overall cluster discrimination improves. This approach thus alters the GTM-TT structure before training. Evidently, the amount of Y data will not match the amount of input data, but this can be overcome because GTM-TT can handle missing data. Several works have addressed this matter, though only for input values (Vellido 2006; Vellido et al. 2007). The missing values act as additional GTM-TT parameters to be optimized along with the others still to be tuned. In the end, the analysis consists of assessing the accuracy and the improvement in cluster discrimination for different percentages of missing data.

Another approach, useful when the map has already been trained, is to apply existing semi-supervised clustering methods. By assuming that each cluster is associated with one Gaussian process, for example, it is possible to find clusters around the output values, using input data for parameter estimation (Zhu and Goldberg 2009). If this technique were applied to GTM-TT maps, the outputs, connected with their respective inputs, could be represented on the GTM-TT map, and clustering could then be improved based on those reference points. So far, preliminary results have been obtained with a simplified semi-supervised approach called "cluster then label" (Zhu and Goldberg 2009). Given a data set where unlabeled data far outnumber labeled data, an unsupervised method first clusters all the data; a supervised method is then applied within each cluster, using only the samples that have labels. In this way, features arising from both the unsupervised and the supervised approach can be exploited. In this work, GTM-TT plots were coupled with this approach to check the improvement in classification and cluster discrimination. Average Linkage Clustering (ALC) and a 5-fold Support Vector Machine were used as the unsupervised and supervised methods, respectively, for state recognition (classification) over time in a six-tank laboratory plant. This system has four inputs related to the water flow delivered to those tanks. As for the states, the level of each tank can be considered one state; for now, only one level was taken into consideration.
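The "cluster then label" procedure can be sketched as below, using scikit-learn's average-linkage agglomerative clustering for the ALC step and a standard SVM per cluster. The synthetic blobs, the choice of which samples carry labels, and the two-cluster setting are all illustrative assumptions; they stand in for the plant data and GTM-TT clusters, not reproduce them.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# hypothetical data: two well-separated blobs, each containing two sub-classes
X = np.vstack([rng.normal(c, 0.2, (40, 2))
               for c in ([0, 0], [0, 1], [5, 0], [5, 1])])        # (160, 2)
y = np.repeat([0, 1, 0, 1], 40)
labeled = np.arange(0, 160, 8)        # only every 8th sample carries a label

# step 1: cluster ALL samples (average linkage, as in the ALC step)
clusters = AgglomerativeClustering(n_clusters=2,
                                   linkage="average").fit_predict(X)

# step 2: train one supervised model per cluster on its labeled members,
# then use it to label every member of that cluster
y_pred = np.empty(len(X), dtype=int)
for c in np.unique(clusters):
    members = np.where(clusters == c)[0]
    train = np.intersect1d(members, labeled)
    clf = SVC().fit(X[train], y[train])
    y_pred[members] = clf.predict(X[members])

accuracy = (y_pred == y).mean()
print(accuracy)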

For this analysis, four different approaches were considered. First, as a standard analysis, supervised classification was applied using the complete set of inputs and their respective outputs (SP). Second, still within the supervised approach, a smaller data set of inputs with their respective outputs was extracted from the original data set, and the resulting model was tested against all available inputs (SmaP). Third, only inputs were used, characterizing unsupervised prediction (UP). Finally, all inputs and the same output data set as in SmaP were used for the aforementioned semi-supervised approach (SSP). The classification results are shown in Table 1.

Table 1. Classification accuracy for supervised, unsupervised and semi-supervised approaches.

Method   Accuracy
SP       0.9983
SmaP     0.8266
UP       0.6717
SSP      0.9747

These results show a clear improvement in final accuracy when the proposed semi-supervised learning approach is used, indicating, albeit still tentatively, the potential of this technique for the upcoming research. Overall results obtained with different techniques, covering both unsupervised clustering methods and supervised classification methods, will be compared and presented with regard to discrimination accuracy applied to anomaly detection.

REFERENCES

Bishop, C. M., G. E. Hinton, and I. G. D. Strachan. 1997. "GTM Through Time." In Proceedings of the IEE Fifth International Conference on Artificial Neural Networks. Cambridge, U.K.

Bishop, C. M., M. Svensén, and C. K. I. Williams. 1998. "GTM: The Generative Topographic Mapping." Neural Computation 10 (1): 215-234.

Chapelle, O., B. Schölkopf, and A. Zien. 2010. Semi-Supervised Learning. MIT Press.

Vellido, A. 2006. "Missing data imputation through GTM as a mixture of t-distributions." Neural Networks 19 (10): 1624-1635. doi: http://dx.doi.org/10.1016/j.neunet.2005.11.003.

Vellido, A., E. Martí, J. Comas, I. Rodríguez-Roda, and F. Sabater. 2007. "Exploring the ecological status of human altered streams through Generative Topographic Mapping." Environmental Modelling & Software 22 (7): 1053-1065. doi: http://dx.doi.org/10.1016/j.envsoft.2006.06.005.

Zhu, X., and A. B. Goldberg. 2009. Introduction to Semi-Supervised Learning. Morgan & Claypool Publishers.

