(202l) GTM Semi-Supervised Approach for State Recognition in Dynamic Data
2013 AIChE Annual Meeting
Computing and Systems Technology Division
Poster Session: Systems and Process Control
Monday, November 4, 2013, 3:15pm to 5:45pm
M. S. Escobar, H. Kaneko and K. Funatsu
Department of Chemical System Engineering
The University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-8656, Japan.
email: {mescobar, hkaneko, funatsu}@chemsys.t.u-tokyo.ac.jp
In chemical plants, it is fundamental to understand the existing relationships between variables and between samples along a given dynamic data series. When the number of variables is within our visible range (2-D or 3-D), analysis can be performed easily. When the total number of features involved is much higher, though, assessing how the samples and variables in the data are truly connected can be a complex task.
When it comes to unsupervised methodologies, where only input data sets are used, Generative Topographic Mapping (GTM) is a widely used technique for visualizing sampled data with several input variables. It is a probabilistic nonlinear approach in which a low-dimensional latent space grid, usually 2-D, is represented as a manifold in the high-dimensional original data space. Since the manifold alone does not fully represent the original data, each point on it is said to carry noise shaped as a Gaussian function. To cope with these transitions while keeping the computational load reasonable, a function composed of several radial basis functions (RBFs) is constructed, and its parameters are determined via the so-called Expectation-Maximization (EM) algorithm (Bishop, Svensén, and Williams 1998). Once the map is trained, it is possible to determine, for each sample, the likelihood of it belonging to each latent point. The mean or mode plot of this distribution gives the desired visualization, where similar data eventually cluster together. Determining these clusters can be difficult, however, since GTM involves a large number of parameters and the clusters might not be entirely separated. Available output data would help in such visualization, giving a hint of where similar data might lie.

GTM defines a nonlinear mapping y(z, W) from an L-dimensional latent space to a D-dimensional data space X, tracing a non-Euclidean manifold. y(z, W) is the function that connects both spaces, depending on the latent points z and the matrix W, which weights the preset RBFs. The grid of points in latent space is then mapped to the centers of the Gaussian components in the D-dimensional space.
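The mapping above can be sketched in a few lines of NumPy. This is only an illustrative toy, not the authors' implementation: grid size, RBF centers, width, and the weight matrix W are arbitrary assumptions here, and in real GTM the entries of W would be fitted by EM rather than drawn at random.

```python
import numpy as np

def rbf_features(z, centers, width):
    """phi(z): Gaussian RBF activations for each latent point in z."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

rng = np.random.default_rng(0)
side = 10                                      # 10 x 10 latent grid
g = np.linspace(-1, 1, side)
z = np.array([(a, b) for a in g for b in g])   # K = 100 latent points, L = 2
centers = z[::25]                              # a few RBF centers (assumed)
W = rng.normal(size=(centers.shape[0], 3))     # toy weights into a D = 3 data space

phi = rbf_features(z, centers, width=0.5)      # (K, n_rbf) basis activations
Y = phi @ W                                    # y(z, W): manifold points, i.e. the
                                               # Gaussian component centers in data space
print(Y.shape)
```

Each row of Y is the center of one Gaussian component in the data space, so the latent grid structure is carried over into the D-dimensional manifold.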
GTM assumes, however, that the data are independently and identically distributed (i.i.d.), meaning that no sample is related in any way to the others. When clearly unrelated data are being used, as in a census, this assumption is fair. For dynamic data, though, samples closer in time are evidently more related than those far before or after. Strategies have therefore been developed to overcome this limitation. One of them is GTM through time (GTM-TT), where state transitions over time are taken into account using Hidden Markov Models (HMMs) (Bishop, Hinton, and Strachan 1997). An HMM relies on states, which represent events or regions in a system, and observations, which are the values associated with those states. In GTM-TT, the hidden states are the latent points and the observations are the input data. Assuming temporal data, the transition probabilities from one hidden state (latent point) to another are optimized along with the usual GTM parameters, using the GTM mixture distribution as emission probabilities to relate the latent space to the original data space. In the end, the state sequence is superposed on a single GTM map.
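The HMM machinery underlying GTM-TT can be illustrated with a scaled forward pass: hidden states are latent points, and emissions are isotropic Gaussians centered on the mapped manifold points. This is a hedged sketch with made-up values; in GTM-TT the transition matrix A, the means, and the inverse variance beta would all be optimized by EM together with W, not fixed as they are here.

```python
import numpy as np

def forward_loglik(X, A, pi, means, beta):
    """Scaled HMM forward pass: log-likelihood of sequence X under
    Gaussian emissions centered on `means` with precision `beta`."""
    K, D = means.shape
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)     # (T, K)
    B = (beta / (2 * np.pi)) ** (D / 2) * np.exp(-0.5 * beta * d2)
    alpha = pi * B[0]
    ll = 0.0
    for t in range(1, len(X)):
        c = alpha.sum()
        ll += np.log(c)
        alpha = ((alpha / c) @ A) * B[t]        # propagate and apply emission
    return ll + np.log(alpha.sum())

rng = np.random.default_rng(1)
K, D, T = 4, 2, 50
A = np.full((K, K), 1.0 / K)                    # uniform transitions (assumed)
pi = np.full(K, 1.0 / K)
means = rng.normal(size=(K, D))                 # stand-ins for mapped latent points
X = means[rng.integers(0, K, T)] + 0.1 * rng.normal(size=(T, D))
ll = forward_loglik(X, A, pi, means, beta=10.0)
print(ll)
```

In GTM-TT the same forward-backward quantities drive the EM updates, and the most likely state sequence is what gets superposed on the map.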
When it comes to statistical data analysis, though, two different approaches can be used for data understanding: unsupervised learning, where only input variables are used for analysis, and supervised learning, where inputs and outputs (labels) work together. Gathering those labels, though, can be rather troublesome. Output variables are usually measured offline, since they may represent properties that can only be measured in a laboratory with specific equipment. Other times, a good measurement is simply too time-consuming or too complex to implement. In the end, more input data are usually available than output data.
To cope with this discrepancy, a hybrid approach called semi-supervised learning has been explored in recent years (Chapelle, Schölkopf, and Zien 2010), where both kinds of data can be used simultaneously for an even better understanding of a system. Since it is a mixed methodology, it can be used both for supervised learning applications, such as regression and classification, and for unsupervised learning applications, such as clustering, pattern recognition and anomaly detection.
This work's proposal focuses on using GTM and its variations with a semi-supervised approach, where output data can help with data clustering; by achieving that, anomaly detection and state recognition features can be explored.
Two main approaches are taken into consideration: improving cluster discrimination before and after the GTM-TT map is trained. More specifically, one viable approach is to consider not only the input data for GTM-TT but also the available Y data, checking whether the overall cluster discrimination improves. This approach would therefore alter the GTM-TT structure before training. Evidently, the amount of Y data will not match the amount of input data, which can be overcome because GTM-TT can cope with missing data. Several works have been published on this matter, but only for input values (Vellido 2006; Vellido et al. 2007). The missing values act as GTM-TT parameters to be optimized along with the other parameters yet to be tuned. In the end, the analysis consists of assessing the accuracy and the improvement in cluster discrimination for different percentages of missing data.
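The idea of treating missing values as parameters re-estimated inside EM can be shown in a deliberately simplified toy: a single 1-D Gaussian whose mean is unknown and 30% of whose entries are missing. The missing entries are imputed with the current mean (E-step) and the mean is then re-estimated (M-step). This is only an analogy for the mixture-level treatment used in the cited works, with all data and settings invented here.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(5.0, 1.0, 200)          # toy ground truth, mean 5
mask = rng.random(200) < 0.3           # roughly 30% of entries missing
x_obs = np.where(mask, np.nan, x)

mu = 0.0                               # initial guess for the mean
for _ in range(20):
    filled = np.where(mask, mu, x_obs) # E-step: impute missing entries with mu
    mu = filled.mean()                 # M-step: re-estimate the mean
print(mu)
```

The fixed point is the mean of the observed entries; in GTM-TT the analogous update runs over the whole mixture, with the imputed values participating in every EM iteration.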
Another approach, useful once the map has already been trained, is to apply existing semi-supervised clustering methods. By assuming that each cluster is associated with one Gaussian process, for example, it is possible to find clusters around the output values, using the input data for parameter estimation (Zhu and Goldberg 2009). If this technique were applied to GTM-TT maps, the outputs connected with their respective inputs could be represented on the map, and clustering could then be improved based on those reference points. So far, preliminary results have been obtained with a simplified semi-supervised approach called "cluster then label" (Zhu and Goldberg 2009). Given a data set where unlabeled data far outnumber labeled data, an unsupervised method first clusters all the data; a supervised method is then applied within each cluster, using only the samples that have labels. In this way, features arising from both the unsupervised and the supervised approaches can be exploited. In this work, GTM-TT plots were coupled with this approach to check the improvement in classification and cluster discrimination. Average Linkage Clustering (ALC) and a 5-fold Support Vector Machine were used as the unsupervised and supervised methods, respectively, for state recognition (classification) over time in a six-tank laboratory plant. This system has four inputs related to the water flow delivered to the tanks. As for the states, the level of each tank can be considered one state; for now, only one level was taken into consideration.
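The "cluster then label" recipe can be sketched on synthetic data: average-linkage clustering on all inputs, then a supervised SVM fitted per cluster on the few labeled samples. Everything below is illustrative (two well-separated toy states, five labels per state), not the six-tank plant data or the authors' exact pipeline.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# two toy "states", 50 samples each, in a 2-D input space
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y_true = np.repeat([0, 1], 50)
# only 5 labeled samples per state; the rest are treated as unlabeled
labelled = np.concatenate([rng.choice(50, 5, replace=False),
                           50 + rng.choice(50, 5, replace=False)])

# step 1: unsupervised average-linkage clustering on all inputs
clusters = AgglomerativeClustering(n_clusters=2, linkage="average").fit_predict(X)

# step 2: supervised classification inside each cluster, labels only
y_pred = np.empty(len(X), dtype=int)
for c in np.unique(clusters):
    members = np.flatnonzero(clusters == c)
    lab = np.intersect1d(members, labelled)
    if len(np.unique(y_true[lab])) > 1:
        clf = SVC().fit(X[lab], y_true[lab])   # SVM trained on labeled members
        y_pred[members] = clf.predict(X[members])
    else:
        y_pred[members] = y_true[lab][0]       # single-class cluster: majority label
print((y_pred == y_true).mean())
```

With well-separated clusters, the few labels propagate to the whole cluster, which is exactly the mechanism exploited on the GTM-TT map.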
For this analysis, four different approaches were considered. Initially, as a standard analysis, supervised classification was applied using a complete set of inputs and their respective outputs (SP). Second, keeping the supervised approach, a smaller data set of inputs with their respective outputs was extracted from the original data set, and its model was tested against all available inputs (SmaP). Third, only inputs were used, characterizing unsupervised prediction (UP). Finally, all inputs and the same output data set from SmaP were used for the aforementioned semi-supervised approach (SSP). The results related to the quality of classification can be seen in Table 1.
Table 1. Classification accuracy for supervised, unsupervised and semi-supervised approaches.

Method   Accuracy
SP       0.9983
SmaP     0.8266
UP       0.6717
SSP      0.9747
These results show a clear improvement in the final accuracy when the proposed semi-supervised learning approach is used, indicating, albeit still speculatively, the potential of this technique for the upcoming research. Overall results comparing different techniques, both unsupervised clustering methods and supervised classification methods, will be presented with regard to the accuracy of discrimination applied to anomaly detection.
REFERENCES
Bishop, C. M., G. E. Hinton, and I. G. D. Strachan. 1997. "GTM Through Time." In Proceedings of the IEE Fifth International Conference on Artificial Neural Networks. Cambridge, U.K.
Bishop, C. M., M. Svensén, and C. K. I. Williams. 1998. "GTM: The Generative Topographic Mapping." Neural Computation 10 (1): 215-234.
Chapelle, O., B. Schölkopf, and A. Zien. 2010. Semi-Supervised Learning. MIT Press.
Vellido, A. 2006. "Missing data imputation through GTM as a mixture of distributions." Neural Networks 19 (10): 1624-1635. doi: http://dx.doi.org/10.1016/j.neunet.2005.11.003.
Vellido, A., E. Martí, J. Comas, I. Rodríguez-Roda, and F. Sabater. 2007. "Exploring the ecological status of human altered streams through Generative Topographic Mapping." Environmental Modelling & Software 22 (7): 1053-1065. doi: http://dx.doi.org/10.1016/j.envsoft.2006.06.005.
Zhu, X., and A. B. Goldberg. 2009. Introduction to Semi-Supervised Learning. Morgan & Claypool Publishers.