(266f) Fault Tolerant Computing through Machine Learning

Authors: 
Sroczynski, D., Princeton University
Kyauk, C., Princeton University
Kevrekidis, I. G., Princeton University
Villoutreix, P., Princeton University
Anden, J., Princeton University
In modern, massively parallel scientific computation, domain decomposition assigns
different segments of a domain (and different subfields/equations solved in each segment)
to different processors.
If a processor fails during a "computation era", before information is exchanged between nodes,
one faces a serious question: whether, and how, the computation can proceed.

In many cases, the different fields that these processors compute are all functions of some 
intrinsic lower-dimensional coarse variables (e.g., time during the computation, long-wavelength
features of the solution). 
If the computational algorithms share some such common information,
we can use machine learning, and in particular diffusion maps, a nonlinear manifold learning algorithm, to
"register" the computational data in the coarse space and to "fill in", to the best of our ability, data that
are missing or corrupted because of a processor failure.
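
To make the registration step concrete, here is a minimal sketch of a diffusion-map embedding (not the authors' implementation): a Gaussian kernel on pairwise distances, symmetric normalization of the resulting Markov matrix, and an eigendecomposition whose leading nontrivial eigenvectors serve as coarse coordinates. The bandwidth `eps` is an assumed tuning parameter, typically chosen from the scale of the pairwise distances.

```python
import numpy as np

def diffusion_map(X, eps, n_coords=2):
    """Leading diffusion-map coordinates for the rows of X.

    A minimal sketch: Gaussian kernel, degree normalization,
    eigendecomposition of the symmetrized Markov operator.
    """
    # pairwise squared distances between snapshots
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D2 / eps)
    # symmetric normalization keeps the eigenproblem well conditioned
    d = K.sum(1)
    A = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    # recover right eigenvectors of the Markov matrix;
    # drop the trivial constant first coordinate
    psi = vecs / np.sqrt(d)[:, None]
    return psi[:, 1:n_coords + 1] * vals[1:n_coords + 1]
```

On data lying on a one-dimensional curve, the first diffusion coordinate parametrizes position along the curve, which is exactly the sense in which snapshots from different processors can be "registered" against a shared coarse variable.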

This allows us to learn functional relationships between aspects of the data fields that are
not common across processors, effectively fusing the data sets.
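
The "fill in" step can be illustrated (in a deliberately simplified, hypothetical form) as regression over the shared coarse coordinate: corrupted snapshots of a processor-local field are replaced by an average over their nearest healthy neighbors in coarse space.

```python
import numpy as np

def impute_from_coarse(coarse, field, corrupted, k=3):
    """Fill in corrupted entries of `field` by averaging the k nearest
    healthy snapshots in the shared coarse coordinate.

    Hypothetical illustration: `coarse` is the registered coarse variable
    per snapshot, `field` a processor-local quantity, and `corrupted` a
    boolean mask marking snapshots lost to a failure.
    """
    filled = field.copy()
    healthy = np.where(~corrupted)[0]
    for i in np.where(corrupted)[0]:
        dist = np.abs(coarse[healthy] - coarse[i])
        nn = healthy[np.argsort(dist)[:k]]
        filled[i] = field[nn].mean()
    return filled
```

In practice one would regress in the learned diffusion coordinates rather than a known scalar, but the structure is the same: the coarse variable common to all processors is what lets one processor's healthy data stand in for another's lost data.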

We demonstrate our approach on two illustrative PDE systems with various spatiotemporal patterns of missing data.

The approach meshes well with equation-free computation schemes, in particular with patch dynamics;
beyond partially restoring corrupted or missing data, it can help determine
the portion of the computational domain over which simulations need not be performed,
and the processor redundancy required for different anticipated failure patterns.

This is joint work with Prof. G. Karniadakis and Dr. Seungjoon Lee at Brown University.