(266f) Fault Tolerant Computing through Machine Learning

Authors: 
Sroczynski, D., Princeton University
Kyauk, C., Princeton University
Kevrekidis, I. G., Princeton University
Villoutreix, P., Princeton University
Anden, J., Princeton University
In modern, massively parallel scientific computation, domain decomposition assigns
different segments of a domain (and different subfields/equations solved in each segment)
to different processors.
If a processor fails during a "computation era", before information is exchanged between nodes,
one faces a serious question: whether, and how, the computation can proceed.

In many cases, the different fields that these processors compute are all functions of some 
intrinsic lower-dimensional coarse variables (e.g., time during the computation, long-wavelength
features of the solution). 
If the computational algorithms share some such common information,
we can use machine learning, and in particular diffusion maps, a nonlinear manifold learning algorithm, to
"register" the computational data in the coarse space and to "fill in", to the best of our ability, data that
are missing or corrupted because of a processor failure.
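
To make the registration step concrete, here is a minimal sketch of a diffusion-map embedding (not the authors' implementation): a Gaussian kernel on pairwise distances, symmetric normalization of the resulting Markov matrix, and an eigendecomposition whose leading nontrivial eigenvectors serve as coarse coordinates. The bandwidth `eps` is an assumed tuning parameter, typically chosen from the scale of the pairwise distances.

```python
import numpy as np

def diffusion_map(X, eps, n_coords=2):
    """Leading diffusion-map coordinates for the rows of X.

    A minimal sketch: Gaussian kernel, degree normalization,
    eigendecomposition of the symmetrized Markov operator.
    """
    # pairwise squared distances between snapshots
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D2 / eps)
    # symmetric normalization keeps the eigenproblem well conditioned
    d = K.sum(1)
    A = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    # recover right eigenvectors of the Markov matrix;
    # drop the trivial constant first coordinate
    psi = vecs / np.sqrt(d)[:, None]
    return psi[:, 1:n_coords + 1] * vals[1:n_coords + 1]
```

On data lying on a one-dimensional curve, the first diffusion coordinate parametrizes position along the curve, which is exactly the sense in which snapshots from different processors can be "registered" against a shared coarse variable.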

This allows us to learn functional relationships between aspects of the data fields that are
not common across processors, effectively fusing the data sets.
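
The "fill in" step can be illustrated (in a deliberately simplified, hypothetical form) as regression over the shared coarse coordinate: corrupted snapshots of a processor-local field are replaced by an average over their nearest healthy neighbors in coarse space.

```python
import numpy as np

def impute_from_coarse(coarse, field, corrupted, k=3):
    """Fill in corrupted entries of `field` by averaging the k nearest
    healthy snapshots in the shared coarse coordinate.

    Hypothetical illustration: `coarse` is the registered coarse variable
    per snapshot, `field` a processor-local quantity, and `corrupted` a
    boolean mask marking snapshots lost to a failure.
    """
    filled = field.copy()
    healthy = np.where(~corrupted)[0]
    for i in np.where(corrupted)[0]:
        dist = np.abs(coarse[healthy] - coarse[i])
        nn = healthy[np.argsort(dist)[:k]]
        filled[i] = field[nn].mean()
    return filled
```

In practice one would regress in the learned diffusion coordinates rather than a known scalar, but the structure is the same: the coarse variable common to all processors is what lets one processor's healthy data stand in for another's lost data.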

We demonstrate our approach on two illustrative PDE systems with various spatiotemporal patterns of missing data.

The approach meshes well with equation-free computation schemes, in particular with patch dynamics;
beyond partially restoring corrupted or missing data, it can help determine
the portion of the computational domain over which simulations need not be performed,
and the processor redundancy required for different anticipated failure patterns.

This is joint work with Prof. G. Karniadakis and Dr. Seungjoon Lee at Brown University.