(74d) Parallel Solution of Large-Scale Nonlinear Programming Problems On Modern Computing Architectures
AIChE Annual Meeting
2009
Computing and Systems Technology Division
Advances in Optimization I
Monday, November 9, 2009 - 1:30pm to 1:50pm
Large-scale nonlinear programming (NLP) has proven to be an effective framework for improving process efficiency and sustainability. However, the NLP problems of interest to both industry and academia continue to grow in scale, potentially outstripping the capacity of a single-CPU workstation. Furthermore, computer chip manufacturers are no longer focusing on increasing clock speeds and instruction throughput, but rather on multi-core architectures and hyper-threading. This means that the "free" performance improvements we have enjoyed as a result of advances in computing hardware will no longer be available unless we develop algorithms that can use modern concurrent architectures efficiently. However, not all parallel computing architectures are created equal. We will discuss the key differences among several parallel architectures available for scientific computing, including distributed clusters, general multicore systems, graphics processing units (GPUs), the Cell Broadband Engine Architecture, and massively threaded supercomputers like the Cray XMT.
In previous work, we presented an internal decomposition algorithm for the parallel solution of multi-scenario problems. We extend this approach and provide a stable decomposition of time-discretized formulations with pass-on variables. We demonstrate the performance of these parallel decomposition strategies on multiple parallel architectures. Distributed clusters operate using a multiple-instruction-multiple-data (MIMD) architecture, and the decomposition approach is implemented using independent processes that communicate through a message-passing interface (e.g., MPICH). Modern scientific computing architectures like GPUs provide significantly more processing cores per machine (e.g., 128 cores); however, these systems are typically single-instruction-multiple-data (SIMD) and have specialized kernel requirements and memory layouts. We present a fixed pivoting factorization technique that allows efficient parallel solution of structured nonlinear programming problems on GPU architectures. Several case studies in optimal design and operation show that the distributed architecture is appropriate for coarse-grained parallelization with tens of processors, while the GPU architecture is most appropriate for fine-grained parallelization in real-time applications.
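The abstract does not spell out the internal decomposition. As a rough illustration only, the sketch below assumes one common way to exploit multi-scenario structure: a Schur-complement decomposition of the block-bordered linear system that arises at each NLP iteration, with scenario blocks `K_i`, coupling blocks `B_i`, and a small block `K_d` for the shared variables `d`. All function names, the block layout, and the dense pure-Python solver are hypothetical choices made for readability, not the authors' implementation.

```python
def solve(A, B):
    """Solve A X = B (A is n x n, B is n x m) by Gaussian elimination
    with partial pivoting. Dense and unoptimized; illustration only."""
    n, m = len(A), len(B[0])
    M = [list(A[i]) + list(B[i]) for i in range(n)]  # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + m):
                M[i][j] -= f * M[k][j]
    X = [[0.0] * m for _ in range(n)]
    for i in reversed(range(n)):  # back substitution
        for c in range(m):
            s = M[i][n + c] - sum(M[i][j] * X[j][c] for j in range(i + 1, n))
            X[i][c] = s / M[i][i]
    return X


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def scenario_contribution(K, B, r):
    """Per-scenario work: B^T K^{-1} B and B^T K^{-1} r.
    Each call touches only one scenario's data, so the calls are
    independent and could run on separate processes."""
    Bt = transpose(B)
    return matmul(Bt, solve(K, B)), matmul(Bt, solve(K, r))


def schur_solve(Ks, Bs, rs, Kd, rd):
    """Solve K_i x_i + B_i d = r_i for every scenario i, together with
    sum_i B_i^T x_i + K_d d = r_d, via the Schur complement in d."""
    # Step 1 (parallelizable): independent scenario contributions.
    contribs = [scenario_contribution(K, B, r)
                for K, B, r in zip(Ks, Bs, rs)]
    # Step 2 (serial, small): assemble and solve the Schur system.
    S, rhs = Kd, rd
    for SB, Sr in contribs:
        S, rhs = sub(S, SB), sub(rhs, Sr)
    d = solve(S, rhs)
    # Step 3 (parallelizable): recover each scenario's variables.
    xs = [solve(K, sub(r, matmul(B, d))) for K, B, r in zip(Ks, Bs, rs)]
    return xs, d
```

Under this assumed scheme, Steps 1 and 3 are the coarse-grained work that would be distributed across processes on a cluster, while Step 2 involves only the shared variables and stays small as the number of scenarios grows.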