(254a) GPU Parameter Tuning for Dense Linear Least Squares Problems
The primary contribution of this work is a set of systematic methods for tuning GPU or hybrid CPU/GPU algorithms using derivative-free optimization (DFO) and simulation optimization (SO). A second contribution is a comparison of thirty-one DFO and four SO solvers in the context of tuning GPU algorithms.
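To make the tuning idea concrete, the following is a minimal sketch of a derivative-free search over integer tuning parameters, treating runtime as a black box. It uses a simple compass (coordinate) search, which is only one of many DFO methods and is not claimed to be any of the solvers compared in this work. The parameter names `nb` and `panel_threads` and the function `simulated_runtime` are hypothetical stand-ins; in practice the objective would time an actual solver call.

```python
from typing import Callable, Dict, Sequence, Tuple

Point = Tuple[int, ...]


def compass_search(
    objective: Callable[[Point], float],
    start: Sequence[int],
    lower: Sequence[int],
    upper: Sequence[int],
    step: int = 16,
    max_evals: int = 200,
) -> Tuple[Point, float, int]:
    """Minimize a black-box objective over an integer box without derivatives."""
    cache: Dict[Point, float] = {}

    def evaluate(p: Point) -> float:
        # Each evaluation stands for an expensive timing run, so cache results.
        if p not in cache:
            cache[p] = objective(p)
        return cache[p]

    best: Point = tuple(start)
    best_val = evaluate(best)
    # Soft evaluation budget: checked once per sweep over the coordinates.
    while step >= 1 and len(cache) < max_evals:
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                cand = list(best)
                # Move along one coordinate and clamp to the bounds.
                cand[i] = min(max(cand[i] + delta, lower[i]), upper[i])
                cand_point = tuple(cand)
                val = evaluate(cand_point)
                if val < best_val:
                    best, best_val = cand_point, val
                    improved = True
        if not improved:
            step //= 2  # no better neighbor found: refine the step size
    return best, best_val, len(cache)


def simulated_runtime(p: Point) -> float:
    """Hypothetical runtime model (seconds); NOT a measurement of any real solver."""
    nb, panel_threads = p
    return 1.0 + (nb - 96) ** 2 / 1000.0 + (panel_threads - 12) ** 2 / 100.0


default_params: Point = (32, 4)  # assumed "default" setting for illustration
best, best_time, evals = compass_search(
    simulated_runtime, default_params, lower=(16, 1), upper=(256, 32)
)
speedup = simulated_runtime(default_params) / best_time
print(f"best parameters: {best}, evaluations: {evals}, speedup: {speedup:.2f}x")
```

Caching matters here because every objective evaluation corresponds to running the algorithm being tuned; with real timings the objective is also noisy, which is the situation that simulation optimization methods are designed to handle.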
To establish a performance baseline, we evaluated several of the best-known dense linear algebra libraries: LAPACK, the multicore solver PLASMA, the GPU-only solver cuSolverDN, and the hybrid implementation MAGMA. We evaluated each solver over a wide range of square and tall-and-skinny matrix sizes. Tall-and-skinny matrices commonly arise when solving linear least squares problems (LLSPs), and their structure typically makes high performance difficult to obtain. For square matrices, MAGMA outperformed all of the other solvers. For tall-and-skinny matrices, however, MAGMA did not perform as well as the other solvers.
Our computational results show that the best DFO solver speeds up MAGMA by 1.67x relative to the default MAGMA parameters. After tuning with the proposed approach, MAGMA outperformed all other solvers for tall-and-skinny matrices.
S. Amaran, N. V. Sahinidis, B. Sharda, and S. J. Bury. Simulation optimization: A review of algorithms and applications. Annals of Operations Research, 240:351–380, 2016.
E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, S. Hammarling, A. Greenbaum, A. McKenney, and D. Sorensen. LAPACK users' guide (third ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1999.
M. Anderson, G. Ballard, J. Demmel, and K. Keutzer. Communication-avoiding QR decomposition for GPUs. In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 48–58, 2011.
B. Hadri, H. Ltaief, E. Agullo, and J. Dongarra. Tile QR factorization with parallel panel processing for multicore architectures. In Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1–10, 2010.
NVIDIA Corporation. cuSolver, Current as of 28 December, 2016. http://docs.nvidia.com/cuda/cusolver/#axzz4SZ3ssQJO.
L. M. Rios and N. V. Sahinidis. Derivative-free optimization: A review of algorithms and comparison of software implementations. Journal of Global Optimization, 56:1247–1293, 2013.
S. Tomov, J. Dongarra, and M. Baboulin. Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Computing, 36:232–240, 2010.