GPGPU Offloading Performance in XL with OpenMP 4.5

CASCON Compiler Driven Performance (CDP) 2016

OpenMP 4.5 allows the offloading of computation to Graphics Processing Units (GPUs) in programs written in C, C++, and Fortran without requiring a detailed understanding of general-purpose computing on GPUs (GPGPU), but at what cost? This talk contrasts OpenMP 4.5 with competing programming models: OpenACC, OpenCL, and CUDA. All these programming models allow for the execution of code on GPGPUs but differ in philosophy, with implications for performance, ease of use, and ease of implementation. This presentation illustrates these differences through a toy example and describes the implementation of OpenMP 4.5 in the XL Compiler. The talk then discusses the NVIDIA CUDA GPGPU programming model, exploring challenges in mapping OpenMP 4.5 constructs to GPGPUs. In particular, it discusses issues that arise with the sharing of variables between threads and with the implementation of nested parallelism. The talk examines the strengths and weaknesses of alternative GPGPU code generation schemes, compares the performance of OpenMP 4.5 in XL with CUDA implementations of the same programs, and identifies sources of overhead.
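
To make the directive-based style concrete, here is a minimal sketch of OpenMP 4.5 offloading in C. It is not the toy example from the talk, which the abstract does not reproduce; the function name saxpy and its parameters are hypothetical and chosen only for illustration. The sketch uses the standard OpenMP 4.5 target, teams, distribute, and parallel for constructs together with map clauses to describe data movement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration (not taken from the talk): y = a * x + y.
 * The combined construct asks the compiler to offload the loop to a
 * device and spread iterations across teams and threads. When no
 * device is available, or when OpenMP is disabled, the loop simply
 * runs on the host. */
void saxpy(int n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);

    /* map(to: ...)     copies x to the device before the region.
     * map(tofrom: ...) copies y to the device and back afterwards. */
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

A caller passes ordinary host arrays, for example saxpy(n, 2.0f, x, y), and never allocates device memory or launches a kernel explicitly. An equivalent CUDA program would typically manage device buffers, copies, and the launch configuration by hand, which is the kind of trade-off between ease of use and control that the comparison in the talk concerns.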