FOSDEM 2024

AOMP Compiler Kung Fu: Mastering Optimization Flags and Environment Variables for Performance

Track: HPC, Big Data & Data Science devroom
Room: UA2.118 (Henriot)
Day: Saturday
Start: 12:30
End: 13:00
Video only: ua2118

The world’s largest supercomputers get most of their compute capability from their GPU accelerators, so applications must target GPUs to use these machines effectively. Because applications also need to remain portable, OpenMP offloading is a popular way to accelerate them on GPUs.

This talk presents the ROCm OpenMP offloading compiler from two perspectives: compiler flags and compiler/runtime implementation. Because choosing the right compiler flags can be critical for performance, the talk presents some of the tuning knobs that the ROCm OpenMP compiler exposes to users and details which optimizations, features, and parameters each flag enables. For each flag listed, the talk describes what happens in the compiler and in the runtime library. For several of these switches, the talk also presents performance results.

In particular, the talk covers the high-level flag -fopenmp-target-fast and the optimizations that switch enables. It then goes into more detail about the flags for the individual underlying optimizations and the user-facing parameters that allow fine-tuning in certain scenarios. Specifically, the talk covers the following flags:

-fopenmp-gpu-threads-per-team: This switch controls kernel launch parameters considered during the compiler’s code generation phase, which can impact things like register allocation.
-fopenmp-target-big-jump-loop: This switch enables a more optimized form of loops for running on the GPU that is closer to traditional block-languages like CUDA or HIP.
-fopenmp-target-no-loop: This switch enables the generation of CUDA-like code without any loop executed on the device. The talk outlines which conditions must be met to allow the compiler to generate such GPU code.
-fopenmp-target-xteam-reductions: A special form of reduction kernels that can greatly benefit the performance of reductions. One of its main tuning knobs is the blocksize.
-fopenmp-target-xteam-reduction-blocksize: This switch sets the blocksize used by xteam reductions. The talk discusses the implications of that choice.
-fopenmp-force-usm: This flag is a convenient switch that makes the compiler emit code as if the user had specified #pragma omp requires unified_shared_memory.
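
To make the list above more concrete, here is a minimal sketch in C of the two loop shapes these flags are aimed at: an element-wise loop with independent iterations, and a sum reduction. The function names saxpy and dot are illustrative and do not come from the talk. Whether a given loop actually qualifies for the no-loop, big-jump-loop, or xteam reduction code paths depends on conditions the talk covers, which this sketch does not claim to satisfy.

```c
#include <stddef.h>

/* Element-wise update y[i] = a * x[i] + y[i].
 * Every iteration is independent, which is the kind of loop the
 * loop-shape flags are concerned with (an assumption for illustration). */
void saxpy(size_t n, float a, const float *x, float *y)
{
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Dot product: a sum reduction across all teams and threads,
 * the pattern that reduction-specific flags are concerned with. */
double dot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp target teams distribute parallel for \
        reduction(+: sum) map(to: x[0:n], y[0:n]) map(tofrom: sum)
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

With an offloading-capable clang, such a file would typically be compiled with something like -O3 -fopenmp --offload-arch=gfx90a -fopenmp-target-fast, where gfx90a is only an assumed example target. Without OpenMP enabled, the pragmas are ignored and the loops run sequentially on the host with the same results.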

In addition to the compiler flags, the runtime offers additional configurability via environment variables. Some of these variables influence the very same mechanisms, whereas others provide separate mechanisms. The talk gives an overview of some of these variables and how they influence execution.
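
Because such settings are picked up from the environment at run time, it can help to log them at program start so that performance results stay reproducible. The following is a minimal sketch under stated assumptions: the abstract does not name the variables the talk covers, so the list below is a selection of variables from the OpenMP specification (OMP_TARGET_OFFLOAD, OMP_NUM_TEAMS, OMP_TEAMS_THREAD_LIMIT) and from LLVM's offloading runtime (LIBOMPTARGET_INFO). The helper names env_or_unset and print_offload_env are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Assumed selection of offloading-related variables; not taken from the talk. */
static const char *const offload_env_vars[] = {
    "OMP_TARGET_OFFLOAD",     /* OpenMP: MANDATORY, DISABLED, or DEFAULT */
    "OMP_NUM_TEAMS",          /* OpenMP: upper limit on the number of teams */
    "OMP_TEAMS_THREAD_LIMIT", /* OpenMP: upper limit on threads per team */
    "LIBOMPTARGET_INFO",      /* LLVM libomptarget: runtime information output */
};

/* Return the value of an environment variable, or "(unset)" if it is
 * missing or empty. */
const char *env_or_unset(const char *name)
{
    const char *value = getenv(name);
    return (value != NULL && value[0] != '\0') ? value : "(unset)";
}

/* Print one NAME=value line per variable in the list above. */
void print_offload_env(FILE *out)
{
    size_t count = sizeof offload_env_vars / sizeof offload_env_vars[0];
    for (size_t i = 0; i < count; ++i) {
        fprintf(out, "%s=%s\n", offload_env_vars[i],
                env_or_unset(offload_env_vars[i]));
    }
}
```

Calling print_offload_env(stderr) at the start of a run records the configuration next to the program's other output.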

The repositories can be found at https://github.com/ROCm. The AOMP development compiler (with build recipes, etc.) can be found at https://github.com/ROCm/aomp.