After last week’s work on the schedule clause, num_threads, and the atomic construct, Week 9 took a different turn. I had planned to add more constructs like simd, but instead I dove into researching OpenMP target offloading, which lets us run code on GPUs — especially since I have access to an NVIDIA GPU through my institute’s HPC server. I spent about 21 hours this week exploring this new area and setting up the tools needed; the setup steps can be found at #4497 (comment).

Researching Target Offloading

I started by figuring out how to explore OpenMP target offloading. I had two choices: look at how GCC handles it or check out Clang’s approach. Since GCC has limited support for NVIDIA GPUs (the only type I can use), I went with Clang because it works better with NVIDIA and is based on LLVM, which fits well with LFortran’s backend. This felt like a good starting point to learn how to offload code to my GPU.

Setting Up Clang for GPU Offloading

My goal was to get Clang working for target offloading on my NVIDIA GPU. This meant installing all the right tools, but I wanted to keep it simple using Conda so others can follow along easily. First, I tried building the LLVM project from scratch, but it kept failing. I think it was because I couldn’t find the perfect mix of versions and build flags. I also struggled a lot with installing different versions of Clang, LLVM, CUDA Toolkit, and other tools to find combinations that work together. After many tries, I decided to use prebuilt binaries instead, which saved me time and worked better.

What I Learned About Target Offloading

Here’s what I discovered about how OpenMP target offloading works:

1. OpenMP Target Offloading Architecture

OpenMP offloading creates two sets of code: one for the host (CPU) and one for the target device (GPU). The GPU code is embedded into the host code to make a “fat object.” A special tool then pulls out the GPU code, links it, and adds it back into the final program so the host can use it on the GPU.

The setup looks like this:

  • Host (CPU) runs the main program and uses a runtime library called libomptarget.
  • Device (GPU) runs special kernels (like CUDA code) with its own runtime support.
  • They connect through a plugin that talks to the GPU.

This system has three main parts:

  • Host Runtime (libomptarget.so): Handles device setup, memory transfers, and kernel launches. Without it, OpenMP target constructs wouldn’t work.
  • CUDA Plugin (libomptarget.rtl.cuda.so): Links the runtime to NVIDIA’s CUDA system, managing GPU memory and launches.
  • Device Runtime (libomptarget.devicertl.a): Runs OpenMP features like thread teams on the GPU.
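
To see these three parts interact, here is a minimal smoke test I find useful (a sketch, not a definitive recipe: the file path is arbitrary and sm_89 matches my L4 GPU — adjust for other hardware). The C program offloads a reduction; running it with LIBOMPTARGET_INFO=1 makes the host runtime print what it is doing as it maps data and launches the kernel through the CUDA plugin:

```shell
# Hypothetical file name; any path works.
cat > /tmp/offload_sum.c << 'EOF'
#include <stdio.h>
#include <omp.h>

int main(void) {
    double sum = 0.0;
    /* libomptarget maps `sum` to the device, the CUDA plugin launches
       the kernel, and the device runtime manages the teams/threads. */
    #pragma omp target teams distribute parallel for reduction(+:sum) map(tofrom:sum)
    for (int i = 0; i < 100000; i++)
        sum += (double)i;
    printf("sum = %.0f\n", sum);  /* expect 4999950000 */
    return 0;
}
EOF
clang -fopenmp --offload-arch=sm_89 /tmp/offload_sum.c -o /tmp/offload_sum
LIBOMPTARGET_INFO=1 /tmp/offload_sum
```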

2. Key Components

Each part has specific jobs:

  • libomptarget.so: Sets up the GPU, moves data, and runs kernels. It’s the heart of the system.
  • libomptarget.rtl.cuda.so: Connects to CUDA for GPU tasks like memory and context management.
  • libomptarget.devicertl.a: Handles GPU-side OpenMP features like synchronization and memory.
  • Bitcode Files (.bc): Special files like libomptarget-nvptx-sm_89.bc for different GPU types (e.g., sm_89 for my NVIDIA L4).
  • clang-offload-packager: Puts GPU code into the host file.
  • clang-linker-wrapper: Links the GPU code and adds it to the final program.
  • clang-offload-bundler: Combines host and GPU code into one file.
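
One way to watch these tools at work is to peek inside a fat object. The sketch below assumes some source file test.c containing an OpenMP target region; llvm-objdump (from the same LLVM release) has an --offloading option that lists the device images the packager embedded:

```shell
# Compile to a fat object: host code plus embedded GPU code, no linking yet.
clang -fopenmp --offload-arch=sm_89 -c test.c -o test.o
# List the embedded offloading images (kind, triple, arch).
llvm-objdump --offloading test.o
```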

3. How Compilation Works

The process to create a program with GPU support has four steps:

  • Host Compilation: Turns the code into a CPU object file and prepares offloading parts.
  • Device Compilation: Makes GPU code and links it with the device runtime.
  • Bundling: Combines CPU and GPU code into a fat object.
  • Final Linking: Links the GPU code and builds the final program.
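
Clang’s driver can show this pipeline without compiling anything. For a hypothetical test.c, -ccc-print-phases prints the host and device compilation phases, the offload packaging, and the final link that the steps above describe:

```shell
# Print the driver's compilation phases (nothing is actually built).
clang -ccc-print-phases -fopenmp --offload-arch=sm_89 test.c
# The usual one-line build that runs all four steps end to end:
clang -fopenmp --offload-arch=sm_89 test.c -o test
```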

Steps to Set Up and My Findings

Here’s how I set up Clang for target offloading on my system, which has an AMD EPYC 7742 CPU and two NVIDIA L4 GPUs:

I created a Conda environment called openmp-llvm18_tt with these commands:

conda create -n openmp-llvm18_tt -c conda-forge -y \
    python=3.11 \
    clang=18.1.8 \
    llvm=18.1.8 \
    llvm-openmp=18.1.8 \
    cuda-toolkit=12.4 \
    cuda-nvcc=12.4 \
    cmake=3.27 \
    wget \
    tar
conda activate openmp-llvm18_tt

Then, I downloaded prebuilt LLVM 18.1.8 with Clang support and set it up:

mkdir -p /tmp/clang_llvm_18
cd /tmp/clang_llvm_18
wget -q --show-progress \
    https://github.com/llvm/llvm-project/releases/download/llvmorg-18.1.8/clang+llvm-18.1.8-x86_64-linux-gnu-ubuntu-18.04.tar.xz
tar -xvf clang+llvm-18.1.8-x86_64-linux-gnu-ubuntu-18.04.tar.xz
export LLVM_DIR="$PWD/clang+llvm-18.1.8-x86_64-linux-gnu-ubuntu-18.04"

I copied the runtime libraries, bitcode files, and offloading tools to the Conda environment:

cp $LLVM_DIR/lib/libomp.so* $CONDA_PREFIX/lib/
cp $LLVM_DIR/lib/libomptarget.so* $CONDA_PREFIX/lib/
cp $LLVM_DIR/lib/libomptarget.rtl.*.so* $CONDA_PREFIX/lib/
mkdir -p $CONDA_PREFIX/lib/clang/18.1.8/lib/nvptx64-nvidia-cuda
cp $LLVM_DIR/lib/libomptarget.devicertl.a $CONDA_PREFIX/lib/
cp $LLVM_DIR/lib/libomptarget.devicertl.a $CONDA_PREFIX/lib/clang/18.1.8/lib/
cp $LLVM_DIR/lib/libomptarget-nvptx-sm_*.bc $CONDA_PREFIX/lib/clang/18.1.8/lib/nvptx64-nvidia-cuda/
cp $LLVM_DIR/bin/clang-linker-wrapper $CONDA_PREFIX/bin/
cp $LLVM_DIR/bin/clang-offload-packager $CONDA_PREFIX/bin/

I also added activation and deactivation scripts to set environment variables:

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
cat > $CONDA_PREFIX/etc/conda/activate.d/openmp_llvm18_tt.sh << 'EOF'
export CUDA_HOME=$CONDA_PREFIX
export CUDA_PATH=$CONDA_PREFIX
export LD_LIBRARY_PATH="/usr/lib/x86_64-linux-gnu:$CONDA_PREFIX/lib:$CONDA_PREFIX/lib/clang/18.1.8/lib:$LD_LIBRARY_PATH"
export LIBRARY_PATH=$CONDA_PREFIX/lib:$CONDA_PREFIX/lib/clang/18.1.8/lib:$LIBRARY_PATH
export OMP_TARGET_OFFLOAD=MANDATORY
export LIBOMPTARGET_DEVICE_ARCHITECTURES=sm_89
export LIBOMPTARGET_INFO=1
export LIBOMPTARGET_NVPTX_BC_PATH=$CONDA_PREFIX/lib/clang/18.1.8/lib/nvptx64-nvidia-cuda
export CLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_89
echo "✓ OpenMP GPU environment activated"
EOF
chmod +x $CONDA_PREFIX/etc/conda/activate.d/openmp_llvm18_tt.sh

mkdir -p $CONDA_PREFIX/etc/conda/deactivate.d
cat > $CONDA_PREFIX/etc/conda/deactivate.d/openmp_llvm18_tt.sh << 'EOF'
#!/bin/bash
unset CUDA_HOME CUDA_PATH OMP_TARGET_OFFLOAD
unset LIBOMPTARGET_DEVICE_ARCHITECTURES LIBOMPTARGET_INFO LIBOMPTARGET_NVPTX_BC_PATH
unset CLANG_OPENMP_NVPTX_DEFAULT_ARCH
EOF
chmod +x $CONDA_PREFIX/etc/conda/deactivate.d/openmp_llvm18_tt.sh

conda deactivate
conda activate openmp-llvm18_tt

These steps worked on my system, and I hope they can help others get started too. The setup ensures Clang can compile code for my NVIDIA L4 GPU with the sm_89 architecture.
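
As a final sanity check of the environment (again a sketch; the file path is arbitrary), the program below asks the runtime how many devices it sees and whether a target region really leaves the host. Since the activation script exports OMP_TARGET_OFFLOAD=MANDATORY, the runtime aborts instead of silently falling back to the CPU:

```shell
cat > /tmp/gpu_check.c << 'EOF'
#include <stdio.h>
#include <omp.h>

int main(void) {
    int on_host = 1;
    printf("devices visible: %d\n", omp_get_num_devices());
    /* omp_is_initial_device() returns nonzero only on the host. */
    #pragma omp target map(tofrom: on_host)
    on_host = omp_is_initial_device();
    printf(on_host ? "ran on host\n" : "ran on GPU\n");
    return 0;
}
EOF
clang -fopenmp --offload-arch=sm_89 /tmp/gpu_check.c -o /tmp/gpu_check
/tmp/gpu_check
```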

Next Steps

For Week 10, I plan to:

  • Figure out how to make LFortran work with this same offloading setup.
  • Document any challenges or additional setup steps I find.

I thank my mentors, Ondrej Certik, Pranav Goswami, and Gaurav Dhingra, for their support as I explored this new area. I also appreciate the LFortran community for their encouragement.