After Week 9’s research into OpenMP target offloading with Clang, Week 10 focused on using LFortran’s C-backend to generate code for GPU offloading. Last week, I planned to test a simple offloading example and start adapting the OpenMP pass. This week, I worked on extending the C-backend to produce code that Clang can offload to my GPU-enabled machine, spending about 26 hours tackling challenges and learning new details about array handling and the map clause.

Approach for Target Offloading

This week, we decided to use LFortran’s C-backend to create C code that can be offloaded to a GPU using Clang. The plan was to generate proper C code from the OMPRegion ASR node, which Clang could then compile to run on an NVIDIA GPU. This approach lets us build on the existing C-backend and use Clang for offloading, which I researched last week. It was a new challenge, but it felt like a good step toward adding GPU support to LFortran.

Implementation and Challenges Faced

I extended the C-backend to handle OMPRegion nodes for target offloading. This meant adding support for OpenMP directives like target, teams, and distribute parallel do. However, I ran into several difficulties. LFortran arrays are stored as custom structs with fields like data, dimensions, strides, and lengths. To offload these to the GPU, I needed to map all these fields, not just the data, so the GPU could understand the array’s structure. Another issue was with statically allocated arrays (fixed-size arrays). These are not allocated with malloc, so their pointers caused conflicts when transferring data: it seems the OpenMP runtime can map the host’s heap memory to the device, but not memory on the host’s stack. Switching to allocatable arrays solved this problem, since they are dynamically allocated on the heap and can therefore be mapped to the GPU.

During this process, I also learned about the map clause, which controls how data moves between host and device. The map types to (copied to the device, read-only there), from (copied back from the device, written by the GPU), and tofrom (copied both ways) are key. For my array structs, I set the data field to tofrom since it needs to be updated, and fields like lower_bound, length, and stride to to, because they are only read by the GPU.

Challenges with CI Setup

I tried to set up this offloading support in the Continuous Integration (CI) system, but I hit a roadblock. GitHub’s CI runners don’t have GPU access, which testing offloading requires. I have a Conda setup script that works on GPU-enabled machines, and I looked into GitLab, which can provide GPU runners. However, LFortran moved away from GitLab a while ago, so that option isn’t available right now. I’m still searching for a way to test this in CI, perhaps by finding a different platform or a runner with GPU support, but for now I haven’t made a PR, since testing is limited to my local GPU machine.

Example: Target Offloading with C-Backend

Below is the MRE I used to test the C-backend’s target offloading, along with the generated C code. This example offloads a large array computation to the GPU.

MRE for Target Offloading (openmp_70.f90):
program openmp_70
    implicit none
    real, allocatable, dimension(:) :: a, b
    integer :: i
    allocate(a(10000000), b(10000000))
    b = 5
    !$omp target map(tofrom: a, b)
    !$omp teams
    !$omp distribute parallel do
    do i = 1, 10000000
        a(i) = real(i) + b(i)*340
    end do
    !$omp end distribute parallel do
    !$omp end teams
    !$omp end target
    print *, a(5), b(5)
    if (a(5) /= 1705) error stop
    if (b(5) /= 5) error stop
end program openmp_70
Generated C Code (C-DUMP):
#include <inttypes.h>

#include <stdlib.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <lfortran_intrinsics.h>

struct dimension_descriptor
{
    int32_t lower_bound, length, stride;
};

struct r32
{
    float *data;
    struct dimension_descriptor dims[32];
    int32_t n_dims;
    int32_t offset;
    bool is_allocated;
};

float _lcompilers_real_i32(int32_t x);

float _lcompilers_real_i32(int32_t x)
{
    float _lcompilers_real_i32;
    _lcompilers_real_i32 = (float)(x);
    return _lcompilers_real_i32;
}

int main(int argc, char* argv[])
{
    _lpython_set_argv(argc, argv);
    int32_t __libasr_index_0_;
    struct r32 a_value;
    struct r32* a = &a_value;
    float *a_data;
    a->data = a_data;
    a->n_dims = 1;
    a->offset = 0;
    a->dims[0].lower_bound = 1;
    a->dims[0].length = 0;
    a->dims[0].stride = 1;
    struct r32 b_value;
    struct r32* b = &b_value;
    float *b_data;
    b->data = b_data;
    b->n_dims = 1;
    b->offset = 0;
    b->dims[0].lower_bound = 1;
    b->dims[0].length = 0;
    b->dims[0].stride = 1;
    int32_t i;
    a->n_dims = 1;
    a->dims[0].lower_bound = 1;
    a->dims[0].length = 10000000;
    a->dims[0].stride = 1;
    a->data = (float*) _lfortran_malloc(1*a->dims[0].length*sizeof(float));
    a->is_allocated = true;
    b->n_dims = 1;
    b->dims[0].lower_bound = 1;
    b->dims[0].length = 10000000;
    b->dims[0].stride = 1;
    b->data = (float*) _lfortran_malloc(1*b->dims[0].length*sizeof(float));
    b->is_allocated = true;
    for (__libasr_index_0_=((int32_t)b->dims[1-1].lower_bound); __libasr_index_0_<=((int32_t) b->dims[1-1].length + b->dims[1-1].lower_bound - 1); __libasr_index_0_++) {
        b->data[((0 + (b->dims[0].stride * (__libasr_index_0_ - b->dims[0].lower_bound))) + b->offset)] = (float)(5);
    }
#pragma omp target  map(tofrom: a->data[0:a->dims[0].length-1]) map(to: a->dims[0].lower_bound, a->dims[0].length, a->dims[0].stride) map(to: a->n_dims) map(from: a->offset) map(tofrom: b->data[0:b->dims[0].length-1]) map(to: b->dims[0].lower_bound, b->dims[0].length, b->dims[0].stride) map(to: b->n_dims) map(from: b->offset)
#pragma omp teams
#pragma omp distribute parallel for
    for (i=1; i<=10000000; i++) {
        a->data[((0 + (a->dims[0].stride * (i - a->dims[0].lower_bound))) + a->offset)] = _lcompilers_real_i32(i) + b->data[((0 + (b->dims[0].stride * (i - b->dims[0].lower_bound))) + b->offset)]*(float)(340);
    }

    printf("%f%s%f\n", a->data[((0 + (a->dims[0].stride * (5 - a->dims[0].lower_bound))) + a->offset)], " ", b->data[((0 + (b->dims[0].stride * (5 - b->dims[0].lower_bound))) + b->offset)]);
    if (a->data[((0 + (a->dims[0].stride * (5 - a->dims[0].lower_bound))) + a->offset)] != (float)(1705)) {
        fprintf(stderr, "ERROR STOP");
        exit(1);
    }
    if (b->data[((0 + (b->dims[0].stride * (5 - b->dims[0].lower_bound))) + b->offset)] != (float)(5)) {
        fprintf(stderr, "ERROR STOP");
        exit(1);
    }
    // FIXME: implicit deallocate(a, b, );
    return 0;
}

The generated C code shows how the arrays a and b are handled as struct r32, with data mapped as tofrom and descriptor fields like length and stride as to. The loop is offloaded using the OpenMP target, teams, and distribute parallel for directives, and the result is verified on the host afterward.
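For reference, on my GPU machine a dump like this can be compiled with Clang's offloading support along these lines. The file name, include/library paths, and library name are placeholders for my local setup, not something the reader's environment will necessarily match:

```
# Hypothetical invocation; adjust paths and the offload target for your setup.
clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda \
      -I"$LFORTRAN_DIR"/src/libasr/runtime \
      openmp_70.c -L"$LFORTRAN_DIR"/src/runtime -llfortran_runtime \
      -o openmp_70
./openmp_70
```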

Next Steps

For Week 11, I plan to:

  • Create a PR with the C-backend changes once CI testing is feasible.
  • Explore alternative ways to test target offloading, such as a different CI platform or a runner with GPU support.
  • Test more complex offloading examples, including multiple arrays and clauses.

I am thankful to my mentors, Ondrej Certik, Pranav Goswami, and Gaurav Dhingra, for their guidance as I navigated this complex setup. I also appreciate the LFortran community’s support during this learning phase.