Following Week 10’s extension of the C-backend for target offloading, Week 11 focused on finding a way to test this on CI. After a fruitful discussion with my mentors Ondrej and Pranav, we agreed on a dual-mode approach for CPU and GPU testing. I spent about 32 hours adapting the C-backend to generate CUDA-specific code while maintaining CPU compatibility, ensuring the code works on both platforms with a simple switch.

Discussion and Strategy with Mentors

This week, I discussed with my mentors how to test target offloading on CI, since the GitHub Actions runners we use have no GPU access. We concluded that the C-backend should generate code against the CUDA runtime API, with our own CPU implementations of those calls for CI testing (a CPU_ENABLED switch) and the real CUDA runtime for GPU testing (a GPU_ENABLED switch); in the code below this became a single USE_GPU macro. The plan was to extend the C-backend to produce CUDA code equivalent to the target, teams, and distribute parallel do constructs, including map clauses, while keeping CPU compatibility.

Implementation Approach

I started by building on last week’s C-backend output, which emitted OpenMP pragmas such as #pragma omp target, and added a --target-offload flag to LFortran. Running lfortran --openmp --show-c generates the usual OpenMP C code, while lfortran --openmp --show-c --target-offload produces the CUDA-specific code. This keeps backward compatibility and makes switching between the two modes trivial.

For the CUDA code, I first handwrote a version to validate the concept, using LFortran’s array descriptor (struct r32 for single-precision arrays) and checking that the results matched the OpenMP version. Initially I planned to transfer each struct member (data, dims, and so on) to the GPU separately, but that was inefficient and would have required rewriting every array access inside the kernel. Instead, I copied the whole struct to the device with cudaMalloc and cudaMemcpy, updating only the data pointer to point to device memory, which kept code generation simple.
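
In essence, the transfer for each array follows the pattern below. This is a minimal sketch with error handling omitted; the full generated code further down does exactly this for both a and b.

// Sketch: move one LFortran descriptor (struct r32 *a) to the device.
// Only the data pointer is patched; dims, offset, n_dims travel unchanged.
float *d_a_data;
size_t a_data_size = a->dims[0].length * sizeof(float);
cudaMalloc((void**)&d_a_data, a_data_size);                    // device buffer for the raw array
cudaMemcpy(d_a_data, a->data, a_data_size, cudaMemcpyHostToDevice);

struct r32 h_a_copy = *a;                                      // host-side shadow of the descriptor
h_a_copy.data = d_a_data;                                      // point it at device memory

struct r32 *d_a_struct;
cudaMalloc((void**)&d_a_struct, sizeof(struct r32));           // device copy of the whole struct
cudaMemcpy(d_a_struct, &h_a_copy, sizeof(struct r32), cudaMemcpyHostToDevice);
// d_a_struct is what the kernel receives; after cudaDeviceSynchronize(),
// only d_a_data needs to be copied back with cudaMemcpyDeviceToHost.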

Code Structure and Dual-Mode Support

The generated code uses a macro USE_GPU to switch between GPU and CPU modes. Here’s how it works:

  • GPU Mode (USE_GPU defined): Includes cuda_runtime.h, uses CUDA functions like cudaMalloc and cudaLaunchKernel, and compiles with nvcc. The kernel is marked __global__, and thread and block IDs are set using CUDA’s grid and block dimensions.
  • CPU Mode (USE_GPU not defined): Includes cpu_impl.h, which provides CPU implementations of CUDA functions using OpenMP. For example, cudaLaunchKernel emulates parallelism with #pragma omp parallel, adjusting blockIdx, threadIdx, blockDim, and gridDim to mimic CUDA behavior.

The kernel function compute_kernel_0 is declared __global__ for the GPU build and as a regular function for the CPU build. On the GPU, memory management goes through the real cudaMalloc and cudaMemcpy; on the CPU, the emulated versions fall back to malloc and memcpy, with every allocation recorded in a memory_tracker_t table so that cudaFree can release it later.
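
In condensed form (extracted from the generated file shown below), the switch looks like this:

#ifdef USE_GPU
#include <cuda_runtime.h>   // real CUDA runtime, compiled with nvcc
#else
#include "cpu_impl.h"       // OpenMP-based emulation of the same API
#endif

#ifdef USE_GPU
__global__                  // device kernel on GPU, plain function on CPU
#endif
void compute_kernel_0(struct r32 *a, struct r32 *b, int i_n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    if (i <= i_n) {
        // element-wise work on a->data and b->data
    }
}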

MRE for Target Offloading (openmp_70.f90)
program openmp_70
    implicit none
    real, allocatable, dimension(:) :: a, b
    integer :: i
    allocate(a(10000000), b(10000000))
    b=5
    !$omp target map(tofrom:a, b)
        !$omp teams
            !$omp distribute parallel do
                do i = 1, 10000000
                    a(i) = i + b(i)*340
                end do
            !$omp end distribute parallel do
        !$omp end teams
    !$omp end target
    print*, a(5), b(5)
    if(a(5) /= 1705) error stop
    if(b(5) /= 5) error stop
end program openmp_70

Generated C Code with OpenMP (lfortran --openmp --show-c)
#include <inttypes.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <lfortran_intrinsics.h>

struct dimension_descriptor
{
    int32_t lower_bound, length, stride;
};

struct r32
{
    float *data;
    struct dimension_descriptor dims[32];
    int32_t n_dims;
    int32_t offset;
    bool is_allocated;
};

// Implementations
float _lcompilers_real_i32(int32_t x)
{
    float _lcompilers_real_i32;
    _lcompilers_real_i32 = (float)(x);
    return _lcompilers_real_i32;
}

int main(int argc, char* argv[])
{
    _lpython_set_argv(argc, argv);
    int32_t __libasr_index_0_;
    struct r32 a_value;
    struct r32* a = &a_value;
    float *a_data;
    a->data = a_data;
    a->n_dims = 1;
    a->offset = 0;
    a->dims[0].lower_bound = 1;
    a->dims[0].length = 0;
    a->dims[0].stride = 1;
    struct r32 b_value;
    struct r32* b = &b_value;
    float *b_data;
    b->data = b_data;
    b->n_dims = 1;
    b->offset = 0;
    b->dims[0].lower_bound = 1;
    b->dims[0].length = 0;
    b->dims[0].stride = 1;
    int32_t i;
    a->n_dims = 1;
    a->dims[0].lower_bound = 1;
    a->dims[0].length = 10000000;
    a->dims[0].stride = 1;
    a->data = (float*) _lfortran_malloc(1*a->dims[0].length*sizeof(float));
    a->is_allocated = true;
    b->n_dims = 1;
    b->dims[0].lower_bound = 1;
    b->dims[0].length = 10000000;
    b->dims[0].stride = 1;
    b->data = (float*) _lfortran_malloc(1*b->dims[0].length*sizeof(float));
    b->is_allocated = true;
    for (__libasr_index_0_=((int32_t)b->dims[1-1].lower_bound); __libasr_index_0_<=((int32_t) b->dims[1-1].length + b->dims[1-1].lower_bound - 1); __libasr_index_0_++) {
        b->data[((0 + (b->dims[0].stride * (__libasr_index_0_ - b->dims[0].lower_bound))) + b->offset)] = (float)(5);
    }
#pragma omp target map(tofrom:a, b)
#pragma omp teams
#pragma omp distribute parallel for
    for (i=1; i<=10000000; i++) {
        a->data[((0 + (a->dims[0].stride * (i - a->dims[0].lower_bound))) + a->offset)] = _lcompilers_real_i32(i) + b->data[((0 + (b->dims[0].stride * (i - b->dims[0].lower_bound))) + b->offset)]*(float)(340);
    }
    printf("%f%s%f\n", a->data[((0 + (a->dims[0].stride * (5 - a->dims[0].lower_bound))) + a->offset)], " ", b->data[((0 + (b->dims[0].stride * (5 - b->dims[0].lower_bound))) + b->offset)]);
    if (a->data[((0 + (a->dims[0].stride * (5 - a->dims[0].lower_bound))) + a->offset)] != (float)(1705)) {
        fprintf(stderr, "ERROR STOP");
        exit(1);
    }
    if (b->data[((0 + (b->dims[0].stride * (5 - b->dims[0].lower_bound))) + b->offset)] != (float)(5)) {
        fprintf(stderr, "ERROR STOP");
        exit(1);
    }
    // FIXME: implicit deallocate(a, b, );
    return 0;
}

Generated C Code with CUDA (omp_off_gen.c) (lfortran --openmp --show-c --target-offload)
#include <inttypes.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <lfortran_intrinsics.h>
#ifdef USE_GPU
#include<cuda_runtime.h>
#else
#include"cpu_impl.h"
#endif
struct dimension_descriptor
{
    int32_t lower_bound, length, stride;
};
struct r32
{
    float *data;
    struct dimension_descriptor dims[32];
    int32_t n_dims;
    int32_t offset;
    bool is_allocated;
};
// Implementations
#ifdef USE_GPU
__global__
#endif
void compute_kernel_0(struct r32 *a, struct r32 *b, int i_n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    if (i <= i_n) {
            a->data[((0 + (a->dims[0].stride * (i - a->dims[0].lower_bound))) + a->offset)] = (float)(i) + b->data[((0 + (b->dims[0].stride * (i - b->dims[0].lower_bound))) + b->offset)]*(float)(340);
    }
}
#ifndef USE_GPU
void compute_kernel_0_wrapper(void **args) {
    struct r32 *a = *(struct r32**)args[0];
    struct r32 *b = *(struct r32**)args[1];
    int i_n = *(int*)args[2];
    compute_kernel_0(a, b, i_n);
}
#endif
#ifndef USE_GPU
void compute_kernel_wrapper(void **args, void *func) {
    if (func == (void*)compute_kernel_0) {
        compute_kernel_0_wrapper(args);
        return;
    }
    fprintf(stderr, "Unknown kernel function\n");
    exit(1);
}
#endif
int main(int argc, char* argv[])
{
    _lpython_set_argv(argc, argv);
    int32_t __libasr_index_0_;
    struct r32 a_value;
    struct r32* a = &a_value;
    float *a_data;
    a->data = a_data;
    a->n_dims = 1;
    a->offset = 0;
    a->dims[0].lower_bound = 1;
    a->dims[0].length = 0;
    a->dims[0].stride = 1;
    struct r32 b_value;
    struct r32* b = &b_value;
    float *b_data;
    b->data = b_data;
    b->n_dims = 1;
    b->offset = 0;
    b->dims[0].lower_bound = 1;
    b->dims[0].length = 0;
    b->dims[0].stride = 1;
    int32_t i;
    a->n_dims = 1;
    a->dims[0].lower_bound = 1;
    a->dims[0].length = 10000000;
    a->dims[0].stride = 1;
    a->data = (float*) _lfortran_malloc(1*a->dims[0].length*sizeof(float));
    a->is_allocated = true;
    b->n_dims = 1;
    b->dims[0].lower_bound = 1;
    b->dims[0].length = 10000000;
    b->dims[0].stride = 1;
    b->data = (float*) _lfortran_malloc(1*b->dims[0].length*sizeof(float));
    b->is_allocated = true;
    for (__libasr_index_0_=((int32_t)b->dims[1-1].lower_bound); __libasr_index_0_<=((int32_t) b->dims[1-1].length + b->dims[1-1].lower_bound - 1); __libasr_index_0_++) {
        b->data[((0 + (b->dims[0].stride * (__libasr_index_0_ - b->dims[0].lower_bound))) + b->offset)] = (float)(5);
    }
    float *d_a_data = NULL;
    float *d_b_data = NULL;
    cudaError_t err;
    size_t a_data_size = a->dims[0].length * sizeof(float);
    err = cudaMalloc((void**)&d_a_data, a_data_size);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed for a_data: %s", cudaGetErrorString(err));
        exit(1);
    }
    size_t b_data_size = b->dims[0].length * sizeof(float);
    err = cudaMalloc((void**)&d_b_data, b_data_size);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed for b_data: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaMemcpy(d_a_data, a->data, a_data_size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy H2D failed for a_data: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaMemcpy(d_b_data, b->data, b_data_size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy H2D failed for b_data: %s", cudaGetErrorString(err));
        exit(1);
    }
    struct r32 h_a_copy = *a;
    h_a_copy.data = d_a_data;
    struct r32 h_b_copy = *b;
    h_b_copy.data = d_b_data;
    struct r32 *d_a_struct = NULL;
    err = cudaMalloc((void**)&d_a_struct, sizeof(struct r32));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed for d_a_struct: %s", cudaGetErrorString(err));
        exit(1);
    }
    struct r32 *d_b_struct = NULL;
    err = cudaMalloc((void**)&d_b_struct, sizeof(struct r32));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed for d_b_struct: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaMemcpy(d_a_struct, &h_a_copy, sizeof(struct r32), cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy H2D failed for d_a_struct: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaMemcpy(d_b_struct, &h_b_copy, sizeof(struct r32), cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy H2D failed for d_b_struct: %s", cudaGetErrorString(err));
        exit(1);
    }
    int i_n = 10000000;
    int threads_per_block = 256;
    int blocks = (i_n + threads_per_block - 1) / threads_per_block;
    dim3 grid_dim = {blocks, 1, 1};
    dim3 block_dim = {threads_per_block, 1, 1};
    void *kernel_args[] = {&d_a_struct, &d_b_struct, &i_n};
    err = cudaLaunchKernel((void*)compute_kernel_0, grid_dim, block_dim, kernel_args, 0, NULL);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaLaunchKernel failed: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize failed: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaMemcpy(a->data, d_a_data, a_data_size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy D2H failed for a_data: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaMemcpy(b->data, d_b_data, b_data_size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy D2H failed for b_data: %s", cudaGetErrorString(err));
        exit(1);
    }
    cudaFree(d_a_data);
    cudaFree(d_a_struct);
    cudaFree(d_b_data);
    cudaFree(d_b_struct);
    printf("%f%s%f", a->data[((0 + (a->dims[0].stride * (5 - a->dims[0].lower_bound))) + a->offset)], " ", b->data[((0 + (b->dims[0].stride * (5 - b->dims[0].lower_bound))) + b->offset)]);
    if (a->data[((0 + (a->dims[0].stride * (5 - a->dims[0].lower_bound))) + a->offset)] != (float)(1705)) {
        fprintf(stderr, "ERROR STOP");
        exit(1);
    }
    if (b->data[((0 + (b->dims[0].stride * (5 - b->dims[0].lower_bound))) + b->offset)] != (float)(5)) {
        fprintf(stderr, "ERROR STOP");
        exit(1);
    }
    // FIXME: implicit deallocate(a, b, );
    return 0;
}

CPU Implementation Header (cpu_impl.h)
#ifndef CPU_IMPL_H
#define CPU_IMPL_H
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>
#include <math.h>
// CUDA Runtime API Emulation for CPU
typedef enum {
    cudaSuccess = 0,
    cudaErrorMemoryAllocation = 2,
    cudaErrorInvalidValue = 11
} cudaError_t;
// Device execution configuration
typedef struct {
    unsigned int x, y, z;
} dim3;
// Thread and block index emulation
typedef struct {
    unsigned int x, y, z;
} uint3;
// Global thread identifiers (CPU emulation)
extern __thread uint3 threadIdx;
extern __thread uint3 blockIdx;
extern __thread dim3 blockDim;
extern __thread dim3 gridDim;
// Memory management API
cudaError_t cudaMalloc(void **devPtr, size_t size);
cudaError_t cudaFree(void *devPtr);
cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, int kind);
cudaError_t cudaDeviceSynchronize(void);
// Memory copy kinds
#define cudaMemcpyHostToDevice 1
#define cudaMemcpyDeviceToHost 2
#define cudaMemcpyDeviceToDevice 3
// Kernel launch emulation - NOTE: Changed function signature
cudaError_t cudaLaunchKernel(void *func, dim3 gridDim, dim3 blockDim,
                            void **args, size_t sharedMem, void *stream);
// Error handling
const char* cudaGetErrorString(cudaError_t error);
// Device synchronization
#define __syncthreads() _Pragma("omp barrier")
// Memory allocation tracking structure
typedef struct {
    void *cpu_ptr;
    void *device_ptr;
    size_t size;
    int is_allocated;
} memory_tracker_t;
// Initialization function
void cpu_runtime_init(void);
void cpu_runtime_cleanup(void);
#endif // CPU_IMPL_H

CPU Implementation Source (cpu_impl.c)
#include "cpu_impl.h"
// Thread-local storage for CUDA-like thread coordinates
__thread uint3 threadIdx = {0, 0, 0};
__thread uint3 blockIdx = {0, 0, 0};
__thread dim3 blockDim = {1, 1, 1};
__thread dim3 gridDim = {1, 1, 1};
int counts=0;
// Memory tracking table
#define MAX_ALLOCATIONS 1024
memory_tracker_t memory_table[MAX_ALLOCATIONS];
int memory_count = 0;
// CPU Runtime initialization
void cpu_runtime_init(void) {
    memory_count = 0;
    for (int i = 0; i < MAX_ALLOCATIONS; i++) {
        memory_table[i].cpu_ptr = NULL;
        memory_table[i].device_ptr = NULL;
        memory_table[i].size = 0;
        memory_table[i].is_allocated = 0;
    }
}
void cpu_runtime_cleanup(void) {
    for (int i = 0; i < memory_count; i++) {
        if (memory_table[i].is_allocated && memory_table[i].cpu_ptr) {
            free(memory_table[i].cpu_ptr);
            memory_table[i].is_allocated = 0;
        }
    }
    memory_count = 0;
}
// CUDA Memory Management API Emulation
cudaError_t cudaMalloc(void **devPtr, size_t size) {
    if(memory_count >= MAX_ALLOCATIONS) {
        fprintf(stderr, "Error: Exceeded maximum memory allocations (%d)\n", MAX_ALLOCATIONS);
        return cudaErrorMemoryAllocation;
    }
    void *ptr = malloc(size);
    if (!ptr) {
        return cudaErrorMemoryAllocation;
    }
    *devPtr = ptr;
    memory_table[memory_count].cpu_ptr = ptr;
    memory_table[memory_count].device_ptr = ptr; // Same on CPU
    memory_table[memory_count].size = size;
    memory_table[memory_count].is_allocated = 1;
    memory_count++;
    return cudaSuccess;
}
cudaError_t cudaFree(void *devPtr) {
    for (int i = 0; i < memory_count; i++) {
        if (memory_table[i].device_ptr == devPtr && memory_table[i].is_allocated) {
            free(memory_table[i].cpu_ptr);
            memory_table[i].is_allocated = 0;
            return cudaSuccess;
        }
    }
    return cudaErrorInvalidValue;
}
cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, int kind) {
    memcpy(dst, src, count);
    return cudaSuccess;
}
cudaError_t cudaDeviceSynchronize(void) {
    return cudaSuccess;
}
// Forward declaration for the kernel wrapper
void compute_kernel_wrapper(void **args, void *func);
// Kernel execution emulation
cudaError_t cudaLaunchKernel(void *func, dim3 grid_dim, dim3 block_dim,
                            void **args, size_t sharedMem, void *stream) {
    long long total_blocks = grid_dim.x * grid_dim.y * grid_dim.z;
    long long threads_per_block = block_dim.x * block_dim.y * block_dim.z;
    long long total_threads = total_blocks * threads_per_block;
    long long max_omp_threads = omp_get_max_threads();
    long long threads_to_use = (total_blocks < max_omp_threads) ? total_blocks : max_omp_threads;
    #pragma omp parallel num_threads(threads_to_use)
    {
        long long omp_thread_id = omp_get_thread_num();
        long long num_omp_threads = omp_get_num_threads();
        for (long long block_id = omp_thread_id; block_id < total_blocks; block_id += num_omp_threads) {
            long long bx = block_id % grid_dim.x;
            long long by = (block_id / grid_dim.x) % grid_dim.y;
            long long bz = block_id / (grid_dim.x * grid_dim.y);
            for (long long thread_in_block = 0; thread_in_block < threads_per_block; thread_in_block++) {
                blockIdx.x = bx;
                blockIdx.y = by;
                blockIdx.z = bz;
                threadIdx.x = thread_in_block % block_dim.x;
                threadIdx.y = (thread_in_block / block_dim.x) % block_dim.y;
                threadIdx.z = thread_in_block / (block_dim.x * block_dim.y);
                blockDim = block_dim;
                gridDim = grid_dim;
                compute_kernel_wrapper(args, func);
            }
        }
    }
    return cudaSuccess;
}
// Error handling
const char* cudaGetErrorString(cudaError_t error) {
    switch (error) {
        case cudaSuccess:
            return "cudaSuccess";
        case cudaErrorMemoryAllocation:
            return "cudaErrorMemoryAllocation";
        case cudaErrorInvalidValue:
            return "cudaErrorInvalidValue";
        default:
            return "Unknown CUDA error";
    }
}

Running the Generated C-Code in Both Modes

The code can run in two modes depending on the compiler and flags:

  • GPU Mode: Compile with nvcc and define USE_GPU to use the CUDA runtime.
    Example commands:
    $ gcc -I/lfortran/src/libasr/runtime -I/lfortran/src/ -c /lfortran/src/libasr/runtime/lfortran_intrinsics.c -o intrinsic.o
    $ nvcc -O2 -x cu -DUSE_GPU -I/lfortran/src/libasr/runtime -I/lfortran/src/ -c omp_off_gen.c -o omp_off_gen.o
    $ nvcc intrinsic.o omp_off_gen.o -lm -o a
    $ ./a
    This offloads the computation to the GPU, using CUDA functions for memory management and kernel execution.
  • CPU Mode: Compile with gcc and include cpu_impl.c for CPU emulation.
    Example commands:
    $ gcc -I/lfortran/src/libasr/runtime -I/lfortran/src/ -c /lfortran/src/libasr/runtime/lfortran_intrinsics.c -o intrinsic.o
    $ gcc -fopenmp -I/lfortran/src/libasr/runtime -I/lfortran/src/ cpu_impl.c omp_off_gen.c intrinsic.o -lm -o a
    $ ./a
    This runs the code on the CPU, using OpenMP for parallelism and emulating CUDA behavior with the provided library.

Both modes produce the same result (a(5) = 5 + 5*340 = 1705 and b(5) = 5), confirming consistent behavior across platforms.

Next Steps

For Week 12, I plan to:

  • Prepare a PR with the dual-mode C-backend changes once CI testing is resolved.
  • Extend support for more complex offloading scenarios, like nested regions.

I thank my mentors, Ondrej Certik, Pranav Goswami, and Gaurav Dhingra, for their valuable insights during our meeting. I also appreciate the LFortran community’s support as I work on this exciting feature.