Following Week 10’s extension of the C-backend for target offloading, Week 11 focused on finding a way to test this on CI. After a fruitful discussion with my mentors Ondrej and Pranav, we agreed on a dual-mode approach for CPU and GPU testing. I spent about 32 hours adapting the C-backend to generate CUDA-specific code that still builds and runs on the CPU, so the same generated file works on both platforms with a simple switch.
Discussion and Strategy with Mentors
This week, I discussed with my mentors how to test target offloading on CI, since GitHub Actions runners lack GPU access. We agreed to generate C code with CUDA-style API calls, provide our own CPU implementations of those calls for CI testing (using a CPU_ENABLED switch), and use the real CUDA runtime for GPU testing (with a GPU_ENABLED switch). The plan was to extend the C-backend to produce CUDA code equivalent to the target, teams, and distribute parallel do constructs, including map clauses, while keeping CPU compatibility.
Implementation Approach
I started by building on last week’s C-backend output, which used OpenMP pragmas like target. I added a --target-offload flag to LFortran. When I run lfortran --openmp --show-c, it generates the usual OpenMP C code; with lfortran --openmp --show-c --target-offload, it produces CUDA-specific code. This keeps backward compatibility and lets me switch modes easily.
For the CUDA code, I first handwrote a version to test the concept, using LFortran’s array descriptor struct (struct r32 for floats) and checking that the results matched. Initially, I considered transferring each struct member (data, dims, and so on) to the GPU separately, but that was inefficient and would have required rewriting every statement in the kernel. Instead, I copy the whole struct to the device with cudaMalloc and cudaMemcpy, updating only the data pointer so that it points to device memory. This keeps the code generation simple; a minimal sketch of the pattern follows.
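To illustrate, here is a minimal sketch of that transfer pattern, assuming the struct r32 layout shown in the listings below. The helper copy_r32_to_device is a hypothetical name used only for this illustration; the generated code inlines the same steps and adds error checks.

// Sketch of the whole-struct transfer: copy the array payload first, then a
// shadow copy of the descriptor whose data member points at device memory.
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>
#include <cuda_runtime.h>

struct dimension_descriptor { int32_t lower_bound, length, stride; };
struct r32 {
    float *data;
    struct dimension_descriptor dims[32];
    int32_t n_dims;
    int32_t offset;
    bool is_allocated;
};

// Hypothetical helper (error checks omitted for brevity): returns a device
// pointer to a struct r32 that can be passed straight to the kernel, and
// hands back the raw device data pointer for the later D2H copy and cudaFree.
static struct r32 *copy_r32_to_device(struct r32 *host, float **d_data_out) {
    size_t bytes = host->dims[0].length * sizeof(float);
    float *d_data = NULL;
    cudaMalloc((void**)&d_data, bytes);
    cudaMemcpy(d_data, host->data, bytes, cudaMemcpyHostToDevice);

    struct r32 shadow = *host;   // bitwise copy of the whole descriptor
    shadow.data = d_data;        // only the data pointer is rewritten

    struct r32 *d_struct = NULL;
    cudaMalloc((void**)&d_struct, sizeof(struct r32));
    cudaMemcpy(d_struct, &shadow, sizeof(struct r32), cudaMemcpyHostToDevice);

    *d_data_out = d_data;
    return d_struct;
}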
Code Structure and Dual-Mode Support
The generated code uses a USE_GPU macro to switch between GPU and CPU modes. Here is how it works:
- GPU Mode (USE_GPU defined): Includes cuda_runtime.h, uses CUDA functions like cudaMalloc and cudaLaunchKernel, and compiles with nvcc. The kernel is marked __global__, and thread and block IDs are set using CUDA’s grid and block dimensions.
- CPU Mode (USE_GPU not defined): Includes cpu_impl.h, which provides CPU implementations of the CUDA functions using OpenMP. For example, cudaLaunchKernel emulates parallelism with #pragma omp parallel, adjusting blockIdx, threadIdx, blockDim, and gridDim to mimic CUDA behavior.
The kernel function compute_kernel_0 is defined with __global__ for GPU, or as a regular function for CPU. Memory management uses cudaMalloc and cudaMemcpy on the GPU, or malloc and memcpy on the CPU, tracked in a memory_tracker_t structure. A small self-contained example of this dual-mode layout is sketched below, before the full listings.
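To make the dual-mode layout concrete, here is a small self-contained example written in the same style. It is my own sketch, not LFortran’s output: scale_kernel and scale_kernel_wrapper are hypothetical names, but the USE_GPU switch, the cpu_impl.h emulation layer, and the compute_kernel_wrapper dispatch hook are the same mechanisms the generated code below relies on.

// Minimal dual-mode example: doubles every element of a float array.
// GPU path:  nvcc -O2 -x cu -DUSE_GPU example.c -o a
// CPU path:  gcc -O2 -fopenmp example.c cpu_impl.c -o a
#include <stdio.h>
#include <stdlib.h>
#ifdef USE_GPU
#include <cuda_runtime.h>   // real CUDA runtime
#else
#include "cpu_impl.h"       // OpenMP-based emulation of the CUDA API
#endif

#ifdef USE_GPU
__global__
#endif
void scale_kernel(float *x, int n) {
    // On the GPU these are CUDA built-ins; on the CPU they are the
    // thread-local variables maintained by cpu_impl.c.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

#ifndef USE_GPU
// cpu_impl.c forwards every launch to compute_kernel_wrapper, which must
// unpack the argument array and call the right kernel as a plain function.
void scale_kernel_wrapper(void **args) {
    float *x = *(float**)args[0];
    int n = *(int*)args[1];
    scale_kernel(x, n);
}
void compute_kernel_wrapper(void **args, void *func) {
    if (func == (void*)scale_kernel) scale_kernel_wrapper(args);
}
#endif

int main(void) {
    int n = 1024;
    float *h = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    // Identical calls on both paths; error checks omitted for brevity.
    float *d = NULL;
    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 grid = {(unsigned)((n + 255) / 256), 1, 1};
    dim3 block = {256, 1, 1};
    void *args[] = {&d, &n};
    cudaLaunchKernel((void*)scale_kernel, grid, block, args, 0, NULL);
    cudaDeviceSynchronize();

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("%f\n", h[0]);   // expect 2.000000 in both modes
    free(h);
    return 0;
}

The generated file below follows exactly this shape, only with compute_kernel_0 operating on struct r32 descriptors instead of a raw float pointer.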
View MRE for Target Offloading (openmp_70.f90)
program openmp_70
    implicit none
    real, allocatable, dimension(:) :: a, b
    integer :: i
    allocate(a(10000000), b(10000000))
    b = 5
    !$omp target map(tofrom:a, b)
    !$omp teams
    !$omp distribute parallel do
    do i = 1, 10000000
        a(i) = i + b(i)*340
    end do
    !$omp end distribute parallel do
    !$omp end teams
    !$omp end target
    print*, a(5), b(5)
    if(a(5) /= 1705) error stop
    if(b(5) /= 5) error stop
end program openmp_70
View Generated C Code with OpenMP (with lfortran --openmp --show-c)
#include <inttypes.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <lfortran_intrinsics.h>

struct dimension_descriptor
{
    int32_t lower_bound, length, stride;
};

struct r32
{
    float *data;
    struct dimension_descriptor dims[32];
    int32_t n_dims;
    int32_t offset;
    bool is_allocated;
};

// Implementations
float _lcompilers_real_i32(int32_t x)
{
    float _lcompilers_real_i32;
    _lcompilers_real_i32 = (float)(x);
    return _lcompilers_real_i32;
}

int main(int argc, char* argv[])
{
    _lpython_set_argv(argc, argv);
    int32_t __libasr_index_0_;
    struct r32 a_value;
    struct r32* a = &a_value;
    float *a_data;
    a->data = a_data;
    a->n_dims = 1;
    a->offset = 0;
    a->dims[0].lower_bound = 1;
    a->dims[0].length = 0;
    a->dims[0].stride = 1;
    struct r32 b_value;
    struct r32* b = &b_value;
    float *b_data;
    b->data = b_data;
    b->n_dims = 1;
    b->offset = 0;
    b->dims[0].lower_bound = 1;
    b->dims[0].length = 0;
    b->dims[0].stride = 1;
    int32_t i;
    a->n_dims = 1;
    a->dims[0].lower_bound = 1;
    a->dims[0].length = 10000000;
    a->dims[0].stride = 1;
    a->data = (float*) _lfortran_malloc(1*a->dims[0].length*sizeof(float));
    a->is_allocated = true;
    b->n_dims = 1;
    b->dims[0].lower_bound = 1;
    b->dims[0].length = 10000000;
    b->dims[0].stride = 1;
    b->data = (float*) _lfortran_malloc(1*b->dims[0].length*sizeof(float));
    b->is_allocated = true;
    for (__libasr_index_0_=((int32_t)b->dims[1-1].lower_bound); __libasr_index_0_<=((int32_t) b->dims[1-1].length + b->dims[1-1].lower_bound - 1); __libasr_index_0_++) {
        b->data[((0 + (b->dims[0].stride * (__libasr_index_0_ - b->dims[0].lower_bound))) + b->offset)] = (float)(5);
    }
#pragma omp target map(tofrom:a, b)
#pragma omp teams
#pragma omp distribute parallel for
    for (i=1; i<=10000000; i++) {
        a->data[((0 + (a->dims[0].stride * (i - a->dims[0].lower_bound))) + a->offset)] = _lcompilers_real_i32(i) + b->data[((0 + (b->dims[0].stride * (i - b->dims[0].lower_bound))) + b->offset)]*(float)(340);
    }
    printf("%f%s%f\n", a->data[((0 + (a->dims[0].stride * (5 - a->dims[0].lower_bound))) + a->offset)], " ", b->data[((0 + (b->dims[0].stride * (5 - b->dims[0].lower_bound))) + b->offset)]);
    if (a->data[((0 + (a->dims[0].stride * (5 - a->dims[0].lower_bound))) + a->offset)] != (float)(1705)) {
        fprintf(stderr, "ERROR STOP");
        exit(1);
    }
    if (b->data[((0 + (b->dims[0].stride * (5 - b->dims[0].lower_bound))) + b->offset)] != (float)(5)) {
        fprintf(stderr, "ERROR STOP");
        exit(1);
    }
    // FIXME: implicit deallocate(a, b, );
    return 0;
}
View Generated C Code with CUDA (omp_off_gen.c) (with lfortran --openmp --show-c --target-offload)
#include <inttypes.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <lfortran_intrinsics.h>
#ifdef USE_GPU
#include <cuda_runtime.h>
#else
#include "cpu_impl.h"
#endif
struct dimension_descriptor
{
    int32_t lower_bound, length, stride;
};
struct r32
{
    float *data;
    struct dimension_descriptor dims[32];
    int32_t n_dims;
    int32_t offset;
    bool is_allocated;
};
// Implementations
#ifdef USE_GPU
__global__
#endif
void compute_kernel_0(struct r32 *a, struct r32 *b, int i_n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    if (i <= i_n) {
        a->data[((0 + (a->dims[0].stride * (i - a->dims[0].lower_bound))) + a->offset)] = (float)(i) + b->data[((0 + (b->dims[0].stride * (i - b->dims[0].lower_bound))) + b->offset)]*(float)(340);
    }
}
#ifndef USE_GPU
void compute_kernel_0_wrapper(void **args) {
    struct r32 *a = *(struct r32**)args[0];
    struct r32 *b = *(struct r32**)args[1];
    int i_n = *(int*)args[2];
    compute_kernel_0(a, b, i_n);
}
#endif
#ifndef USE_GPU
void compute_kernel_wrapper(void **args, void *func) {
    if (func == (void*)compute_kernel_0) {
        compute_kernel_0_wrapper(args);
        return;
    }
    fprintf(stderr, "Unknown kernel function\n");
    exit(1);
}
#endif
int main(int argc, char* argv[])
{
    _lpython_set_argv(argc, argv);
    int32_t __libasr_index_0_;
    struct r32 a_value;
    struct r32* a = &a_value;
    float *a_data;
    a->data = a_data;
    a->n_dims = 1;
    a->offset = 0;
    a->dims[0].lower_bound = 1;
    a->dims[0].length = 0;
    a->dims[0].stride = 1;
    struct r32 b_value;
    struct r32* b = &b_value;
    float *b_data;
    b->data = b_data;
    b->n_dims = 1;
    b->offset = 0;
    b->dims[0].lower_bound = 1;
    b->dims[0].length = 0;
    b->dims[0].stride = 1;
    int32_t i;
    a->n_dims = 1;
    a->dims[0].lower_bound = 1;
    a->dims[0].length = 10000000;
    a->dims[0].stride = 1;
    a->data = (float*) _lfortran_malloc(1*a->dims[0].length*sizeof(float));
    a->is_allocated = true;
    b->n_dims = 1;
    b->dims[0].lower_bound = 1;
    b->dims[0].length = 10000000;
    b->dims[0].stride = 1;
    b->data = (float*) _lfortran_malloc(1*b->dims[0].length*sizeof(float));
    b->is_allocated = true;
    for (__libasr_index_0_=((int32_t)b->dims[1-1].lower_bound); __libasr_index_0_<=((int32_t) b->dims[1-1].length + b->dims[1-1].lower_bound - 1); __libasr_index_0_++) {
        b->data[((0 + (b->dims[0].stride * (__libasr_index_0_ - b->dims[0].lower_bound))) + b->offset)] = (float)(5);
    }
    float *d_a_data = NULL;
    float *d_b_data = NULL;
    cudaError_t err;
    size_t a_data_size = a->dims[0].length * sizeof(float);
    err = cudaMalloc((void**)&d_a_data, a_data_size);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed for a_data: %s", cudaGetErrorString(err));
        exit(1);
    }
    size_t b_data_size = b->dims[0].length * sizeof(float);
    err = cudaMalloc((void**)&d_b_data, b_data_size);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed for b_data: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaMemcpy(d_a_data, a->data, a_data_size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy H2D failed for a_data: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaMemcpy(d_b_data, b->data, b_data_size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy H2D failed for b_data: %s", cudaGetErrorString(err));
        exit(1);
    }
    struct r32 h_a_copy = *a;
    h_a_copy.data = d_a_data;
    struct r32 h_b_copy = *b;
    h_b_copy.data = d_b_data;
    struct r32 *d_a_struct = NULL;
    err = cudaMalloc((void**)&d_a_struct, sizeof(struct r32));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed for d_a_struct: %s", cudaGetErrorString(err));
        exit(1);
    }
    struct r32 *d_b_struct = NULL;
    err = cudaMalloc((void**)&d_b_struct, sizeof(struct r32));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed for d_b_struct: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaMemcpy(d_a_struct, &h_a_copy, sizeof(struct r32), cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy H2D failed for d_a_struct: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaMemcpy(d_b_struct, &h_b_copy, sizeof(struct r32), cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy H2D failed for d_b_struct: %s", cudaGetErrorString(err));
        exit(1);
    }
    int i_n = 10000000;
    int threads_per_block = 256;
    int blocks = (i_n + threads_per_block - 1) / threads_per_block;
    dim3 grid_dim = {blocks, 1, 1};
    dim3 block_dim = {threads_per_block, 1, 1};
    void *kernel_args[] = {&d_a_struct, &d_b_struct, &i_n};
    err = cudaLaunchKernel((void*)compute_kernel_0, grid_dim, block_dim, kernel_args, 0, NULL);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaLaunchKernel failed: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize failed: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaMemcpy(a->data, d_a_data, a_data_size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy D2H failed for a_data: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaMemcpy(b->data, d_b_data, b_data_size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy D2H failed for b_data: %s", cudaGetErrorString(err));
        exit(1);
    }
    cudaFree(d_a_data);
    cudaFree(d_a_struct);
    cudaFree(d_b_data);
    cudaFree(d_b_struct);
    printf("%f%s%f", a->data[((0 + (a->dims[0].stride * (5 - a->dims[0].lower_bound))) + a->offset)], " ", b->data[((0 + (b->dims[0].stride * (5 - b->dims[0].lower_bound))) + b->offset)]);
    if (a->data[((0 + (a->dims[0].stride * (5 - a->dims[0].lower_bound))) + a->offset)] != (float)(1705)) {
        fprintf(stderr, "ERROR STOP");
        exit(1);
    }
    if (b->data[((0 + (b->dims[0].stride * (5 - b->dims[0].lower_bound))) + b->offset)] != (float)(5)) {
        fprintf(stderr, "ERROR STOP");
        exit(1);
    }
    // FIXME: implicit deallocate(a, b, );
    return 0;
}
View CPU Implementation Header (cpu_impl.h)
#ifndef CPU_IMPL_H
#define CPU_IMPL_H
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>
#include <math.h>
// CUDA Runtime API Emulation for CPU
typedef enum {
    cudaSuccess = 0,
    cudaErrorMemoryAllocation = 2,
    cudaErrorInvalidValue = 11
} cudaError_t;
// Device execution configuration
typedef struct {
    unsigned int x, y, z;
} dim3;
// Thread and block index emulation
typedef struct {
    unsigned int x, y, z;
} uint3;
// Global thread identifiers (CPU emulation)
extern __thread uint3 threadIdx;
extern __thread uint3 blockIdx;
extern __thread dim3 blockDim;
extern __thread dim3 gridDim;
// Memory management API
cudaError_t cudaMalloc(void **devPtr, size_t size);
cudaError_t cudaFree(void *devPtr);
cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, int kind);
cudaError_t cudaDeviceSynchronize(void);
// Memory copy kinds
#define cudaMemcpyHostToDevice 1
#define cudaMemcpyDeviceToHost 2
#define cudaMemcpyDeviceToDevice 3
// Kernel launch emulation - NOTE: Changed function signature
cudaError_t cudaLaunchKernel(void *func, dim3 gridDim, dim3 blockDim,
                             void **args, size_t sharedMem, void *stream);
// Error handling
const char* cudaGetErrorString(cudaError_t error);
// Device synchronization
#define __syncthreads() _Pragma("omp barrier")
// Memory allocation tracking structure
typedef struct {
    void *cpu_ptr;
    void *device_ptr;
    size_t size;
    int is_allocated;
} memory_tracker_t;
// Initialization function
void cpu_runtime_init(void);
void cpu_runtime_cleanup(void);
#endif // CPU_IMPL_H
View CPU Implementation Source (cpu_impl.c)
#include "cpu_impl.h"
// Thread-local storage for CUDA-like thread coordinates
__thread uint3 threadIdx = {0, 0, 0};
__thread uint3 blockIdx = {0, 0, 0};
__thread dim3 blockDim = {1, 1, 1};
__thread dim3 gridDim = {1, 1, 1};
int counts = 0;
// Memory tracking table
#define MAX_ALLOCATIONS 1024
memory_tracker_t memory_table[MAX_ALLOCATIONS];
int memory_count = 0;
// CPU Runtime initialization
void cpu_runtime_init(void) {
    memory_count = 0;
    for (int i = 0; i < MAX_ALLOCATIONS; i++) {
        memory_table[i].cpu_ptr = NULL;
        memory_table[i].device_ptr = NULL;
        memory_table[i].size = 0;
        memory_table[i].is_allocated = 0;
    }
}
void cpu_runtime_cleanup(void) {
    for (int i = 0; i < memory_count; i++) {
        if (memory_table[i].is_allocated && memory_table[i].cpu_ptr) {
            free(memory_table[i].cpu_ptr);
            memory_table[i].is_allocated = 0;
        }
    }
    memory_count = 0;
}
// CUDA Memory Management API Emulation
cudaError_t cudaMalloc(void **devPtr, size_t size) {
    if (memory_count >= MAX_ALLOCATIONS) {
        fprintf(stderr, "Error: Exceeded maximum memory allocations (%d)\n", MAX_ALLOCATIONS);
        return cudaErrorMemoryAllocation;
    }
    void *ptr = malloc(size);
    if (!ptr) {
        return cudaErrorMemoryAllocation;
    }
    *devPtr = ptr;
    memory_table[memory_count].cpu_ptr = ptr;
    memory_table[memory_count].device_ptr = ptr; // Same on CPU
    memory_table[memory_count].size = size;
    memory_table[memory_count].is_allocated = 1;
    memory_count++;
    return cudaSuccess;
}
cudaError_t cudaFree(void *devPtr) {
    for (int i = 0; i < memory_count; i++) {
        if (memory_table[i].device_ptr == devPtr && memory_table[i].is_allocated) {
            free(memory_table[i].cpu_ptr);
            memory_table[i].is_allocated = 0;
            return cudaSuccess;
        }
    }
    return cudaErrorInvalidValue;
}
cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, int kind) {
    memcpy(dst, src, count);
    return cudaSuccess;
}
cudaError_t cudaDeviceSynchronize(void) {
    return cudaSuccess;
}
// Forward declaration for the kernel wrapper
void compute_kernel_wrapper(void **args, void *func);
// Kernel execution emulation
cudaError_t cudaLaunchKernel(void *func, dim3 grid_dim, dim3 block_dim,
                             void **args, size_t sharedMem, void *stream) {
    long long total_blocks = grid_dim.x * grid_dim.y * grid_dim.z;
    long long threads_per_block = block_dim.x * block_dim.y * block_dim.z;
    long long total_threads = total_blocks * threads_per_block;
    long long max_omp_threads = omp_get_max_threads();
    long long threads_to_use = (total_blocks < max_omp_threads) ? total_blocks : max_omp_threads;
    #pragma omp parallel num_threads(threads_to_use)
    {
        long long omp_thread_id = omp_get_thread_num();
        long long num_omp_threads = omp_get_num_threads();
        for (long long block_id = omp_thread_id; block_id < total_blocks; block_id += num_omp_threads) {
            long long bx = block_id % grid_dim.x;
            long long by = (block_id / grid_dim.x) % grid_dim.y;
            long long bz = block_id / (grid_dim.x * grid_dim.y);
            for (long long thread_in_block = 0; thread_in_block < threads_per_block; thread_in_block++) {
                blockIdx.x = bx;
                blockIdx.y = by;
                blockIdx.z = bz;
                threadIdx.x = thread_in_block % block_dim.x;
                threadIdx.y = (thread_in_block / block_dim.x) % block_dim.y;
                threadIdx.z = thread_in_block / (block_dim.x * block_dim.y);
                blockDim = block_dim;
                gridDim = grid_dim;
                compute_kernel_wrapper(args, func);
            }
        }
    }
    return cudaSuccess;
}
// Error handling
const char* cudaGetErrorString(cudaError_t error) {
    switch (error) {
        case cudaSuccess:
            return "cudaSuccess";
        case cudaErrorMemoryAllocation:
            return "cudaErrorMemoryAllocation";
        case cudaErrorInvalidValue:
            return "cudaErrorInvalidValue";
        default:
            return "Unknown CUDA error";
    }
}
Running the Generated C-Code in Both Modes
The code can run in two modes depending on the compiler and flags:
- GPU Mode: Compile with nvcc and define USE_GPU to use the CUDA runtime. This offloads the computation to the GPU, using CUDA functions for memory management and kernel execution. Example commands:
$ gcc -I/lfortran/src/libasr/runtime -I/lfortran/src/ -c /lfortran/src/libasr/runtime/lfortran_intrinsics.c -o intrinsic.o
$ nvcc -O2 -x cu -DUSE_GPU -I/lfortran/src/libasr/runtime -I/lfortran/src/ -c omp_off_gen.c -o omp_off_gen.o
$ nvcc intrinsic.o omp_off_gen.o -lm -o a
$ ./a
- CPU Mode: Compile with gcc and include cpu_impl.c for CPU emulation. This runs the code on the CPU, using OpenMP for parallelism and emulating CUDA behavior with the provided library. Example commands:
$ gcc -I/lfortran/src/libasr/runtime -I/lfortran/src/ -c /lfortran/src/libasr/runtime/lfortran_intrinsics.c -o intrinsic.o
$ gcc -fopenmp -I/lfortran/src/libasr/runtime -I/lfortran/src/ cpu_impl.c omp_off_gen.c intrinsic.o -lm -o a
$ ./a
Both modes produce the same result (a(5) = 5 + 5*340 = 1705 and b(5) = 5), so the checks in the MRE pass on either platform, ensuring consistency across them.
Next Steps
For Week 12, I plan to:
- Prepare a PR with the dual-mode C-backend changes once CI testing is resolved.
- Extend support for more complex offloading scenarios, like nested regions.
I thank my mentors, Ondrej Certik, Pranav Goswami, and Gaurav Dhingra, for their valuable insights during our meeting. I also appreciate the LFortran community’s support as I work on this exciting feature.