Following Week 10’s extension of the C-backend for target offloading, Week 11 focused on finding a way to test this on CI. After a fruitful discussion with my mentors Ondrej and Pranav, we agreed on a dual-mode approach for CPU and GPU testing. I spent about 32 hours adapting the C-backend to generate CUDA-specific code while maintaining CPU compatibility, ensuring the code works on both platforms with a simple switch.
Discussion and Strategy with Mentors
This week, I met with my mentors to work out how to test target offloading on CI, since GitHub Actions runners have no GPU access. We concluded that the backend should generate C code with explicit runtime API calls: for CI we provide our own CPU implementations of those calls (behind a CPU_ENABLED switch), while for GPU testing the same code links against the CUDA runtime (behind a GPU_ENABLED switch). The plan was to extend the C-backend to produce CUDA code equivalent to the target, teams, and distribute parallel do constructs, including their map clauses, while keeping CPU compatibility.
Implementation Approach
I started by building on last week’s C-backend output, which used OpenMP pragmas like target. I added a --target-offload flag to LFortran. When I run lfortran --openmp --show-c, it generates the usual OpenMP C code. With lfortran --openmp --show-c --target-offload, it produces CUDA-specific code. This keeps backward compatibility and lets me switch modes easily.
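For example, on the MRE shown further below (the same openmp_70.f90 used in my tests), the two outputs come from:

$ lfortran --openmp --show-c openmp_70.f90
$ lfortran --openmp --show-c --target-offload openmp_70.f90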
For the CUDA code, I first handwrote a version to test the concept, using LFortran’s array descriptor struct (struct r32 for floats) and verifying that the results matched the OpenMP version. Initially I considered transferring each struct member (data, dims, and so on) to the GPU separately, but that was inefficient and would have required rewriting every statement inside the kernel. Instead, I copy the whole struct to the device using cudaMalloc and cudaMemcpy, updating only its data pointer to point to device memory. This keeps the kernel body unchanged and simplifies code generation.
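Distilled from the generated code shown later, the pattern looks roughly like this (a minimal sketch: the helper name is mine and error checking is omitted for brevity):

#include <cuda_runtime.h>
#include <inttypes.h>
#include <stdbool.h>

// Abridged copy of the descriptor layout the C-backend emits.
struct dimension_descriptor { int32_t lower_bound, length, stride; };
struct r32 {
    float *data;
    struct dimension_descriptor dims[32];
    int32_t n_dims;
    int32_t offset;
    bool is_allocated;
};

// Copy a whole r32 descriptor to the device, redirecting only its data pointer.
struct r32 *copy_descriptor_to_device(struct r32 *host, float **d_data_out) {
    size_t bytes = host->dims[0].length * sizeof(float);

    float *d_data;
    cudaMalloc((void**)&d_data, bytes);                             // device array
    cudaMemcpy(d_data, host->data, bytes, cudaMemcpyHostToDevice);  // copy the payload

    struct r32 shadow = *host;    // host-side copy of the full descriptor
    shadow.data = d_data;         // only the data pointer changes

    struct r32 *d_struct;
    cudaMalloc((void**)&d_struct, sizeof(struct r32));
    cudaMemcpy(d_struct, &shadow, sizeof(struct r32), cudaMemcpyHostToDevice);

    *d_data_out = d_data;         // kept so results can be copied back and freed later
    return d_struct;              // passed to the kernel; dims, offset, etc. travel with it
}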
Code Structure and Dual-Mode Support
The generated code uses a macro USE_GPU to switch between GPU and CPU modes. Here’s how it works:
- GPU Mode (USE_GPU defined): Includes cuda_runtime.h, uses CUDA functions like cudaMalloc and cudaLaunchKernel, and compiles with nvcc. The kernel is marked __global__, and thread and block IDs come from CUDA's grid and block dimensions.
- CPU Mode (USE_GPU not defined): Includes cpu_impl.h, which provides CPU implementations of the CUDA functions using OpenMP. For example, cudaLaunchKernel emulates parallelism with #pragma omp parallel, adjusting blockIdx, threadIdx, blockDim, and gridDim to mimic CUDA behavior.
The kernel function compute_kernel_0 is declared __global__ for the GPU build and as a regular function for the CPU build. Memory management goes through cudaMalloc and cudaMemcpy in both modes: on the GPU these are the real CUDA calls, while on the CPU they fall back to malloc and memcpy, with each allocation tracked in a memory_tracker_t structure.
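To make the thread-to-iteration mapping concrete, here is a small standalone C sketch (not generated code, just the same arithmetic the CUDA output below performs) showing how the 10,000,000 loop iterations of the MRE are split into blocks of 256 threads:

#include <stdio.h>

int main(void) {
    int i_n = 10000000;                                              // loop iterations in the MRE
    int threads_per_block = 256;
    int blocks = (i_n + threads_per_block - 1) / threads_per_block;  // ceiling division -> 39063
    long long total_threads = (long long)blocks * threads_per_block; // 10000128 threads launched

    printf("blocks = %d, threads = %lld, surplus = %lld\n",
           blocks, total_threads, total_threads - i_n);

    // Each thread computes i = blockIdx.x * blockDim.x + threadIdx.x + 1 and only
    // executes the loop body if (i <= i_n), so the 128 surplus threads stay idle.
    return 0;
}

The same bounds guard is what keeps the CPU emulation correct as well, since cpu_impl.c simply iterates over every (block, thread) pair and calls the kernel with the corresponding indices.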
MRE for Target Offloading (openmp_70.f90)

program openmp_70
  implicit none
  real, allocatable, dimension(:) :: a, b
  integer :: i
  allocate(a(10000000), b(10000000))
  b = 5
  !$omp target map(tofrom:a, b)
  !$omp teams
  !$omp distribute parallel do
  do i = 1, 10000000
    a(i) = i + b(i)*340
  end do
  !$omp end distribute parallel do
  !$omp end teams
  !$omp end target
  print*, a(5), b(5)
  if (a(5) /= 1705) error stop
  if (b(5) /= 5) error stop
end program openmp_70

Generated C Code with OpenMP (with lfortran --openmp --show-c)
#include <inttypes.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <lfortran_intrinsics.h>

struct dimension_descriptor
{
    int32_t lower_bound, length, stride;
};

struct r32
{
    float *data;
    struct dimension_descriptor dims[32];
    int32_t n_dims;
    int32_t offset;
    bool is_allocated;
};

// Implementations
float _lcompilers_real_i32(int32_t x)
{
    float _lcompilers_real_i32;
    _lcompilers_real_i32 = (float)(x);
    return _lcompilers_real_i32;
}

int main(int argc, char* argv[])
{
    _lpython_set_argv(argc, argv);
    int32_t __libasr_index_0_;
    struct r32 a_value;
    struct r32* a = &a_value;
    float *a_data;
    a->data = a_data;
    a->n_dims = 1;
    a->offset = 0;
    a->dims[0].lower_bound = 1;
    a->dims[0].length = 0;
    a->dims[0].stride = 1;
    struct r32 b_value;
    struct r32* b = &b_value;
    float *b_data;
    b->data = b_data;
    b->n_dims = 1;
    b->offset = 0;
    b->dims[0].lower_bound = 1;
    b->dims[0].length = 0;
    b->dims[0].stride = 1;
    int32_t i;
    a->n_dims = 1;
    a->dims[0].lower_bound = 1;
    a->dims[0].length = 10000000;
    a->dims[0].stride = 1;
    a->data = (float*) _lfortran_malloc(1*a->dims[0].length*sizeof(float));
    a->is_allocated = true;
    b->n_dims = 1;
    b->dims[0].lower_bound = 1;
    b->dims[0].length = 10000000;
    b->dims[0].stride = 1;
    b->data = (float*) _lfortran_malloc(1*b->dims[0].length*sizeof(float));
    b->is_allocated = true;
    for (__libasr_index_0_=((int32_t)b->dims[1-1].lower_bound); __libasr_index_0_<=((int32_t) b->dims[1-1].length + b->dims[1-1].lower_bound - 1); __libasr_index_0_++) {
        b->data[((0 + (b->dims[0].stride * (__libasr_index_0_ - b->dims[0].lower_bound))) + b->offset)] = (float)(5);
    }
#pragma omp target map(tofrom:a, b)
#pragma omp teams
#pragma omp distribute parallel for
    for (i=1; i<=10000000; i++) {
        a->data[((0 + (a->dims[0].stride * (i - a->dims[0].lower_bound))) + a->offset)] = _lcompilers_real_i32(i) + b->data[((0 + (b->dims[0].stride * (i - b->dims[0].lower_bound))) + b->offset)]*(float)(340);
    }
    printf("%f%s%f\n", a->data[((0 + (a->dims[0].stride * (5 - a->dims[0].lower_bound))) + a->offset)], " ", b->data[((0 + (b->dims[0].stride * (5 - b->dims[0].lower_bound))) + b->offset)]);
    if (a->data[((0 + (a->dims[0].stride * (5 - a->dims[0].lower_bound))) + a->offset)] != (float)(1705)) {
        fprintf(stderr, "ERROR STOP");
        exit(1);
    }
    if (b->data[((0 + (b->dims[0].stride * (5 - b->dims[0].lower_bound))) + b->offset)] != (float)(5)) {
        fprintf(stderr, "ERROR STOP");
        exit(1);
    }
    // FIXME: implicit deallocate(a, b, );
    return 0;
}

Generated C Code with CUDA (omp_off_gen.c, with lfortran --openmp --show-c --target-offload)
#include <inttypes.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <lfortran_intrinsics.h>
#ifdef USE_GPU
#include <cuda_runtime.h>
#else
#include "cpu_impl.h"
#endif
struct dimension_descriptor
{
    int32_t lower_bound, length, stride;
};
struct r32
{
    float *data;
    struct dimension_descriptor dims[32];
    int32_t n_dims;
    int32_t offset;
    bool is_allocated;
};
// Implementations
#ifdef USE_GPU
__global__
#endif
void compute_kernel_0(struct r32 *a, struct r32 *b, int i_n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    if (i <= i_n) {
        a->data[((0 + (a->dims[0].stride * (i - a->dims[0].lower_bound))) + a->offset)] = (float)(i) + b->data[((0 + (b->dims[0].stride * (i - b->dims[0].lower_bound))) + b->offset)]*(float)(340);
    }
}
#ifndef USE_GPU
void compute_kernel_0_wrapper(void **args) {
    struct r32 *a = *(struct r32**)args[0];
    struct r32 *b = *(struct r32**)args[1];
    int i_n = *(int*)args[2];
    compute_kernel_0(a, b, i_n);
}
#endif
#ifndef USE_GPU
void compute_kernel_wrapper(void **args, void *func) {
    if (func == (void*)compute_kernel_0) {
        compute_kernel_0_wrapper(args);
        return;
    }
    fprintf(stderr, "Unknown kernel function\n");
    exit(1);
}
#endif
int main(int argc, char* argv[])
{
    _lpython_set_argv(argc, argv);
    int32_t __libasr_index_0_;
    struct r32 a_value;
    struct r32* a = &a_value;
    float *a_data;
    a->data = a_data;
    a->n_dims = 1;
    a->offset = 0;
    a->dims[0].lower_bound = 1;
    a->dims[0].length = 0;
    a->dims[0].stride = 1;
    struct r32 b_value;
    struct r32* b = &b_value;
    float *b_data;
    b->data = b_data;
    b->n_dims = 1;
    b->offset = 0;
    b->dims[0].lower_bound = 1;
    b->dims[0].length = 0;
    b->dims[0].stride = 1;
    int32_t i;
    a->n_dims = 1;
    a->dims[0].lower_bound = 1;
    a->dims[0].length = 10000000;
    a->dims[0].stride = 1;
    a->data = (float*) _lfortran_malloc(1*a->dims[0].length*sizeof(float));
    a->is_allocated = true;
    b->n_dims = 1;
    b->dims[0].lower_bound = 1;
    b->dims[0].length = 10000000;
    b->dims[0].stride = 1;
    b->data = (float*) _lfortran_malloc(1*b->dims[0].length*sizeof(float));
    b->is_allocated = true;
    for (__libasr_index_0_=((int32_t)b->dims[1-1].lower_bound); __libasr_index_0_<=((int32_t) b->dims[1-1].length + b->dims[1-1].lower_bound - 1); __libasr_index_0_++) {
        b->data[((0 + (b->dims[0].stride * (__libasr_index_0_ - b->dims[0].lower_bound))) + b->offset)] = (float)(5);
    }
    float *d_a_data = NULL;
    float *d_b_data = NULL;
    cudaError_t err;
    size_t a_data_size = a->dims[0].length * sizeof(float);
    err = cudaMalloc((void**)&d_a_data, a_data_size);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed for a_data: %s", cudaGetErrorString(err));
        exit(1);
    }
    size_t b_data_size = b->dims[0].length * sizeof(float);
    err = cudaMalloc((void**)&d_b_data, b_data_size);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed for b_data: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaMemcpy(d_a_data, a->data, a_data_size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy H2D failed for a_data: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaMemcpy(d_b_data, b->data, b_data_size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy H2D failed for b_data: %s", cudaGetErrorString(err));
        exit(1);
    }
    struct r32 h_a_copy = *a;
    h_a_copy.data = d_a_data;
    struct r32 h_b_copy = *b;
    h_b_copy.data = d_b_data;
    struct r32 *d_a_struct = NULL;
    err = cudaMalloc((void**)&d_a_struct, sizeof(struct r32));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed for d_a_struct: %s", cudaGetErrorString(err));
        exit(1);
    }
    struct r32 *d_b_struct = NULL;
    err = cudaMalloc((void**)&d_b_struct, sizeof(struct r32));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed for d_b_struct: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaMemcpy(d_a_struct, &h_a_copy, sizeof(struct r32), cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy H2D failed for d_a_struct: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaMemcpy(d_b_struct, &h_b_copy, sizeof(struct r32), cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy H2D failed for d_b_struct: %s", cudaGetErrorString(err));
        exit(1);
    }
    int i_n = 10000000;
    int threads_per_block = 256;
    int blocks = (i_n + threads_per_block - 1) / threads_per_block;
    dim3 grid_dim = {blocks, 1, 1};
    dim3 block_dim = {threads_per_block, 1, 1};
    void *kernel_args[] = {&d_a_struct, &d_b_struct, &i_n};
    err = cudaLaunchKernel((void*)compute_kernel_0, grid_dim, block_dim, kernel_args, 0, NULL);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaLaunchKernel failed: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize failed: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaMemcpy(a->data, d_a_data, a_data_size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy D2H failed for a_data: %s", cudaGetErrorString(err));
        exit(1);
    }
    err = cudaMemcpy(b->data, d_b_data, b_data_size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy D2H failed for b_data: %s", cudaGetErrorString(err));
        exit(1);
    }
    cudaFree(d_a_data);
    cudaFree(d_a_struct);
    cudaFree(d_b_data);
    cudaFree(d_b_struct);
    printf("%f%s%f", a->data[((0 + (a->dims[0].stride * (5 - a->dims[0].lower_bound))) + a->offset)], " ", b->data[((0 + (b->dims[0].stride * (5 - b->dims[0].lower_bound))) + b->offset)]);
    if (a->data[((0 + (a->dims[0].stride * (5 - a->dims[0].lower_bound))) + a->offset)] != (float)(1705)) {
        fprintf(stderr, "ERROR STOP");
        exit(1);
    }
    if (b->data[((0 + (b->dims[0].stride * (5 - b->dims[0].lower_bound))) + b->offset)] != (float)(5)) {
        fprintf(stderr, "ERROR STOP");
        exit(1);
    }
    // FIXME: implicit deallocate(a, b, );
    return 0;
}

CPU Implementation Header (cpu_impl.h)
#ifndef CPU_IMPL_H
#define CPU_IMPL_H
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>
#include <math.h>
// CUDA Runtime API Emulation for CPU
typedef enum {
    cudaSuccess = 0,
    cudaErrorMemoryAllocation = 2,
    cudaErrorInvalidValue = 11
} cudaError_t;
// Device execution configuration
typedef struct {
    unsigned int x, y, z;
} dim3;
// Thread and block index emulation
typedef struct {
    unsigned int x, y, z;
} uint3;
// Global thread identifiers (CPU emulation)
extern __thread uint3 threadIdx;
extern __thread uint3 blockIdx;
extern __thread dim3 blockDim;
extern __thread dim3 gridDim;
// Memory management API
cudaError_t cudaMalloc(void **devPtr, size_t size);
cudaError_t cudaFree(void *devPtr);
cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, int kind);
cudaError_t cudaDeviceSynchronize(void);
// Memory copy kinds
#define cudaMemcpyHostToDevice 1
#define cudaMemcpyDeviceToHost 2
#define cudaMemcpyDeviceToDevice 3
// Kernel launch emulation - NOTE: Changed function signature
cudaError_t cudaLaunchKernel(void *func, dim3 gridDim, dim3 blockDim,
                             void **args, size_t sharedMem, void *stream);
// Error handling
const char* cudaGetErrorString(cudaError_t error);
// Device synchronization
#define __syncthreads() _Pragma("omp barrier")
// Memory allocation tracking structure
typedef struct {
    void *cpu_ptr;
    void *device_ptr;
    size_t size;
    int is_allocated;
} memory_tracker_t;
// Initialization function
void cpu_runtime_init(void);
void cpu_runtime_cleanup(void);
#endif // CPU_IMPL_H

CPU Implementation Source (cpu_impl.c)
1#include "cpu_impl.h"
2// Thread-local storage for CUDA-like thread coordinates
3__thread uint3 threadIdx = {0, 0, 0};
4__thread uint3 blockIdx = {0, 0, 0};
5__thread dim3 blockDim = {1, 1, 1};
6__thread dim3 gridDim = {1, 1, 1};
7int counts=0;
8// Memory tracking table
9#define MAX_ALLOCATIONS 1024
10memory_tracker_t memory_table[MAX_ALLOCATIONS];
11int memory_count = 0;
12// CPU Runtime initialization
13void cpu_runtime_init(void) {
14 memory_count = 0;
15 for (int i = 0; i < 1024; i++) {
16 memory_table[i].cpu_ptr = NULL;
17 memory_table[i].device_ptr = NULL;
18 memory_table[i].size = 0;
19 memory_table[i].is_allocated = 0;
20 }
21}
22void cpu_runtime_cleanup(void) {
23 for (int i = 0; i < memory_count; i++) {
24 if (memory_table[i].is_allocated && memory_table[i].cpu_ptr) {
25 free(memory_table[i].cpu_ptr);
26 memory_table[i].is_allocated = 0;
27 }
28 }
29 memory_count = 0;
30}
31// CUDA Memory Management API Emulation
32cudaError_t cudaMalloc(void **devPtr, size_t size) {
33 if(memory_count > MAX_ALLOCATIONS) {
34 fprintf(stderr, "Error: Exceeded maximum memory allocations (%d)
35", MAX_ALLOCATIONS);
36 return cudaErrorMemoryAllocation;
37 }
38 void *ptr = malloc(size);
39 if (!ptr) {
40 return cudaErrorMemoryAllocation;
41 }
42 *devPtr = ptr;
43 memory_table[memory_count].cpu_ptr = ptr;
44 memory_table[memory_count].device_ptr = ptr; // Same on CPU
45 memory_table[memory_count].size = size;
46 memory_table[memory_count].is_allocated = 1;
47 memory_count++;
48 return cudaSuccess;
49}
50cudaError_t cudaFree(void *devPtr) {
51 for (int i = 0; i < memory_count; i++) {
52 if (memory_table[i].device_ptr == devPtr && memory_table[i].is_allocated) {
53 free(memory_table[i].cpu_ptr);
54 memory_table[i].is_allocated = 0;
55 return cudaSuccess;
56 }
57 }
58 return cudaErrorInvalidValue;
59}
60cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, int kind) {
61 memcpy(dst, src, count);
62 return cudaSuccess;
63}
64cudaError_t cudaDeviceSynchronize(void) {
65 return cudaSuccess;
66}
67// Forward declaration for the kernel wrapper
68void compute_kernel_wrapper(void **args, void *func);
69// kernel execution emulation
70cudaError_t cudaLaunchKernel(void *func, dim3 grid_dim, dim3 block_dim,
71 void **args, size_t sharedMem, void *stream) {
72 long long total_blocks = grid_dim.x * grid_dim.y * grid_dim.z;
73 long long threads_per_block = block_dim.x * block_dim.y * block_dim.z;
74 long long total_threads = total_blocks * threads_per_block;
75 long long max_omp_threads = omp_get_max_threads();
76 long long threads_to_use = (total_blocks < max_omp_threads) ? total_blocks : max_omp_threads;
77 #pragma omp parallel num_threads(threads_to_use)
78 {
79 long long omp_thread_id = omp_get_thread_num();
80 long long num_omp_threads = omp_get_num_threads();
81 for (long long block_id = omp_thread_id; block_id < total_blocks; block_id += num_omp_threads) {
82 long long bx = block_id % grid_dim.x;
83 long long by = (block_id / grid_dim.x) % grid_dim.y;
84 long long bz = block_id / (grid_dim.x * grid_dim.y);
85 for (long long thread_in_block = 0; thread_in_block < threads_per_block; thread_in_block++) {
86 blockIdx.x = bx;
87 blockIdx.y = by;
88 blockIdx.z = bz;
89 threadIdx.x = thread_in_block % block_dim.x;
90 threadIdx.y = (thread_in_block / block_dim.x) % block_dim.y;
91 threadIdx.z = thread_in_block / (block_dim.x * block_dim.y);
92 blockDim = block_dim;
93 gridDim = grid_dim;
94 compute_kernel_wrapper(args, func);
95 }
96 }
97 }
98 return cudaSuccess;
99}
100// Error handling
101const char* cudaGetErrorString(cudaError_t error) {
102 switch (error) {
103 case cudaSuccess:
104 return "cudaSuccess";
105 case cudaErrorMemoryAllocation:
106 return "cudaErrorMemoryAllocation";
107 case cudaErrorInvalidValue:
108 return "cudaErrorInvalidValue";
109 default:
110 return "Unknown CUDA error";
111 }
112}
113Running the Generated C-Code in Both Modes
The code can run in two modes depending on the compiler and flags:
- GPU Mode: Compile with nvcc and define USE_GPU to use the CUDA runtime. This offloads the computation to the GPU, using CUDA functions for memory management and kernel execution. Example commands:

  $ gcc -I/lfortran/src/libasr/runtime -I/lfortran/src/ -c /lfortran/src/libasr/runtime/lfortran_intrinsics.c -o intrinsic.o
  $ nvcc -O2 -x cu -DUSE_GPU -I/lfortran/src/libasr/runtime -I/lfortran/src/ -c omp_off_gen.c -o omp_off_gen.o
  $ nvcc intrinsic.o omp_off_gen.o -lm -o a
  $ ./a

- CPU Mode: Compile with gcc and include cpu_impl.c for the CPU emulation. This runs the code on the CPU, using OpenMP for parallelism and emulating CUDA behavior with the provided library. Example commands:

  $ gcc -I/lfortran/src/libasr/runtime -I/lfortran/src/ -c /lfortran/src/libasr/runtime/lfortran_intrinsics.c -o intrinsic.o
  $ gcc -fopenmp -I/lfortran/src/libasr/runtime -I/lfortran/src/ cpu_impl.c omp_off_gen.c intrinsic.o -lm -o a
  $ ./a
Both modes produce the same result (a(5) = 5 + 5*340 = 1705 and b(5) = 5), confirming consistent behavior across platforms.
Next Steps
For Week 12, I plan to:
- Prepare a PR with the dual-mode C-backend changes once CI testing is resolved.
- Extend support for more complex offloading scenarios, like nested regions.
I thank my mentors, Ondrej Certik, Pranav Goswami, and Gaurav Dhingra, for their valuable insights during our meeting. I also appreciate the LFortran community’s support as I work on this exciting feature.