Following Week 11’s development of the dual-mode C-backend for CPU and GPU offloading, Week 12, the final week of GSoC, focused on cleaning up the branch and preparing PR #8243 for merge. Last week, I planned to refine the backend and explore CI solutions. This week, I fixed CI failures in the target offload backend, renamed the CPU implementation file, merged a PR for the combined TEAMS DISTRIBUTE construct, and documented steps for using the C-backend to dump OMP and CUDA code. Altogether, I spent about 24 hours wrapping up these tasks.

Implementation Details and Fixes

In PR #8243, I addressed CI failures caused by a missing runtime dependency: the config.h header, which CMake generates from config.h.in during the LFortran build. After debugging, I added a configure_file() step to the CMakeLists.txt for the target_offload backend, so the header is generated only when that backend is selected. This fixed the failures, and the CI passed. I also renamed cpu_impl.h to cuda_runtime_impl.h.

Updated CMakeLists.txt for the target_offload backend:
elseif (LFORTRAN_BACKEND STREQUAL "target_offload")
    set(c_file "${CMAKE_CURRENT_BINARY_DIR}/${file_name}.c")
    # Convert the Fortran test to C with target offloading enabled
    execute_process(
        COMMAND lfortran ${extra_args} --openmp --show-c --target-offload
                ${CMAKE_CURRENT_SOURCE_DIR}/${file_name}.f90
        OUTPUT_FILE ${c_file}
        RESULT_VARIABLE convert_result
    )
    # Generate config.h from config.h.in so the libasr runtime
    # sources can find it (the missing dependency behind the CI failures)
    configure_file(
        ${CMAKE_SOURCE_DIR}/../src/libasr/config.h.in
        ${CMAKE_SOURCE_DIR}/../src/libasr/config.h
        @ONLY
    )
    # Build the generated C file together with the runtime sources
    add_executable(${name}
        ${CMAKE_SOURCE_DIR}/../src/libasr/runtime/lfortran_intrinsics.c
        ${CMAKE_SOURCE_DIR}/../src/libasr/runtime/cuda_runtime_impl.c
        ${c_file}
    )
    target_include_directories(${name} PUBLIC
        ${CMAKE_SOURCE_DIR}/../src/
        ${CMAKE_SOURCE_DIR}/../src/libasr/runtime
    )
    target_compile_options(${name} PUBLIC -fopenmp)
    target_link_libraries(${name} PUBLIC m)
    target_link_options(${name} PUBLIC -fopenmp)
    target_link_options(${name} PUBLIC -L$ENV{CONDA_PREFIX}/lib ${extra_args})
    add_test(NAME ${name} COMMAND ${name})

In a separate PR, I implemented the combined TEAMS DISTRIBUTE construct: the standalone TEAMS and DISTRIBUTE directives already worked, but their combined form did not. This ensures consistent behavior for nested and combined directives; a minimal sketch of both forms follows. I also documented the steps to generate OMP and CUDA code with the C-backend (via the --openmp --show-c --target-offload flags used in the CMake snippet above) in Issue #4497.
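
Here is a minimal sketch of the two forms in standard OpenMP Fortran syntax; the loop body and the num_teams value are illustrative, not taken from the LFortran test suite:

program teams_distribute_demo
    implicit none
    integer, parameter :: n = 1000
    integer :: i
    real :: a(n)

    ! Split form: TEAMS and DISTRIBUTE as separate, nested directives
    !$omp teams num_teams(4)
    !$omp distribute
    do i = 1, n
        a(i) = real(i)
    end do
    !$omp end distribute
    !$omp end teams

    ! Combined form: a single directive, now lowered consistently
    ! with the nested version above
    !$omp teams distribute num_teams(4)
    do i = 1, n
        a(i) = 2.0 * a(i)
    end do
    !$omp end teams distribute

    print *, a(1), a(n)
end program teams_distribute_demo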

Next Steps

As this is the final week of GSoC, my focus will shift to wrapping up documentation, preparing for the final evaluation, and planning post-GSoC contributions.

  • Finalize any open issues and merge remaining features.
  • Expand the documentation for OpenMP support in LFortran.
  • Explore additional OpenMP constructs for future work.

Conclusion on the 12-Week Journey

Over these 12 weeks, I significantly advanced OpenMP support in LFortran, enabling a broader set of parallel programming capabilities and paving the way for high-performance computing and GPU acceleration. The supported constructs and clauses are summarized below; a short example follows the list.

  • Thread-based parallel constructs: parallel, do, single, master
  • Task-based constructs: task, taskloop, taskwait
  • Teams & distributed work constructs: teams, distribute, section/sections
  • Synchronization constructs: critical, taskwait, atomic, barrier
  • Data & control clauses: reduction, shared, private, collapse, schedule
  • Environment & resource clauses: num_teams, num_threads, thread_limit
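
As a quick illustration of several of these working together, here is a minimal sketch in standard OpenMP Fortran (a parallel do with private, reduction, and schedule clauses); it is a hand-written example, not a verbatim test from the LFortran suite:

program reduction_demo
    implicit none
    integer, parameter :: n = 100000
    integer :: i
    real :: term, total

    total = 0.0
    ! Parallel loop: each thread keeps a private temporary and
    ! contributes to a sum reduction over total
    !$omp parallel do private(term) reduction(+:total) schedule(static)
    do i = 1, n
        term = 1.0 / real(i)
        total = total + term
    end do
    !$omp end parallel do

    print *, "partial harmonic sum:", total
end program reduction_demo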

Full technical details can be found in Issue #7332.

The Journey: From Designing ASR to GPU Offloading in a Nutshell

  • Design phase: Created the OMPRegion ASR node to support nested and combined directives with a stack-based approach for scalability and maintainability.
  • Implementation phase: Incrementally integrated OpenMP constructs and clauses into the compiler’s OpenMP pass, resolving challenges like variable scoping, shared data handling, and segmentation faults.
  • GPU exploration phase: Studied Clang’s LLVM-based host-device model, the libomptarget runtime, and GPU memory management to extend LFortran’s C-backend for target offloading. This work resulted in PR #8243 and Issue #4497, enabling CPU/GPU dual-mode execution and dumping of equivalent OMP-C/CUDA code.

Key Learnings

  • Designing compiler features is the most critical foundational step: deep research and forward planning determine the feasibility of implementing the entire feature set.
  • Implementation often requires multiple iterations and careful debugging to achieve stability and optimal performance.
  • Understanding the full stack—from language constructs to runtime behavior—is essential when bridging CPU and GPU execution paths.
  • Effective communication and collaboration with mentors and the open-source community boost problem-solving and knowledge growth.

Final Status

  • 12 OpenMP constructs and 8 clauses fully implemented and tested in LFortran.
  • OMPRegion-based architecture ready for future OpenMP extensions.
  • Initial GPU offloading support integrated via C-backend with host-device mode switching.
  • Foundational groundwork laid for LFortran to compile and run parallel Fortran code efficiently on both CPUs and GPUs.

I’m grateful for this opportunity to contribute to the HPC domain and for the guidance of my mentors — Ondrej Certik, Pranav Goswami, and Gaurav Dhingra — as well as the LFortran community, whose support made this 12-week journey both rewarding and transformative.