I am Aditya Trivedi, a contributor to Google Summer of Code (GSoC) 2025, working on enhancing OpenMP support in LFortran, an LLVM-based Fortran compiler. This blog post summarizes my progress during the first week, in which I focused on laying the groundwork for extending OpenMP features such as teams, tasks, and sections. My project builds upon LFortran’s existing OpenMP capabilities, positioning it as a robust tool for high-performance computing (HPC).
Objective
The primary goal of this project is to expand LFortran’s OpenMP support beyond the existing `parallel do` to constructs such as `teams`, `tasks`, `sections`, `single`, and SIMD, in alignment with the OpenMP 6.0 standard. This enhancement will enable LFortran to handle complex parallel workloads, making it competitive with established compilers such as GFortran and Clang for HPC applications. During Week 1, I worked for 38 hours, concentrating on analyzing the current design, identifying its limitations, proposing a new approach, and studying how other compilers implement OpenMP, thereby establishing a foundation for the upcoming implementation phase.
Current Design
LFortran currently supports the `parallel do` construct with clauses such as `private`, `shared`, `reduction`, and `collapse`. The implementation, as detailed in Issue #3777, operates as follows:
- Parsing: The `visit_Pragma` function in `ast_body_visitor.cpp` recognizes `!$omp parallel do` and converts it to a `DoConcurrentLoop` node in the Abstract Semantic Representation (ASR), capturing clauses and loop details.
- Backend: The OpenMP pass (`openmp.cpp`) outlines the loop body into a function, partitions iterations across threads using `omp_get_thread_num` and `omp_get_num_threads`, and generates `GOMP_parallel` calls to the `libgomp` runtime.
- Example: A `parallel do` loop is transformed into a `DoConcurrentLoop` node and lowered to a function with thread partitioning, as described in Issue #3777, Comment #2104814180; a sketch of this lowering appears below.
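To make the outlining concrete, here is a minimal sketch in Fortran of what such an outlined function looks like conceptually. The subroutine name, the static chunking scheme, and the placeholder loop body are my own illustrations rather than LFortran’s actual generated code; a driver would launch this through `GOMP_parallel`, as in the GOMP-based example later in this post.

```fortran
! Sketch of an outlined `parallel do` body: every thread computes its own
! contiguous chunk of the iteration space 1..n (static partitioning).
subroutine outlined_loop_body(data) bind(C)
    use, intrinsic :: iso_c_binding
    implicit none
    type(c_ptr), value :: data
    integer(c_int), pointer :: n
    integer :: tid, nthreads, chunk, first, last, i
    interface
        function omp_get_thread_num() bind(C, name="omp_get_thread_num")
            import :: c_int
            integer(c_int) :: omp_get_thread_num
        end function
        function omp_get_num_threads() bind(C, name="omp_get_num_threads")
            import :: c_int
            integer(c_int) :: omp_get_num_threads
        end function
    end interface
    call c_f_pointer(data, n)
    tid      = omp_get_thread_num()
    nthreads = omp_get_num_threads()
    chunk    = (n + nthreads - 1) / nthreads   ! ceiling division
    first    = tid * chunk + 1
    last     = min((tid + 1) * chunk, int(n))
    do i = first, last
        print *, "iteration ", i, " on thread ", tid   ! stands in for the loop body
    end do
end subroutine outlined_loop_body
```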
While this design is effective for loop-based parallelism, it presents limitations when attempting to support other OpenMP constructs.
Challenges in the Current Design
While the `DoConcurrentLoop` approach is suitable for `parallel do`, extending it to support new constructs such as `teams`, `tasks`, and `sections` introduces several challenges:
- Non-loop constructs: Constructs like `sections` (independent blocks) and `tasks` (dynamic scheduling) do not fit the loop-centric `DoConcurrentLoop` structure, necessitating complex workarounds; see the snippet after this list.
- Clause support: New clauses (e.g., `num_teams` for `teams`, `depend` for `tasks`) are difficult to integrate into the existing node’s clause arrays.
- Nesting: Handling nested constructs (e.g., a `parallel do` inside `teams`) is challenging, as `DoConcurrentLoop` assumes a single loop level.
- Scalability: Adapting a loop-based node to diverse constructs risks a convoluted design that would complicate maintenance as OpenMP continues to evolve.
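For instance, a `sections` region such as the following contains no loop at all, so there is nothing for a `DoConcurrentLoop` node to latch onto (`compute_a` and `compute_b` are hypothetical placeholders for independent work):

```fortran
! Two independent blocks, each executed once by some thread of the team;
! compute_a/compute_b are hypothetical placeholders.
!$omp parallel sections
!$omp section
call compute_a()
!$omp section
call compute_b()
!$omp end parallel sections
```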
These limitations, which are further discussed in Issue #7332, necessitated the exploration of alternative designs to better accommodate a wider range of OpenMP constructs.
Proposed Design: OMPRegion ASR Node
To address the identified challenges, I propose introducing a new `OMPRegion` ASR node, designed to handle all OpenMP constructs in a flexible manner. The proposed node structure is outlined below:
```
stmt
    = ...
    | OMPRegion(omp_region_type region, omp_clause* clauses, stmt* body)

omp_region_type
    = Parallel | Do | ParallelDo | Sections | Single | Task | Simd | Teams | Target | TargetData

omp_clause
    = OMPPrivate(expr* vars)
    | OMPShared(expr* vars)
    | OMPReduction(reduction_op operator, expr* vars)
    | ...

reduction_op
    = ReduceAdd | ReduceSub | ReduceMul | ReduceMIN | ReduceMAX

schedule_type
    = Static | Dynamic | Guided | Auto | Runtime

...
```
The benefits of this approach include:
- Flexibility: The node naturally supports both loop-based constructs (e.g., `parallel do`) and non-loop constructs (e.g., `sections`, `tasks`).
- Extensibility: New constructs and clauses can be added by extending the enums, keeping the design aligned with the OpenMP 6.0 specification.
- Nesting: Nested directives are managed through recursive `OMPRegion` nodes, covering complex scenarios such as `teams` containing a `parallel do` (sketched below).
- Standards alignment: The design mirrors GFortran’s tree nodes (e.g., `OMP_SECTIONS`, `OMP_TASK`) and Clang’s AST classes (e.g., `OMPSectionsDirective`, `OMPTaskDirective`), simplifying integration with the `libgomp` runtime.
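As a quick sketch of the nesting case, a `teams` region containing a `parallel do` is simply one `OMPRegion` nested inside another. The `OMPNumTeams` clause node and the constant are illustrative placeholders (the clause list in the structure above is elided), not a finalized ASR dump:

```
OMPRegion(
    region  = Teams,
    clauses = [OMPNumTeams(IntegerConstant(4))],
    body    = [
        OMPRegion(
            region  = ParallelDo,
            clauses = [],
            body    = [DoLoop(...)]
        )
    ]
)
```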
This proposed design, along with prototype minimal reproducible examples (MREs), has been detailed in Issue #7332, demonstrating its feasibility for implementation.
Exploration of Clang and GFortran’s OpenMP Handling
To inform the design of the `OMPRegion` node, I analyzed how Clang and GFortran process OpenMP constructs, focusing on `teams`, `tasks`, and `sections`:
GFortran:
- Frontend: Directives are parsed into specific tree nodes (e.g., `OMP_TEAMS`, `OMP_SECTIONS`), with `OMP_CLAUSE` nodes representing the associated clauses.
- Backend: These nodes are lowered to GIMPLE, where the directive bodies are outlined into functions and `libgomp` calls (e.g., `GOMP_teams`, `GOMP_sections_start`) are generated.
- Example: A `sections` directive is transformed into an `OMP_SECTIONS` node, which is then lowered to a switch statement driven by `GOMP_sections_start`, as documented in Issue #7332; a Fortran transcription of this lowering follows.
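To make this concrete, here is a minimal Fortran transcription of that switch-based lowering for a two-section region. GFortran emits the equivalent in GIMPLE, and in real generated code this loop runs inside a `GOMP_parallel` region so that several threads pull section IDs concurrently; the interface declarations are my own assumptions for this sketch.

```fortran
! Each thread repeatedly asks libgomp for the ID of the next unexecuted
! section (0 means none remain) and dispatches on it.
program sections_lowered
    use, intrinsic :: iso_c_binding
    implicit none
    interface
        function GOMP_sections_start(count) bind(C, name="GOMP_sections_start")
            import :: c_int
            integer(c_int), value :: count
            integer(c_int) :: GOMP_sections_start
        end function
        function GOMP_sections_next() bind(C, name="GOMP_sections_next")
            import :: c_int
            integer(c_int) :: GOMP_sections_next
        end function
        subroutine GOMP_sections_end() bind(C, name="GOMP_sections_end")
        end subroutine
    end interface
    integer(c_int) :: s
    s = GOMP_sections_start(2_c_int)   ! two sections in this region
    do while (s /= 0)
        select case (s)
        case (1)
            print *, "section 1"       ! body of the first section
        case (2)
            print *, "section 2"       ! body of the second section
        end select
        s = GOMP_sections_next()
    end do
    call GOMP_sections_end()           ! barrier at the end of the sections region
end program sections_lowered
```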
Clang:
- Frontend: OpenMP directives are represented as AST classes (e.g., `OMPTeamsDirective`, `OMPTaskDirective`), with separate objects for clauses.
- Backend: The AST is lowered to LLVM IR, generating `libomp` calls (e.g., `__kmpc_fork_teams`). LFortran, however, uses `libgomp` because of issues with variadic functions in `libomp`.
- Example: A `task` directive is represented as an `OMPTaskDirective`, which is lowered to `__kmpc_omp_task` calls.
This analysis, documented in Issue #7332, highlights the advantage of dedicated nodes for each construct: they enhance type safety and modularity. These findings support the adoption of the `OMPRegion` approach for LFortran.
Issues Opened
To track progress and facilitate collaboration, I have opened the following issues, each accompanied by minimal reproducible examples (MREs) in C and Fortran, both with and without pragmas, as well as GOMP-based implementations:
- [OPENMP] TEAM Construct #7363: Proposes support for the teams construct with clauses such as num_teams and thread_limit.
- [OPENMP] TASK Construct #7365: Focuses on implementing the task construct to enable dynamic scheduling.
- [OPENMP] SECTIONS Construct #7366: Addresses the sections construct for concurrent execution of independent code blocks.
- OpenMP Support Design Discussion #7332: Serves as the central hub for design discussions, MREs, and analysis of Clang and GFortran’s OpenMP implementations.
Example: Task Construct Representations
To illustrate the application of the proposed `OMPRegion` node, this section presents an example of the task construct as detailed in Issue #7365. The example is provided in multiple forms: Fortran code with OpenMP pragmas, Fortran code using GOMP runtime calls, the corresponding Clang AST representation, and the proposed LFortran ASR design.
Fortran Code with Pragmas
```fortran
program parallel_processing
use omp_lib
implicit none
integer, parameter :: N = 10
integer :: i
!$omp parallel
!$omp single
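! Only one thread executes this loop; each !$omp task spawns work
! that any thread of the team may pick up and run.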
do i = 1, N
!$omp task
call process_item(i)
!$omp end task
end do
!$omp end single
!$omp end parallel
contains
subroutine process_item(i)
integer, intent(in) :: i
integer :: thread_num
thread_num = omp_get_thread_num()
print *, "Processing item ", i, " on thread ", thread_num
end subroutine process_item
end program parallel_processing
```
Fortran Code without Pragmas (Using GOMP Calls)
```fortran
module thread_data_module_tasks
use, intrinsic :: iso_c_binding
implicit none
type, bind(C) :: thread_data
integer(c_int) :: i
end type thread_data
integer(c_long), parameter :: THREAD_DATA_SIZE = 4 ! Size of thread_data (bytes)
integer(c_long), parameter :: THREAD_DATA_ALIGN = 4 ! Alignment of thread_data (bytes)
end module thread_data_module_tasks
module omp_lib
use iso_c_binding
implicit none
interface
subroutine GOMP_parallel(fn, data, num_threads, flags) bind(C, name="GOMP_parallel")
import :: c_funptr, c_ptr, c_int
type(c_funptr), value :: fn
type(c_ptr), value :: data
integer(c_int), value :: num_threads
integer(c_int), value :: flags
end subroutine
subroutine GOMP_task(fn, data, cpyfn, arg_size, arg_align, if_clause, flags, depend) &
bind(C, name="GOMP_task")
use, intrinsic :: iso_c_binding
type(c_funptr), value :: fn, cpyfn
type(c_ptr), value :: data, depend
integer(c_long), value :: arg_size, arg_align
logical(c_bool), value :: if_clause
integer(c_int), value :: flags
end subroutine
function omp_get_thread_num() bind(c, name="omp_get_thread_num")
import :: c_int
integer(c_int) :: omp_get_thread_num
end function
end interface
end module omp_lib
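! The work performed by each task: report the item and the executing thread.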
subroutine process_item(i)
use omp_lib
implicit none
integer, intent(in) :: i
print *, "Processing item ", i, " on thread ", omp_get_thread_num()
end subroutine process_item
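! Trampoline passed to GOMP_task: unpacks the c_ptr payload into thread_data
! and invokes process_item.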
subroutine task_fn(data) bind(C)
use thread_data_module_tasks
implicit none
type(c_ptr), value :: data
type(thread_data), pointer :: d
call c_f_pointer(data, d)
call process_item(d%i)
end subroutine task_fn
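! Body of the parallel region: thread 0 plays the role of !$omp single,
! creating one task per item via GOMP_task.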
subroutine parallel_region(data) bind(C)
use thread_data_module_tasks
use omp_lib
implicit none
type(c_ptr), value :: data
integer(c_int), pointer :: n
integer :: i
type(thread_data), target :: task_data
type(c_ptr) :: task_ptr
interface
subroutine task_fn(data) bind(C)
use thread_data_module_tasks
type(c_ptr), value :: data
end subroutine task_fn
end interface
call c_f_pointer(data, n)
if (omp_get_thread_num() == 0) then
do i = 1, n
task_data%i = i
task_ptr = c_loc(task_data)
call GOMP_task(c_funloc(task_fn), task_ptr, c_null_funptr, THREAD_DATA_SIZE, &
THREAD_DATA_ALIGN, .true._c_bool, 0, c_null_ptr)
end do
end if
end subroutine parallel_region
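! Driver: launches parallel_region on a thread team via GOMP_parallel
! (num_threads = 0 selects the default).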
program main
use thread_data_module_tasks
use omp_lib
use, intrinsic :: iso_c_binding
implicit none
integer, target :: n = 10
type(c_ptr) :: ptr
interface
subroutine parallel_region(data) bind(C)
use thread_data_module_tasks
type(c_ptr), value :: data
end subroutine parallel_region
end interface
ptr = c_loc(n)
call GOMP_parallel(c_funloc(parallel_region), ptr, 0, 0)
end program main
```
Clang AST Representation
```
|-OMPParallelDirective 0x5dcc678e58f0 <line:12:5, col:25>
| `-CapturedStmt 0x5dcc678e5870 <line:13:5, line:23:5>
| |-CapturedDecl 0x5dcc678e44b8 <<invalid sloc>> <invalid sloc> nothrow
| | |-CompoundStmt 0x5dcc678e57d0 <line:13:5, line:23:5>
| | | `-OMPSingleDirective 0x5dcc678e5798 <line:14:9, col:27>
| | | `-CapturedStmt 0x5dcc678e5738 <line:15:9, line:22:9>
| | | |-CapturedDecl 0x5dcc678e4ae8 <<invalid sloc>> <invalid sloc>
| | | | |-CompoundStmt 0x5dcc678e5698 <line:15:9, line:22:9>
| | | | | `-ForStmt 0x5dcc678e5660 <line:16:13, line:21:13>
| | | | | |-DeclStmt 0x5dcc678e4c88 <line:16:18, col:27>
| | | | | | `-VarDecl 0x5dcc678e4c00 <col:18, col:26> col:22 used i 'int' cinit
| | | | | `-CompoundStmt 0x5dcc678e5648 <col:42, line:21:13>
| | | | | `-OMPTaskDirective 0x5dcc678e5600 <line:17:17, col:33>
| | | | | |-OMPFirstprivateClause 0x5dcc678e55c0 <<invalid sloc>> <implicit>
| | | | | `-CapturedStmt 0x5dcc678e53e0 <line:18:17, line:20:17>
| | | | | `-CapturedDecl 0x5dcc678e4f78 <<invalid sloc>> <invalid sloc> nothrow
| | | | | |-CompoundStmt 0x5dcc678e53c8 <line:18:17, line:20:17>
| | | | | | `-CallExpr 0x5dcc678e5388 <line:19:21, col:35> 'void'
| | | | | | |-ImplicitCastExpr 0x5dcc678e5370 <col:21> 'void (*)(int)' <FunctionToPointerDecay>
| | | | | | | `-DeclRefExpr 0x5dcc678e5300 <col:21> 'void (int)' Function 0x5dcc678e3f08 'process_item' 'void (int)'
| | | | | | `-ImplicitCastExpr 0x5dcc678e53b0 <col:34> 'int' <LValueToRValue>
```
Proposed LFortran ASR Design
```
OMPRegion(
region = Parallel,
clauses = [],
body = [
OMPRegion(
region = Single,
clauses = [],
body = [
DoLoop(
head = [{v = "i", start = IntegerConstant(1), end = IntegerConstant(10)}],
body = [
OMPRegion(
region = Task,
clauses = [],
body = [Call(symbol="process_item")]
)
]
)
]
)
]
)
```
Next Steps
In Week 2, I plan to focus on the following tasks:
- Implement the `OMPRegion` node in ASR for the sections construct (Issue #7366), designing the AST visitor so that it can be easily extended to represent other constructs and clauses.
- Validate the MREs against the outputs of GFortran and Clang, using flags such as `-fdump-tree-all` (GFortran) and `-Xclang -ast-dump` (Clang), to ensure correctness.
I would like to thank my mentors, Ondrej Certik, Pranav Goswami, and Gaurav Dhingra, for their critical reviews and guidance, which played an important role in improving the design of `OMPRegion`. I also thank the other LFortran contributors for their support and help whenever needed.