Building on the progress made in Week 2, where I implemented the foundational structure for the OMPRegion ASR node and adopted a stack-based approach to handle nested OpenMP constructs, Week 3 focused on integrating the existing parallel do logic into this new design. In my previous blog post, I outlined a plan to extend the OpenMP pass to visit the OMPRegion node and support the sections construct. However, I decided to first shift the existing parallel do logic to use the OMPRegion node, as this would lay the groundwork for implementing all other constructs, including sections. This week, I successfully completed this transition through PR #7593, dedicating approximately 32 hours to ensure no regression in existing test cases while extending the OpenMP pass to handle parallel and do constructs in both nested and combined forms.

Choice of Shifting Parallel Do Logic First

Initially, my plan was to implement the sections construct in OpenMp pass, as it represents a non-loop-based paradigm distinct from the existing parallel do implementation. However, upon further consideration, I realized that implementing sections would anyway require support for the standalone parallel construct, which is a fundamental component of many OpenMP directives, including parallel do. So, rather than diving straight into sections construct I chose to first shift the existing parallel do logic from converting directly to a DoConcurrentLoop node to the more adaptable OMPRegion node and then lower down this new ASR node in the OpenMp pass.

This shift was crucial for several reasons. First, it ensures a unified representation of all OpenMP constructs under the OMPRegion node, facilitating future extensions for constructs like sections and teams. Second, I needed to guarantee that all existing OpenMP test cases, which currently compile successfully, would not break during this transition. I anticipated that this would be a longer PR, as it required updating all reference test cases to reflect the new ASR node and ensuring that the extended OpenMP pass could handle the updated structure without regression.

Implementation Details and Results

In PR #7593, I made significant changes to de-couple the DoConcurrentLoop logic from OMP Pragmas. The key changes included:

  • Removed the older logic in the AST -> ASR conversion that directly converted parallel do pragmas to a DoConcurrentLoop node, replacing it with a conversion to an OMPRegion node.
  • Extended the OpenMP pass to visit the OMPRegion node, handling parallel do constructs in a manner equivalent to the previous DoConcurrentLoop approach with moderate changes.
  • Updated all ASR of all existing OpenMP test cases to reflect the new changes, verifying that each test case compiles successfully with the extended OpenMP pass, thereby ensuring no regression in existing functionality.

The results of this implementation are significant. First, the transition to the OMPRegion node allows LFortran to handle standalone parallel constructs, as well as parallel do in both nested and combined forms, without disrupting existing test cases. For example, a standalone parallel region can now be represented and executed correctly, also nested constructs like parallel containing a do loop, or combined constructs like parallel do are too handled. Additionally, intermediate statements within parallel regions, which previously posed challenges (e.g., as seen in openmp_06.f90), are now handled seamlessly, addressing issues like those in Issue #4147. This PR also lays the groundwork for implementing other constructs like sections, for which placeholder functions have been added to demonstrate the intended approach.

Example: Parallel Do Construct with Reduction

To illustrate the new OMPRegion implementation for the parallel do construct, consider the following example program, openmp_08.f90, which uses a parallel do construct with a reduction clause:

View Original Fortran Code
1subroutine increment_ctr(n, ctr)
2    use omp_lib
3    implicit none
4    integer, intent(in) :: n
5    real, intent(out) :: ctr
6    
7    real :: local_ctr
8    
9    integer :: i
10    
11    local_ctr = 1
12    !$omp parallel private(i) reduction(*:local_ctr)
13    !$omp do
14    do i = 1, n
15        local_ctr = local_ctr * 1.5
16    end do
17    !$omp end do
18    !$omp end parallel
19    
20    ctr = ctr + local_ctr
21end subroutine
22    
23program openmp_08
24    use omp_lib
25    integer, parameter :: n = 10
26    real :: ctr
27    real :: res = 1.5**10
28    
29    call omp_set_num_threads(8)
30    ctr = 0
31    call increment_ctr(n, ctr)
32    print *, ctr
33    if(abs((ctr - res)) > 0.0002 ) error stop
34end program

The OpenMP pass now converts this parallel do construct into an OMPRegion node and lowers it to use GOMP runtime calls. The equivalent lowered Fortran code, which explicitly uses GOMP calls to achieve the same parallelism, is shown below:

View Lowered Fortran Code Using GOMP Calls
1module thread_data_module
2  use, intrinsic :: iso_c_binding
3  implicit none
4  type, bind(C) :: thread_data
5    integer(c_int) :: i
6    real(c_float) :: local_ctr
7    integer(c_int) :: n
8  end type thread_data
9end module thread_data_module
10
11module omp_lib
12  use iso_c_binding
13  implicit none
14  interface
15    subroutine GOMP_parallel(fn, data, num_threads, flags) bind(C, name="GOMP_parallel")
16      import :: c_funptr, c_ptr, c_int
17      type(c_funptr), value :: fn
18      type(c_ptr), value :: data
19      integer(c_int), value :: num_threads
20      integer(c_int), value :: flags
21    end subroutine
22    subroutine GOMP_atomic_start() bind(C, name="GOMP_atomic_start")
23    end subroutine
24    subroutine GOMP_atomic_end() bind(C, name="GOMP_atomic_end")
25    end subroutine
26    subroutine GOMP_barrier() bind(C, name="GOMP_barrier")
27    end subroutine
28    function omp_get_max_threads() bind(C, name="omp_get_max_threads")
29      import :: c_int
30      integer(c_int) :: omp_get_max_threads
31    end function
32    function omp_get_thread_num() bind(C, name="omp_get_thread_num")
33      import :: c_int
34      integer(c_int) :: omp_get_thread_num
35    end function
36    subroutine omp_set_num_threads(n) bind(C, name="omp_set_num_threads")
37      import :: c_int
38      integer(c_int), value :: n
39    end subroutine
40  end interface
41end module omp_lib
42
43subroutine lcompilers_parallel_func(data) bind(C)
44  use thread_data_module
45  use omp_lib
46  implicit none
47  type(c_ptr), value :: data
48  type(thread_data), pointer :: tdata
49  integer :: i, n, num_threads, thread_num, start, end, chunk, leftovers, I
50  real :: local_ctr
51
52  call c_f_pointer(data, tdata)
53
54  ! Extract variables from thread_data
55  i = tdata%i
56  local_ctr = tdata%local_ctr
57  n = tdata%n
58
59  ! Thread partitioning logic
60  num_threads = omp_get_max_threads()
61  chunk = (1 * ((n - 1) + 1)) / num_threads
62  leftovers = mod(1 * ((n - 1) + 1), num_threads)
63  thread_num = omp_get_thread_num()
64
65  start = chunk * thread_num
66  if (thread_num < leftovers) then
67    start = start + thread_num
68  else
69    start = start + leftovers
70  end if
71
72  end = start + chunk
73  if (thread_num < leftovers) then
74    end = end + 1
75  end if
76
77  ! Initialize local reduction variable
78  local_ctr = 1.0
79
80  ! Loop over assigned iterations
81  do I = start + 1, end
82    i = mod(I, (n - 1) + 1) + 1
83    local_ctr = local_ctr * 1.5
84  end do
85
86  ! Atomic update for reduction
87  call GOMP_atomic_start()
88  tdata%local_ctr = tdata%local_ctr * local_ctr
89  call GOMP_atomic_end()
90
91  ! Barrier synchronization
92  call GOMP_barrier()
93end subroutine lcompilers_parallel_func
94
95subroutine increment_ctr(n, ctr)
96  use thread_data_module
97  use omp_lib
98  implicit none
99  integer, intent(in) :: n
100  real, intent(out) :: ctr
101
102  real :: local_ctr
103  integer :: i
104  type(thread_data), target :: data
105  type(c_ptr) :: tdata
106
107  local_ctr = 1.0
108
109  ! Populate thread_data structure
110  data%i = i
111  data%local_ctr = local_ctr
112  data%n = n
113
114  ! Convert to C pointer
115  tdata = c_loc(data)
116
117  ! Call GOMP_parallel to execute the parallel region
118  call GOMP_parallel(c_funloc(lcompilers_parallel_func), tdata, 0, 0)
119
120  ! Retrieve updated local_ctr after parallel execution
121  local_ctr = data%local_ctr
122
123  ! Update output variable
124  ctr = ctr + local_ctr
125end subroutine increment_ctr
126
127program openmp_08
128  use omp_lib
129  implicit none
130  integer, parameter :: n = 10
131  real :: ctr
132  real :: res = 1.5**10
133
134  call omp_set_num_threads(8)
135  ctr = 0.0
136  call increment_ctr(n, ctr)
137  print *, ctr
138  if (abs(ctr - res) > 0.0002) then
139    error stop
140  end if
141end program openmp_08

In the lowered code, the parallel do construct is replaced with explicit GOMP runtime calls. A thread_data structure is used to pass variables between the main function and the parallel function, ensuring thread-safe access. The GOMP_parallel call invokes the parallel region by executing lcompilers_parallel_func, which handles thread partitioning by calculating each thread’s start and end indices based on the number of threads and loop iterations. The reduction(*:local_ctr) clause is implemented using GOMP_atomic_start and GOMP_atomic_end to atomically update the shared local_ctr variable, and a GOMP_barrier ensures all threads synchronize before proceeding.

Below are the relevant portions of the ASR generated by OpenMp pass, for the increment_ctr subroutine, the parallel function lcompilers_parallel_func, and the main program openmp_08:

ASR for increment_ctr Subroutine

View ASR for increment_ctr
1[(Assignment
2    (Var 2 local_ctr)
3    (Cast
4        (IntegerConstant 1 (Integer 4) Decimal)
5        IntegerToReal
6        (Real 4)
7        (RealConstant 1.000000 (Real 4))
8    )
9    ()
10    .false.
11)
12(Assignment
13    (StructInstanceMember (Var 2 data) 2 thread_data_i (Integer 4) ())
14    (Var 2 i)
15    ()
16    .false.
17)
18(Assignment
19    (StructInstanceMember (Var 2 data) 2 thread_data_local_ctr (Real 4) ())
20    (Var 2 local_ctr)
21    ()
22    .false.
23)
24(Assignment
25    (StructInstanceMember (Var 2 data) 2 thread_data_n (Integer 4) ())
26    (Var 2 n)
27    ()
28    .false.
29)
30(Assignment
31    (Var 2 tdata)
32    (PointerToCPtr
33        (GetPointer (Var 2 data) (Pointer (StructType [] [] .true. 2 thread_data)))
34        (CPtr)
35        ()
36    )
37    ()
38    .false.
39)
40(SubroutineCall
41    2 gomp_parallel
42    ()
43    [((PointerToCPtr
44        (GetPointer (Var 2 lcompilers_parallel_func)
45            (Pointer (FunctionType [(CPtr)] () BindC Interface () .false. .false. .false. .false. .false. [] .false.)))
46        (CPtr)
47        ()
48    ))
49    ((Var 2 tdata))
50    ((IntegerConstant 0 (Integer 4) Decimal))
51    ((IntegerConstant 0 (Integer 4) Decimal))]
52    ()
53)
54(Assignment
55    (Var 2 local_ctr)
56    (StructInstanceMember (Var 2 data) 2 thread_data_local_ctr (Real 4) ())
57    ()
58    .false.
59)
60(Assignment
61    (Var 2 ctr)
62    (RealBinOp (Var 2 ctr) Add (Var 2 local_ctr) (Real 4) ())
63    ()
64    .false.
65)]

ASR for lcompilers_parallel_func

View ASR for lcompilers_parallel_func
1[(CPtrToPointer
2    (Var 30 data)
3    (Var 30 tdata)
4    ()
5    ()
6)
7(Assignment
8    (Var 30 i)
9    (StructInstanceMember (Var 30 tdata) 30 thread_data_i (Integer 4) ())
10    ()
11    .false.
12)
13(Assignment
14    (Var 30 local_ctr)
15    (StructInstanceMember (Var 30 tdata) 30 thread_data_local_ctr (Real 4) ())
16    ()
17    .false.
18)
19(Assignment
20    (Var 30 n)
21    (StructInstanceMember (Var 30 tdata) 30 thread_data_n (Integer 4) ())
22    ()
23    .false.
24)
25(Assignment
26    (Var 30 num_threads)
27    (FunctionCall 30 omp_get_max_threads 30 omp_get_max_threads [] (Integer 4) () ())
28    ()
29    .false.
30)
31(Assignment
32    (Var 30 chunk)
33    (IntegerBinOp
34        (IntegerBinOp
35            (IntegerConstant 1 (Integer 4) Decimal)
36            Mul
37            (IntegerBinOp
38                (IntegerBinOp (Var 30 n) Sub (IntegerConstant 1 (Integer 4) Decimal) (Integer 4) ())
39                Add
40                (IntegerConstant 1 (Integer 4) Decimal)
41                (Integer 4)
42                ()
43            )
44            (Integer 4)
45            ()
46        )
47        Div
48        (Var 30 num_threads)
49        (Integer 4)
50        ()
51    )
52    ()
53    .false.
54)
55(Assignment
56    (Var 30 leftovers)
57    (IntrinsicElementalFunction
58        Mod
59        [(IntegerBinOp
60            (IntegerConstant 1 (Integer 4) Decimal)
61            Mul
62            (IntegerBinOp
63                (IntegerBinOp (Var 30 n) Sub (IntegerConstant 1 (Integer 4) Decimal) (Integer 4) ())
64                Add
65                (IntegerConstant 1 (Integer 4) Decimal)
66                (Integer 4)
67                ()
68            )
69            (Integer 4)
70            ())
71        (Var 30 num_threads)]
72        0
73        (Integer 4)
74        ()
75    )
76    ()
77    .false.
78)
79(Assignment
80    (Var 30 thread_num)
81    (FunctionCall 30 omp_get_thread_num 30 omp_get_thread_num [] (Integer 4) () ())
82    ()
83    .false.
84)
85(Assignment
86    (Var 30 start)
87    (IntegerBinOp (Var 30 chunk) Mul (Var 30 thread_num) (Integer 4) ())
88    ()
89    .false.
90)
91(If
92    (IntegerCompare (Var 30 thread_num) Lt (Var 30 leftovers) (Logical 4) ())
93    [(Assignment
94        (Var 30 start)
95        (IntegerBinOp (Var 30 start) Add (Var 30 thread_num) (Integer 4) ())
96        ()
97        .false.
98    )]
99    [(Assignment
100        (Var 30 start)
101        (IntegerBinOp (Var 30 start) Add (Var 30 leftovers) (Integer 4) ())
102        ()
103        .false.
104    )]
105)
106(Assignment
107    (Var 30 end)
108    (IntegerBinOp (Var 30 start) Add (Var 30 chunk) (Integer 4) ())
109    ()
110    .false.
111)
112(If
113    (IntegerCompare (Var 30 thread_num) Lt (Var 30 leftovers) (Logical 4) ())
114    [(Assignment
115        (Var 30 end)
116        (IntegerBinOp (Var 30 end) Add (IntegerConstant 1 (Integer 4) Decimal) (Integer 4) ())
117        ()
118        .false.
119    )]
120    []
121)
122(Assignment
123    (Var 30 local_ctr)
124    (RealConstant 1.000000 (Real 4))
125    ()
126    .false.
127)
128(DoLoop
129    ()
130    ((Var 30 I)
131    (IntegerBinOp (Var 30 start) Add (IntegerConstant 1 (Integer 4) Decimal) (Integer 4) ())
132    (Var 30 end)
133    ())
134    [(Assignment
135        (Var 30 i)
136        (IntegerBinOp
137            (IntrinsicElementalFunction
138                Mod
139                [(Var 30 I)
140                (IntegerBinOp
141                    (IntegerBinOp (Var 30 n) Sub (IntegerConstant 1 (Integer 4) Decimal) (Integer 4) ())
142                    Add
143                    (IntegerConstant 1 (Integer 4) Decimal)
144                    (Integer 4)
145                    ()
146                )]
147                0
148                (Integer 4)
149                ()
150            )
151            Add
152            (IntegerConstant 1 (Integer 4) Decimal)
153            (Integer 4)
154            ()
155        )
156        ()
157        .false.
158    )
159    (Assignment
160        (Var 30 local_ctr)
161        (RealBinOp (Var 30 local_ctr) Mul (RealConstant 1.500000 (Real 4)) (Real 4) ())
162        ()
163        .false.
164    )]
165    []
166)
167(SubroutineCall 30 gomp_atomic_start () [] ())
168(Assignment
169    (StructInstanceMember (Var 30 tdata) 30 thread_data_local_ctr (Real 4) ())
170    (RealBinOp
171        (StructInstanceMember (Var 30 tdata) 30 thread_data_local_ctr (Real 4) ())
172        Mul
173        (Var 30 local_ctr)
174        (Real 4)
175        ()
176    )
177    ()
178    .false.
179)
180(SubroutineCall 30 gomp_atomic_end () [] ())
181(SubroutineCall 30 gomp_barrier () [] ())]

ASR for openmp_08 Program

View ASR for openmp_08
1[(SubroutineCall
2    27 omp_set_num_threads
3    ()
4    [((IntegerConstant 8 (Integer 4) Decimal))]
5    ()
6)
7(Assignment
8    (Var 27 ctr)
9    (Cast
10        (IntegerConstant 0 (Integer 4) Decimal)
11        IntegerToReal
12        (Real 4)
13        (RealConstant 0.000000 (Real 4))
14    )
15    ()
16    .false.
17)
18(SubroutineCall
19    1 increment_ctr
20    ()
21    [((Var 27 n)) ((Var 27 ctr))]
22    ()
23)
24(Print
25    (StringFormat () [(Var 27 ctr)] FormatFortran (String 1 () ExpressionLength CString) ())
26)
27(If
28    (RealCompare
29        (IntrinsicElementalFunction
30            Abs
31            [(RealBinOp (Var 27 ctr) Sub (Var 27 res) (Real 4) ())]
32            0
33            (Real 4)
34            ()
35        )
36        Gt
37        (RealConstant 0.000200 (Real 4))
38        (Logical 4)
39        ()
40    )
41    [(ErrorStop ())]
42    []
43)]

Next Steps

In Week 4, I plan to focus on the following tasks:

  • While completing this PR I came across various cases where some clauses didn't work in parallel construct, hence it seems that we need to handle the clauses for each, nested as well as combined constructs, for which I will further investigate, report and fix MREs for the same
  • Implement the sections construct using the OMPRegion node, lowering it to GOMP_sections_start and GOMP_sections_end calls.

I would like to thank my mentors, Ondrej Certik, Pranav Goswami, and Gaurav Dhingra, for their critical reviews and guidance, which were instrumental in ensuring the success of this PR. I also thank the other contributors of LFortran for their support and help whenever needed.