Building on the progress made in Week 2, where I implemented the foundational structure for the OMPRegion
ASR node and adopted a stack-based approach to handle nested OpenMP constructs, Week 3 focused on integrating the existing parallel do
logic into this new design. In my previous blog post, I outlined a plan to extend the OpenMP pass to visit the OMPRegion
node and support the sections
construct. However, I decided to first shift the existing parallel do
logic to use the OMPRegion
node, as this would lay the groundwork for implementing all other constructs, including sections
. This week, I successfully completed this transition through PR #7593, dedicating approximately 32 hours to ensure no regression in existing test cases while extending the OpenMP pass to handle parallel
and do
constructs in both nested and combined forms.
Choice of Shifting Parallel Do Logic First
Initially, my plan was to implement the sections
construct in OpenMp pass, as it represents a non-loop-based paradigm distinct from the existing parallel do
implementation. However, upon further consideration, I realized that implementing sections
would anyway require support for the standalone parallel
construct, which is a fundamental component of many OpenMP directives, including parallel do
. So, rather than diving straight into sections
construct I chose to first shift the existing parallel do
logic from converting directly to a DoConcurrentLoop
node to the more adaptable OMPRegion
node and then lower down this new ASR node in the OpenMp pass.
This shift was crucial for several reasons. First, it ensures a unified representation of all OpenMP constructs under the OMPRegion
node, facilitating future extensions for constructs like sections
and teams
. Second, I needed to guarantee that all existing OpenMP test cases, which currently compile successfully, would not break during this transition. I anticipated that this would be a longer PR, as it required updating all reference test cases to reflect the new ASR node and ensuring that the extended OpenMP pass could handle the updated structure without regression.
Implementation Details and Results
In PR #7593, I made significant changes to de-couple the DoConcurrentLoop
logic from OMP Pragmas
. The key changes included:
- Removed the older logic in the AST -> ASR conversion that directly converted
parallel do
pragmas to aDoConcurrentLoop
node, replacing it with a conversion to anOMPRegion
node. - Extended the OpenMP pass to visit the
OMPRegion
node, handlingparallel do
constructs in a manner equivalent to the previousDoConcurrentLoop
approach with moderate changes. - Updated all ASR of all existing OpenMP test cases to reflect the new changes, verifying that each test case compiles successfully with the extended OpenMP pass, thereby ensuring no regression in existing functionality.
The results of this implementation are significant. First, the transition to the OMPRegion
node allows LFortran to handle standalone parallel
constructs, as well as parallel do
in both nested and combined forms, without disrupting existing test cases. For example, a standalone parallel
region can now be represented and executed correctly, also nested constructs like parallel
containing a do
loop, or combined constructs like parallel do
are too handled. Additionally, intermediate statements within parallel regions, which previously posed challenges (e.g., as seen in openmp_06.f90
), are now handled seamlessly, addressing issues like those in Issue #4147. This PR also lays the groundwork for implementing other constructs like sections
, for which placeholder functions have been added to demonstrate the intended approach.
Example: Parallel Do Construct with Reduction
To illustrate the new OMPRegion
implementation for the parallel do
construct, consider the following example program, openmp_08.f90
, which uses a parallel do
construct with a reduction
clause:
View Original Fortran Code
1subroutine increment_ctr(n, ctr)
2 use omp_lib
3 implicit none
4 integer, intent(in) :: n
5 real, intent(out) :: ctr
6
7 real :: local_ctr
8
9 integer :: i
10
11 local_ctr = 1
12 !$omp parallel private(i) reduction(*:local_ctr)
13 !$omp do
14 do i = 1, n
15 local_ctr = local_ctr * 1.5
16 end do
17 !$omp end do
18 !$omp end parallel
19
20 ctr = ctr + local_ctr
21end subroutine
22
23program openmp_08
24 use omp_lib
25 integer, parameter :: n = 10
26 real :: ctr
27 real :: res = 1.5**10
28
29 call omp_set_num_threads(8)
30 ctr = 0
31 call increment_ctr(n, ctr)
32 print *, ctr
33 if(abs((ctr - res)) > 0.0002 ) error stop
34end program
The OpenMP pass now converts this parallel do
construct into an OMPRegion
node and lowers it to use GOMP runtime calls. The equivalent lowered Fortran code, which explicitly uses GOMP calls to achieve the same parallelism, is shown below:
View Lowered Fortran Code Using GOMP Calls
1module thread_data_module
2 use, intrinsic :: iso_c_binding
3 implicit none
4 type, bind(C) :: thread_data
5 integer(c_int) :: i
6 real(c_float) :: local_ctr
7 integer(c_int) :: n
8 end type thread_data
9end module thread_data_module
10
11module omp_lib
12 use iso_c_binding
13 implicit none
14 interface
15 subroutine GOMP_parallel(fn, data, num_threads, flags) bind(C, name="GOMP_parallel")
16 import :: c_funptr, c_ptr, c_int
17 type(c_funptr), value :: fn
18 type(c_ptr), value :: data
19 integer(c_int), value :: num_threads
20 integer(c_int), value :: flags
21 end subroutine
22 subroutine GOMP_atomic_start() bind(C, name="GOMP_atomic_start")
23 end subroutine
24 subroutine GOMP_atomic_end() bind(C, name="GOMP_atomic_end")
25 end subroutine
26 subroutine GOMP_barrier() bind(C, name="GOMP_barrier")
27 end subroutine
28 function omp_get_max_threads() bind(C, name="omp_get_max_threads")
29 import :: c_int
30 integer(c_int) :: omp_get_max_threads
31 end function
32 function omp_get_thread_num() bind(C, name="omp_get_thread_num")
33 import :: c_int
34 integer(c_int) :: omp_get_thread_num
35 end function
36 subroutine omp_set_num_threads(n) bind(C, name="omp_set_num_threads")
37 import :: c_int
38 integer(c_int), value :: n
39 end subroutine
40 end interface
41end module omp_lib
42
43subroutine lcompilers_parallel_func(data) bind(C)
44 use thread_data_module
45 use omp_lib
46 implicit none
47 type(c_ptr), value :: data
48 type(thread_data), pointer :: tdata
49 integer :: i, n, num_threads, thread_num, start, end, chunk, leftovers, I
50 real :: local_ctr
51
52 call c_f_pointer(data, tdata)
53
54 ! Extract variables from thread_data
55 i = tdata%i
56 local_ctr = tdata%local_ctr
57 n = tdata%n
58
59 ! Thread partitioning logic
60 num_threads = omp_get_max_threads()
61 chunk = (1 * ((n - 1) + 1)) / num_threads
62 leftovers = mod(1 * ((n - 1) + 1), num_threads)
63 thread_num = omp_get_thread_num()
64
65 start = chunk * thread_num
66 if (thread_num < leftovers) then
67 start = start + thread_num
68 else
69 start = start + leftovers
70 end if
71
72 end = start + chunk
73 if (thread_num < leftovers) then
74 end = end + 1
75 end if
76
77 ! Initialize local reduction variable
78 local_ctr = 1.0
79
80 ! Loop over assigned iterations
81 do I = start + 1, end
82 i = mod(I, (n - 1) + 1) + 1
83 local_ctr = local_ctr * 1.5
84 end do
85
86 ! Atomic update for reduction
87 call GOMP_atomic_start()
88 tdata%local_ctr = tdata%local_ctr * local_ctr
89 call GOMP_atomic_end()
90
91 ! Barrier synchronization
92 call GOMP_barrier()
93end subroutine lcompilers_parallel_func
94
95subroutine increment_ctr(n, ctr)
96 use thread_data_module
97 use omp_lib
98 implicit none
99 integer, intent(in) :: n
100 real, intent(out) :: ctr
101
102 real :: local_ctr
103 integer :: i
104 type(thread_data), target :: data
105 type(c_ptr) :: tdata
106
107 local_ctr = 1.0
108
109 ! Populate thread_data structure
110 data%i = i
111 data%local_ctr = local_ctr
112 data%n = n
113
114 ! Convert to C pointer
115 tdata = c_loc(data)
116
117 ! Call GOMP_parallel to execute the parallel region
118 call GOMP_parallel(c_funloc(lcompilers_parallel_func), tdata, 0, 0)
119
120 ! Retrieve updated local_ctr after parallel execution
121 local_ctr = data%local_ctr
122
123 ! Update output variable
124 ctr = ctr + local_ctr
125end subroutine increment_ctr
126
127program openmp_08
128 use omp_lib
129 implicit none
130 integer, parameter :: n = 10
131 real :: ctr
132 real :: res = 1.5**10
133
134 call omp_set_num_threads(8)
135 ctr = 0.0
136 call increment_ctr(n, ctr)
137 print *, ctr
138 if (abs(ctr - res) > 0.0002) then
139 error stop
140 end if
141end program openmp_08
In the lowered code, the parallel do
construct is replaced with explicit GOMP runtime calls. A thread_data
structure is used to pass variables between the main function and the parallel function, ensuring thread-safe access. The GOMP_parallel
call invokes the parallel region by executing lcompilers_parallel_func
, which handles thread partitioning by calculating each thread’s start
and end
indices based on the number of threads and loop iterations. The reduction(*:local_ctr)
clause is implemented using GOMP_atomic_start
and GOMP_atomic_end
to atomically update the shared local_ctr
variable, and a GOMP_barrier
ensures all threads synchronize before proceeding.
Below are the relevant portions of the ASR generated by OpenMp pass, for the increment_ctr
subroutine, the parallel function lcompilers_parallel_func
, and the main program openmp_08
:
ASR for increment_ctr
Subroutine
View ASR for increment_ctr
1[(Assignment
2 (Var 2 local_ctr)
3 (Cast
4 (IntegerConstant 1 (Integer 4) Decimal)
5 IntegerToReal
6 (Real 4)
7 (RealConstant 1.000000 (Real 4))
8 )
9 ()
10 .false.
11)
12(Assignment
13 (StructInstanceMember (Var 2 data) 2 thread_data_i (Integer 4) ())
14 (Var 2 i)
15 ()
16 .false.
17)
18(Assignment
19 (StructInstanceMember (Var 2 data) 2 thread_data_local_ctr (Real 4) ())
20 (Var 2 local_ctr)
21 ()
22 .false.
23)
24(Assignment
25 (StructInstanceMember (Var 2 data) 2 thread_data_n (Integer 4) ())
26 (Var 2 n)
27 ()
28 .false.
29)
30(Assignment
31 (Var 2 tdata)
32 (PointerToCPtr
33 (GetPointer (Var 2 data) (Pointer (StructType [] [] .true. 2 thread_data)))
34 (CPtr)
35 ()
36 )
37 ()
38 .false.
39)
40(SubroutineCall
41 2 gomp_parallel
42 ()
43 [((PointerToCPtr
44 (GetPointer (Var 2 lcompilers_parallel_func)
45 (Pointer (FunctionType [(CPtr)] () BindC Interface () .false. .false. .false. .false. .false. [] .false.)))
46 (CPtr)
47 ()
48 ))
49 ((Var 2 tdata))
50 ((IntegerConstant 0 (Integer 4) Decimal))
51 ((IntegerConstant 0 (Integer 4) Decimal))]
52 ()
53)
54(Assignment
55 (Var 2 local_ctr)
56 (StructInstanceMember (Var 2 data) 2 thread_data_local_ctr (Real 4) ())
57 ()
58 .false.
59)
60(Assignment
61 (Var 2 ctr)
62 (RealBinOp (Var 2 ctr) Add (Var 2 local_ctr) (Real 4) ())
63 ()
64 .false.
65)]
ASR for lcompilers_parallel_func
View ASR for lcompilers_parallel_func
1[(CPtrToPointer
2 (Var 30 data)
3 (Var 30 tdata)
4 ()
5 ()
6)
7(Assignment
8 (Var 30 i)
9 (StructInstanceMember (Var 30 tdata) 30 thread_data_i (Integer 4) ())
10 ()
11 .false.
12)
13(Assignment
14 (Var 30 local_ctr)
15 (StructInstanceMember (Var 30 tdata) 30 thread_data_local_ctr (Real 4) ())
16 ()
17 .false.
18)
19(Assignment
20 (Var 30 n)
21 (StructInstanceMember (Var 30 tdata) 30 thread_data_n (Integer 4) ())
22 ()
23 .false.
24)
25(Assignment
26 (Var 30 num_threads)
27 (FunctionCall 30 omp_get_max_threads 30 omp_get_max_threads [] (Integer 4) () ())
28 ()
29 .false.
30)
31(Assignment
32 (Var 30 chunk)
33 (IntegerBinOp
34 (IntegerBinOp
35 (IntegerConstant 1 (Integer 4) Decimal)
36 Mul
37 (IntegerBinOp
38 (IntegerBinOp (Var 30 n) Sub (IntegerConstant 1 (Integer 4) Decimal) (Integer 4) ())
39 Add
40 (IntegerConstant 1 (Integer 4) Decimal)
41 (Integer 4)
42 ()
43 )
44 (Integer 4)
45 ()
46 )
47 Div
48 (Var 30 num_threads)
49 (Integer 4)
50 ()
51 )
52 ()
53 .false.
54)
55(Assignment
56 (Var 30 leftovers)
57 (IntrinsicElementalFunction
58 Mod
59 [(IntegerBinOp
60 (IntegerConstant 1 (Integer 4) Decimal)
61 Mul
62 (IntegerBinOp
63 (IntegerBinOp (Var 30 n) Sub (IntegerConstant 1 (Integer 4) Decimal) (Integer 4) ())
64 Add
65 (IntegerConstant 1 (Integer 4) Decimal)
66 (Integer 4)
67 ()
68 )
69 (Integer 4)
70 ())
71 (Var 30 num_threads)]
72 0
73 (Integer 4)
74 ()
75 )
76 ()
77 .false.
78)
79(Assignment
80 (Var 30 thread_num)
81 (FunctionCall 30 omp_get_thread_num 30 omp_get_thread_num [] (Integer 4) () ())
82 ()
83 .false.
84)
85(Assignment
86 (Var 30 start)
87 (IntegerBinOp (Var 30 chunk) Mul (Var 30 thread_num) (Integer 4) ())
88 ()
89 .false.
90)
91(If
92 (IntegerCompare (Var 30 thread_num) Lt (Var 30 leftovers) (Logical 4) ())
93 [(Assignment
94 (Var 30 start)
95 (IntegerBinOp (Var 30 start) Add (Var 30 thread_num) (Integer 4) ())
96 ()
97 .false.
98 )]
99 [(Assignment
100 (Var 30 start)
101 (IntegerBinOp (Var 30 start) Add (Var 30 leftovers) (Integer 4) ())
102 ()
103 .false.
104 )]
105)
106(Assignment
107 (Var 30 end)
108 (IntegerBinOp (Var 30 start) Add (Var 30 chunk) (Integer 4) ())
109 ()
110 .false.
111)
112(If
113 (IntegerCompare (Var 30 thread_num) Lt (Var 30 leftovers) (Logical 4) ())
114 [(Assignment
115 (Var 30 end)
116 (IntegerBinOp (Var 30 end) Add (IntegerConstant 1 (Integer 4) Decimal) (Integer 4) ())
117 ()
118 .false.
119 )]
120 []
121)
122(Assignment
123 (Var 30 local_ctr)
124 (RealConstant 1.000000 (Real 4))
125 ()
126 .false.
127)
128(DoLoop
129 ()
130 ((Var 30 I)
131 (IntegerBinOp (Var 30 start) Add (IntegerConstant 1 (Integer 4) Decimal) (Integer 4) ())
132 (Var 30 end)
133 ())
134 [(Assignment
135 (Var 30 i)
136 (IntegerBinOp
137 (IntrinsicElementalFunction
138 Mod
139 [(Var 30 I)
140 (IntegerBinOp
141 (IntegerBinOp (Var 30 n) Sub (IntegerConstant 1 (Integer 4) Decimal) (Integer 4) ())
142 Add
143 (IntegerConstant 1 (Integer 4) Decimal)
144 (Integer 4)
145 ()
146 )]
147 0
148 (Integer 4)
149 ()
150 )
151 Add
152 (IntegerConstant 1 (Integer 4) Decimal)
153 (Integer 4)
154 ()
155 )
156 ()
157 .false.
158 )
159 (Assignment
160 (Var 30 local_ctr)
161 (RealBinOp (Var 30 local_ctr) Mul (RealConstant 1.500000 (Real 4)) (Real 4) ())
162 ()
163 .false.
164 )]
165 []
166)
167(SubroutineCall 30 gomp_atomic_start () [] ())
168(Assignment
169 (StructInstanceMember (Var 30 tdata) 30 thread_data_local_ctr (Real 4) ())
170 (RealBinOp
171 (StructInstanceMember (Var 30 tdata) 30 thread_data_local_ctr (Real 4) ())
172 Mul
173 (Var 30 local_ctr)
174 (Real 4)
175 ()
176 )
177 ()
178 .false.
179)
180(SubroutineCall 30 gomp_atomic_end () [] ())
181(SubroutineCall 30 gomp_barrier () [] ())]
ASR for openmp_08
Program
View ASR for openmp_08
1[(SubroutineCall
2 27 omp_set_num_threads
3 ()
4 [((IntegerConstant 8 (Integer 4) Decimal))]
5 ()
6)
7(Assignment
8 (Var 27 ctr)
9 (Cast
10 (IntegerConstant 0 (Integer 4) Decimal)
11 IntegerToReal
12 (Real 4)
13 (RealConstant 0.000000 (Real 4))
14 )
15 ()
16 .false.
17)
18(SubroutineCall
19 1 increment_ctr
20 ()
21 [((Var 27 n)) ((Var 27 ctr))]
22 ()
23)
24(Print
25 (StringFormat () [(Var 27 ctr)] FormatFortran (String 1 () ExpressionLength CString) ())
26)
27(If
28 (RealCompare
29 (IntrinsicElementalFunction
30 Abs
31 [(RealBinOp (Var 27 ctr) Sub (Var 27 res) (Real 4) ())]
32 0
33 (Real 4)
34 ()
35 )
36 Gt
37 (RealConstant 0.000200 (Real 4))
38 (Logical 4)
39 ()
40 )
41 [(ErrorStop ())]
42 []
43)]
Next Steps
In Week 4, I plan to focus on the following tasks:
- While completing this PR I came across various cases where some clauses didn't work in parallel construct, hence it seems that we need to handle the clauses for each, nested as well as combined constructs, for which I will further investigate, report and fix MREs for the same
- Implement the
sections
construct using theOMPRegion
node, lowering it toGOMP_sections_start
andGOMP_sections_end
calls.
I would like to thank my mentors, Ondrej Certik, Pranav Goswami, and Gaurav Dhingra, for their critical reviews and guidance, which were instrumental in ensuring the success of this PR. I also thank the other contributors of LFortran for their support and help whenever needed.