What You’ll Learn Today
Think of the !$acc parallel loop directive as a magic wand that transforms your regular Fortran DO loop into a super-fast parallel loop that runs on hundreds of GPU cores simultaneously!
The Power of Parallel Loops
Imagine you’re organizing a school library:
Sequential Way (Regular DO loop):
Librarian checks book 1 → files it → checks book 2 → files it → ...
Time: 1000 books × 10 seconds = 10,000 seconds!
Parallel Way (OpenACC DO loop):
100 librarians each check 10 books simultaneously
Time: 10 books × 10 seconds = 100 seconds!
That’s 100× faster! This is exactly what happens when you add !$acc parallel loop
to your DO loops.
Understanding Implicit Data Movement
When you use !$acc parallel loop, OpenACC automatically (implicitly) generates the instructions for the necessary data transfers:
- Copies data TO the GPU (from CPU memory)
- Runs your loop (on GPU cores)
- Copies results BACK (to CPU memory)
CPU Memory  →  GPU Memory  →  CPU Memory
  [Data]         [Data]        [Results]
   Copy          Compute       Copy Back
It’s like sending your homework to a super-smart tutoring center that does it really fast and sends it back!
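These implicit transfers behave much like data clauses you could write yourself. As a preview (a sketch only; input and output are placeholder array names, and the compiler's actual choice of clauses may differ), the loop is treated roughly as if you had written:

```fortran
! Roughly what the compiler arranges for a loop that reads `input`
! and writes `output` (placeholder names, not from the examples below)
!$acc parallel loop copyin(input) copyout(output)
do i = 1, n
   output(i) = input(i) * 2.0
end do
!$acc end parallel loop
```

Writing these data clauses explicitly lets you take control of the transfers later, but for now the implicit behavior is all you need.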
The Basic Syntax
!$acc parallel loop
do i = 1, n
   ! Your calculations here
end do
!$acc end parallel loop
Think of it as putting a “parallel wrapper” around your DO loop:
Regular DO loop:           Parallel DO loop:
                           !$acc parallel loop
do i = 1, n                do i = 1, n
   statement                  statement
end do                     end do
                           !$acc end parallel loop
Visual: How Your Loop Gets Parallelized
Original Loop:             Parallel Execution:

do i = 1, 8                GPU Core 1: i = 1
   array(i) = i * 2        GPU Core 2: i = 2
end do                     GPU Core 3: i = 3
                           GPU Core 4: i = 4
                           GPU Core 5: i = 5
                           GPU Core 6: i = 6
                           GPU Core 7: i = 7
                           GPU Core 8: i = 8

All running at the same time!
What Operations Work Well?
✅ Perfect for Parallel Loops:
- Array element operations:
array(i) = array(i) * 2
- Mathematical calculations:
result(i) = sin(x(i)) + cos(y(i))
- Independent transformations:
output(i) = input(i) + constant
❌ Not Good for Parallel (yet):
- Loops with dependencies:
array(i) = array(i-1) + value
- Accumulating sums:
total = total + array(i)
- Sequential algorithms: sorting, searching
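One of the "not yet" cases has a well-known fix worth previewing: OpenACC's reduction clause lets many cores accumulate a sum safely. A minimal sketch (the reduction clause is standard OpenACC, covered properly later; array names are placeholders):

```fortran
! Each GPU core keeps a private partial sum;
! OpenACC combines the partial sums into total at the end
total = 0.0
!$acc parallel loop reduction(+:total)
do i = 1, n
   total = total + array(i)
end do
```

True loop-carried dependencies like array(i) = array(i-1) + value, on the other hand, genuinely resist this kind of parallelization.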
Memory Access Patterns
For best performance, consecutive loop iterations should touch consecutive array elements (unit-stride access):
! GOOD: Simple unit-stride access
do i = 1, n
   a(i) = b(i) + c(i)
end do
! ALSO GOOD: Same index for all arrays
do i = 1, n
   result(i) = sqrt(x(i) * x(i) + y(i) * y(i))
end do
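For contrast, here is a pattern that tends to run poorly on GPUs (an illustrative sketch; idx is a hypothetical index array, not part of the examples above). Neighbouring iterations touch scattered memory locations, so the GPU cannot combine their memory requests into one transaction:

```fortran
! LESS IDEAL: indirect (gather/scatter) access through an index array
do i = 1, n
   a(idx(i)) = b(i) * 2.0   ! idx(i) may point anywhere in a
end do
```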
Understanding Loop Independence
Each iteration of your loop should be like a separate, independent task:
Independent (Good):
Iteration 1: Calculate result(1) using input(1)
Iteration 2: Calculate result(2) using input(2)
Iteration 3: Calculate result(3) using input(3)
→ No iteration depends on any other!
Dependent (Avoid for now):
Iteration 1: Calculate result(1)
Iteration 2: Calculate result(2) using result(1)
Iteration 3: Calculate result(3) using result(2)
→ Each iteration needs the previous one!
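In Fortran terms, the two situations above look like this (illustrative snippets using placeholder arrays):

```fortran
! Independent: every iteration uses only its own elements - safe to parallelize
!$acc parallel loop
do i = 1, n
   result(i) = input(i) * 2.0
end do
!$acc end parallel loop

! Dependent: iteration i reads result(i-1) - leave this one sequential for now
do i = 2, n
   result(i) = result(i-1) + input(i)
end do
```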
Key Concepts to Remember
- Parallel Loop: All iterations run simultaneously on different GPU cores
- Implicit Data Movement: OpenACC automatically handles memory transfers
- Independence: Each loop iteration should work on different data
- DO Loop: Fortran’s loop construct that gets parallelized
- GPU Cores: Think of them as many workers doing tasks simultaneously
Example Code
Let us consider the following OpenACC code –
program vector_scaling
   ! Vector scaling: multiply every element by a constant
   ! Perfect example of independent parallel operations
   implicit none
   integer, parameter :: n = 100000
   real :: vector(n)
   real :: scaled_vector(n)
   real :: scale_factor = 3.5
   integer :: i
   real :: start_time, end_time

   ! Initialize the vector
   write(*,*) 'Initializing vector with sample data...'
   do i = 1, n
      vector(i) = real(i) * 0.1   ! Simple pattern: 0.1, 0.2, 0.3, ...
   end do

   ! Sequential version for comparison
   write(*,*) 'Running sequential vector scaling...'
   call cpu_time(start_time)
   do i = 1, n
      scaled_vector(i) = vector(i) * scale_factor
   end do
   call cpu_time(end_time)
   write(*,'(A,F8.4,A)') 'Sequential time: ', end_time - start_time, ' seconds'

   ! Reset for parallel version
   scaled_vector = 0.0

   ! Parallel version using OpenACC
   write(*,*) 'Running parallel vector scaling...'
   call cpu_time(start_time)
   !$acc parallel loop
   do i = 1, n
      scaled_vector(i) = vector(i) * scale_factor
   end do
   !$acc end parallel loop
   call cpu_time(end_time)
   write(*,'(A,F8.4,A)') 'Parallel time: ', end_time - start_time, ' seconds'

   ! Verify results
   write(*,*) ''
   write(*,*) 'Verification (first 10 elements):'
   write(*,'(A,F6.2)') 'Scale factor: ', scale_factor
   do i = 1, 10
      write(*,'(A,I0,A,F8.2,A,F8.2,A,F8.2)') 'Element ', i, ': ', &
         vector(i), ' × ', scale_factor, ' = ', scaled_vector(i)
   end do
   write(*,*) ''
   write(*,*) 'Perfect! Every element was scaled independently and in parallel!'
end program vector_scaling
To compile this code –
nvfortran -acc -o vector_scaling vector_scaling.f90
To execute this code –
./vector_scaling
Sample output –
Initializing vector with sample data...
Running sequential vector scaling...
Sequential time: 0.0003 seconds
Running parallel vector scaling...
Parallel time: 0.0897 seconds
Verification (first 10 elements):
Scale factor: 3.50
Element 1: 0.10 × 3.50 = 0.35
Element 2: 0.20 × 3.50 = 0.70
Element 3: 0.30 × 3.50 = 1.05
Element 4: 0.40 × 3.50 = 1.40
Element 5: 0.50 × 3.50 = 1.75
Element 6: 0.60 × 3.50 = 2.10
Element 7: 0.70 × 3.50 = 2.45
Element 8: 0.80 × 3.50 = 2.80
Element 9: 0.90 × 3.50 = 3.15
Element 10: 1.00 × 3.50 = 3.50
Perfect! Every element was scaled independently and in parallel!
Notice that the parallel run is actually slower here: with only 100,000 simple multiplications, the time is dominated by copying the arrays to and from the GPU and launching the kernel, not by the computation itself. That overhead is roughly fixed, so the GPU's 100-librarian advantage only shows up once the loop does enough work to amortize it.
Let us consider another OpenACC code –
program implicit_data_movement
   ! This program demonstrates how OpenACC automatically handles data transfers
   implicit none
   integer, parameter :: n = 25
   real :: input_data(n)
   real :: output_data(n)
   integer :: i

   write(*,*) 'Demonstrating implicit data movement with OpenACC'
   write(*,*) '================================================='
   write(*,*) ''

   ! Step 1: Initialize data on CPU
   write(*,*) 'Step 1: Initializing data on CPU...'
   do i = 1, n
      input_data(i) = real(i) * 2.5
   end do
   write(*,*) 'CPU data ready!'
   write(*,*) ''

   write(*,*) 'Step 2: Running parallel loop...'
   write(*,*) '(OpenACC will automatically handle data movement)'

   ! Step 2: OpenACC automatically:
   !  - Copies input_data from CPU to GPU
   !  - Runs the parallel loop on GPU
   !  - Copies output_data from GPU back to CPU
   !$acc parallel loop
   do i = 1, n
      output_data(i) = sqrt(input_data(i)) + 10.0
   end do
   !$acc end parallel loop

   write(*,*) 'Parallel computation complete!'
   write(*,*) ''

   ! Step 3: Results are now available on CPU automatically
   write(*,*) 'Step 3: Results automatically available on CPU:'
   write(*,*) ''
   write(*,*) 'What happened behind the scenes:'
   write(*,*) '1. input_data copied: CPU → GPU'
   write(*,*) '2. Parallel loop executed on GPU'
   write(*,*) '3. output_data copied: GPU → CPU'
   write(*,*) ''

   ! Show some results
   write(*,*) 'Sample results:'
   do i = 1, 10, 2
      write(*,'(A,I0,A,F6.2,A,I0,A,F8.3)') 'input(', i, ')=', input_data(i), &
         ' → output(', i, ')=', output_data(i)
   end do
   write(*,*) ''
   write(*,*) 'Key point: You did not need to write ANY data movement code!'
   write(*,*) 'OpenACC handled all the GPU memory transfers automatically.'
end program implicit_data_movement
To compile this code –
nvfortran -acc -o implicit_data_movement implicit_data_movement.f90
To execute this code –
./implicit_data_movement
Sample output –
Demonstrating implicit data movement with OpenACC
=================================================
Step 1: Initializing data on CPU...
CPU data ready!
Step 2: Running parallel loop...
(OpenACC will automatically handle data movement)
Parallel computation complete!
Step 3: Results automatically available on CPU:
What happened behind the scenes:
1. input_data copied: CPU → GPU
2. Parallel loop executed on GPU
3. output_data copied: GPU → CPU
Sample results:
input(1)= 2.50 → output(1)= 11.581
input(3)= 7.50 → output(3)= 12.739
input(5)= 12.50 → output(5)= 13.536
input(7)= 17.50 → output(7)= 14.183
input(9)= 22.50 → output(9)= 14.743
Key point: You did not need to write ANY data movement code!
OpenACC handled all the GPU memory transfers automatically.