Basic Parallel Loop Directive with DO Loops

What You’ll Learn Today

Think of the !$acc parallel loop directive as a magic wand that transforms your regular Fortran DO loop into a super-fast parallel loop that runs on hundreds of GPU cores simultaneously!

The Power of Parallel Loops

Imagine you’re organizing a school library:

Sequential Way (Regular DO loop):

Librarian checks book 1 → files it → checks book 2 → files it → ...
Time: 1000 books × 10 seconds = 10,000 seconds!

Parallel Way (OpenACC DO loop):

100 librarians each check 10 books simultaneously
Time: 10 books × 10 seconds = 100 seconds!

That’s 100× faster! This is exactly what happens when you add !$acc parallel loop to your DO loops.

Understanding Implicit Data Movement

When you use !$acc parallel loop, OpenACC automatically (implicitly) generates the instructions to perform the data transfers:

  1. Copies data TO the GPU (from CPU memory)
  2. Runs your loop (on GPU cores)
  3. Copies results BACK (to CPU memory)
CPU Memory      →      GPU Memory      →      CPU Memory
   [Data]       →       [Data]         →       [Results]
              Copy     Compute      Copy Back

It’s like sending your homework to a super-smart tutoring center that does it really fast and sends it back!
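These three implicit steps can also be spelled out by hand with OpenACC data clauses. Here is a minimal sketch (the array names input and output are just for illustration) showing the explicit version of what the compiler infers for a bare !$acc parallel loop:

```fortran
program explicit_transfers
  ! The same three steps, written explicitly with data clauses.
  ! With a bare !$acc parallel loop the compiler infers these for you.
  implicit none
  integer, parameter :: n = 8
  real :: input(n), output(n)
  integer :: i

  input = [(real(i), i = 1, n)]

  ! copyin  = step 1 (CPU -> GPU before the loop)
  ! copyout = step 3 (GPU -> CPU after the loop)
  !$acc parallel loop copyin(input) copyout(output)
  do i = 1, n
    output(i) = input(i) * 2.0   ! step 2: runs on the GPU
  end do
  !$acc end parallel loop

  print *, output(1), output(n)   ! output(1) = 2.0, output(n) = 16.0
end program explicit_transfers
```

Writing the clauses yourself becomes important later, when you want to keep data on the GPU across several loops instead of transferring it every time.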

The Basic Syntax

!$acc parallel loop
do i = 1, n
  ! Your calculations here
end do
!$acc end parallel loop

Think of it as putting a “parallel wrapper” around your DO loop:

Regular DO loop:          Parallel DO loop:
                          !$acc parallel loop
do i = 1, n               do i = 1, n
  statement                 statement  
end do                    end do
                         !$acc end parallel loop

Visual: How Your Loop Gets Parallelized

Original Loop:              Parallel Execution:
do i = 1, 8                GPU Core 1: i = 1
  array(i) = i * 2         GPU Core 2: i = 2  
end do                     GPU Core 3: i = 3
                           GPU Core 4: i = 4
                           GPU Core 5: i = 5
                           GPU Core 6: i = 6
                           GPU Core 7: i = 7
                           GPU Core 8: i = 8

                           All running at the same time!

What Operations Work Well?

✅ Perfect for Parallel Loops:

  • Array element operations: array(i) = array(i) * 2
  • Mathematical calculations: result(i) = sin(x(i)) + cos(y(i))
  • Independent transformations: output(i) = input(i) + constant

❌ Not Good for Parallel (yet):

  • Loops with dependencies: array(i) = array(i-1) + value
  • Accumulating sums: total = total + array(i)
  • Sequential algorithms: sorting, searching
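One of these limitations is easier to lift than it looks: an accumulating sum can be parallelized safely with a reduction clause, which later tutorials cover in detail. A minimal preview:

```fortran
program sum_reduction
  ! Preview: a reduction clause makes an accumulating sum safe to parallelize.
  ! Each GPU worker keeps a private partial sum; OpenACC combines them at the end.
  implicit none
  integer, parameter :: n = 1000
  real :: array(n), total
  integer :: i

  array = 1.0
  total = 0.0

  !$acc parallel loop reduction(+:total)
  do i = 1, n
    total = total + array(i)
  end do
  !$acc end parallel loop

  print *, total   ! 1000.0
end program sum_reduction
```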

Memory Access Patterns

For best performance, access arrays in a simple pattern:

! GOOD: Simple sequential access
do i = 1, n
  a(i) = b(i) + c(i)
end do

! ALSO GOOD: Same index for all arrays
do i = 1, n  
  result(i) = sqrt(x(i) * x(i) + y(i) * y(i))
end do
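For contrast, here are two patterns that still run correctly but are usually slower on a GPU (the index array idx is illustrative): neighboring iterations touch scattered memory locations, so the hardware cannot combine their loads into one wide memory transaction.

```fortran
! LESS IDEAL: indirect (gathered) access through an index array
do i = 1, n
  a(i) = b(idx(i)) + c(i)   ! idx(i) may jump anywhere in b
end do

! LESS IDEAL: strided access skips over memory
do i = 1, n
  a(i) = b(8 * i)           ! adjacent iterations read 8 elements apart
end do
```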

Understanding Loop Independence

Each iteration of your loop should be like a separate, independent task:

Independent (Good):
Iteration 1: Calculate result(1) using input(1)
Iteration 2: Calculate result(2) using input(2)  
Iteration 3: Calculate result(3) using input(3)
→ No iteration depends on any other!

Dependent (Avoid for now):
Iteration 1: Calculate result(1) 
Iteration 2: Calculate result(2) using result(1)
Iteration 3: Calculate result(3) using result(2)
→ Each iteration needs the previous one!
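In Fortran, the dependent case above looks like this running-sum (prefix) loop. Parallelizing it naively gives wrong answers, because iteration i reads the value that iteration i-1 just wrote:

```fortran
! DEPENDENT: do not mark this loop parallel as-is
result(1) = input(1)
do i = 2, n
  result(i) = result(i-1) + input(i)   ! needs the previous iteration's value
end do
```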

Key Concepts to Remember

  • Parallel Loop: All iterations run simultaneously on different GPU cores
  • Implicit Data Movement: OpenACC automatically handles memory transfers
  • Independence: Each loop iteration should work on different data
  • DO Loop: Fortran’s loop construct that gets parallelized
  • GPU Cores: Think of them as many workers doing tasks simultaneously

Example Code

Let us consider the following OpenACC code –

program vector_scaling
  ! Vector scaling: multiply every element by a constant
  ! Perfect example of independent parallel operations
  
  implicit none
  
  integer, parameter :: n = 100000
  real :: vector(n)
  real :: scaled_vector(n)
  real :: scale_factor = 3.5
  integer :: i
  real :: start_time, end_time
  
  ! Initialize the vector
  write(*,*) 'Initializing vector with sample data...'
  do i = 1, n
    vector(i) = real(i) * 0.1  ! Simple pattern: 0.1, 0.2, 0.3, ...
  end do
  
  ! Sequential version for comparison
  write(*,*) 'Running sequential vector scaling...'
  call cpu_time(start_time)
  do i = 1, n
    scaled_vector(i) = vector(i) * scale_factor
  end do
  call cpu_time(end_time)
  write(*,'(A,F8.4,A)') 'Sequential time: ', end_time - start_time, ' seconds'
  
  ! Reset for parallel version
  scaled_vector = 0.0
  
  ! Parallel version using OpenACC
  write(*,*) 'Running parallel vector scaling...'
  call cpu_time(start_time)
  !$acc parallel loop
  do i = 1, n
    scaled_vector(i) = vector(i) * scale_factor
  end do
  !$acc end parallel loop
  call cpu_time(end_time)
  write(*,'(A,F8.4,A)') 'Parallel time: ', end_time - start_time, ' seconds'
  
  ! Verify results
  write(*,*) ''
  write(*,*) 'Verification (first 10 elements):'
  write(*,'(A,F6.2)') 'Scale factor: ', scale_factor
  do i = 1, 10
    write(*,'(A,I0,A,F8.2,A,F8.2,A,F8.2)') 'Element ', i, ': ', &
           vector(i), ' × ', scale_factor, ' = ', scaled_vector(i)
  end do
  
  write(*,*) ''
  write(*,*) 'Perfect! Every element was scaled independently and in parallel!'
  
end program vector_scaling

To compile this code –

nvfortran -acc -o vector_scaling vector_scaling.f90

To execute this code –

./vector_scaling

Sample output –

 Initializing vector with sample data...
 Running sequential vector scaling...
Sequential time:   0.0003 seconds
 Running parallel vector scaling...
Parallel time:   0.0897 seconds
 
 Verification (first 10 elements):
Scale factor:   3.50
Element 1:     0.10 ×     3.50 =     0.35
Element 2:     0.20 ×     3.50 =     0.70
Element 3:     0.30 ×     3.50 =     1.05
Element 4:     0.40 ×     3.50 =     1.40
Element 5:     0.50 ×     3.50 =     1.75
Element 6:     0.60 ×     3.50 =     2.10
Element 7:     0.70 ×     3.50 =     2.45
Element 8:     0.80 ×     3.50 =     2.80
Element 9:     0.90 ×     3.50 =     3.15
Element 10:     1.00 ×     3.50 =     3.50
 
 Perfect! Every element was scaled independently and in parallel!

Notice that the parallel time (0.0897 s) is actually longer than the sequential time here. For a small array like this one, the implicit CPU↔GPU data transfers cost more than the parallel computation saves; the parallel version pulls ahead as the problem size grows.

Let us consider another OpenACC code –

program implicit_data_movement
  ! This program demonstrates how OpenACC automatically handles data transfers
  
  implicit none
  
  integer, parameter :: n = 25
  real :: input_data(n)
  real :: output_data(n)
  integer :: i
  
  write(*,*) 'Demonstrating implicit data movement with OpenACC'
  write(*,*) '================================================='
  write(*,*) ''
  
  ! Step 1: Initialize data on CPU
  write(*,*) 'Step 1: Initializing data on CPU...'
  do i = 1, n
    input_data(i) = real(i) * 2.5
  end do
  write(*,*) 'CPU data ready!'
  
  write(*,*) ''
  write(*,*) 'Step 2: Running parallel loop...'
  write(*,*) '(OpenACC will automatically handle data movement)'
  
  ! Step 2: OpenACC automatically:
  ! - Copies input_data from CPU to GPU
  ! - Runs the parallel loop on GPU  
  ! - Copies output_data from GPU back to CPU
  !$acc parallel loop
  do i = 1, n
    output_data(i) = sqrt(input_data(i)) + 10.0
  end do
  !$acc end parallel loop
  
  write(*,*) 'Parallel computation complete!'
  write(*,*) ''
  
  ! Step 3: Results are now available on CPU automatically
  write(*,*) 'Step 3: Results automatically available on CPU:'
  write(*,*) ''
  write(*,*) 'What happened behind the scenes:'
  write(*,*) '1. input_data copied: CPU → GPU'
  write(*,*) '2. Parallel loop executed on GPU'  
  write(*,*) '3. output_data copied: GPU → CPU'
  write(*,*) ''
  
  ! Show some results
  write(*,*) 'Sample results:'
  do i = 1, 10, 2
    write(*,'(A,I0,A,F6.2,A,I0,A,F8.3)') 'input(', i, ')=', input_data(i), &
           ' → output(', i, ')=', output_data(i)
  end do
  
  write(*,*) ''
  write(*,*) 'Key point: You did not need to write ANY data movement code!'
  write(*,*) 'OpenACC handled all the GPU memory transfers automatically.'
  
end program implicit_data_movement

To compile this code –

nvfortran -acc -o implicit_data_movement implicit_data_movement.f90

To execute this code –

./implicit_data_movement

Sample output –

 Demonstrating implicit data movement with OpenACC
 =================================================
 
 Step 1: Initializing data on CPU...
 CPU data ready!
 
 Step 2: Running parallel loop...
 (OpenACC will automatically handle data movement)
 Parallel computation complete!
 
 Step 3: Results automatically available on CPU:
 
 What happened behind the scenes:
 1. input_data copied: CPU → GPU
 2. Parallel loop executed on GPU
 3. output_data copied: GPU → CPU
 
 Sample results:
input(1)=  2.50 → output(1)=  11.581
input(3)=  7.50 → output(3)=  12.739
input(5)= 12.50 → output(5)=  13.536
input(7)= 17.50 → output(7)=  14.183
input(9)= 22.50 → output(9)=  14.743
 
 Key point: You did not need to write ANY data movement code!
 OpenACC handled all the GPU memory transfers automatically.


Mandar Gurav

Parallel Programmer, Trainer and Mentor