Basic Parallel Loop Directive with DO Loops

What You’ll Learn Today

Think of the !$acc parallel loop directive as a magic wand that transforms your regular Fortran DO loop into a super-fast parallel loop that runs on hundreds of GPU cores simultaneously!

The Power of Parallel Loops

Imagine you’re organizing a school library:

Sequential Way (Regular DO loop):

Librarian checks book 1 → files it → checks book 2 → files it → ...
Time: 1000 books × 10 seconds = 10,000 seconds!

Parallel Way (OpenACC DO loop):

100 librarians each check 10 books simultaneously
Time: 10 books × 10 seconds = 100 seconds!

That’s 100× faster! This is exactly what happens when you add !$acc parallel loop to your DO loops.

Understanding Implicit Data Movement

When you use !$acc parallel loop, OpenACC automatically (implicitly) generates the instructions that handle the data transfers for you:

  1. Copies data TO the GPU (from CPU memory)
  2. Runs your loop (on GPU cores)
  3. Copies results BACK (to CPU memory)

CPU Memory      →      GPU Memory      →      CPU Memory
   [Data]       →       [Data]         →       [Results]
              Copy     Compute      Copy Back

It’s like sending your homework to a super-smart tutoring center that does it really fast and sends it back!
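
To make this concrete, here is a minimal sketch of what the implicit data movement is roughly equivalent to. Without data clauses, the compiler adds an implicit copy for the arrays it sees in the loop; the names a, b, and n below are hypothetical, and the exact implicit behavior depends on your compiler:

! What you write (implicit data movement):
!$acc parallel loop
do i = 1, n
  b(i) = a(i) * 2.0
end do
!$acc end parallel loop

! Roughly what the compiler treats it as (sketch with an explicit copy clause):
!$acc parallel loop copy(a, b)
do i = 1, n
  b(i) = a(i) * 2.0
end do
!$acc end parallel loop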

The Basic Syntax

!$acc parallel loop
do i = 1, n
  ! Your calculations here
end do
!$acc end parallel loop

Think of it as putting a “parallel wrapper” around your DO loop:

Regular DO loop:          Parallel DO loop:
                          !$acc parallel loop
do i = 1, n               do i = 1, n
  statement                 statement
end do                    end do
                          !$acc end parallel loop

Visual: How Your Loop Gets Parallelized

Original Loop:              Parallel Execution:
do i = 1, 8                GPU Core 1: i = 1
  array(i) = i * 2         GPU Core 2: i = 2  
end do                     GPU Core 3: i = 3
                           GPU Core 4: i = 4
                           GPU Core 5: i = 5
                           GPU Core 6: i = 6
                           GPU Core 7: i = 7
                           GPU Core 8: i = 8

                           All running at the same time!

What Operations Work Well?

✅ Perfect for Parallel Loops:

  • Array element operations: array(i) = array(i) * 2
  • Mathematical calculations: result(i) = sin(x(i)) + cos(y(i))
  • Independent transformations: output(i) = input(i) + constant

❌ Not Good for Parallel (yet):

  • Loops with dependencies: array(i) = array(i-1) + value
  • Accumulating sums: total = total + array(i)
  • Sequential algorithms: sorting, searching (see the sketch below)
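
To make the distinction concrete, here is a minimal sketch (array and variable names are hypothetical). The first loop is safe to parallelize because each iteration writes only its own element; the second accumulates into a single variable, so running its iterations simultaneously with this directive alone would give wrong results:

! Safe: every iteration works on its own element
!$acc parallel loop
do i = 1, n
  output(i) = input(i) + offset
end do
!$acc end parallel loop

! Not safe as written: every iteration updates the same variable 'total'.
! Leave loops like this sequential for now.
do i = 1, n
  total = total + input(i)
end do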

Memory Access Patterns

For best performance, access arrays in a simple, regular pattern where consecutive iterations read and write consecutive elements:

! GOOD: Simple sequential access
do i = 1, n
  a(i) = b(i) + c(i)
end do

! ALSO GOOD: Same index for all arrays
do i = 1, n  
  result(i) = sqrt(x(i) * x(i) + y(i) * y(i))
end do
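
For contrast, here is a sketch of access patterns that tend to run slower on a GPU, because neighboring iterations no longer touch neighboring memory locations (the names stride and idx are hypothetical, and the arrays are assumed to be dimensioned large enough):

! LESS IDEAL: strided access - iteration i touches element i*stride
do i = 1, n
  a(i * stride) = b(i * stride) + c(i)
end do

! LESS IDEAL: indirect access through an index array
do i = 1, n
  a(idx(i)) = b(idx(i)) * 2.0
end do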

Understanding Loop Independence

Each iteration of your loop should be like a separate, independent task:

Independent (Good):
Iteration 1: Calculate result(1) using input(1)
Iteration 2: Calculate result(2) using input(2)  
Iteration 3: Calculate result(3) using input(3)
→ No iteration depends on any other!

Dependent (Avoid for now):
Iteration 1: Calculate result(1) 
Iteration 2: Calculate result(2) using result(1)
Iteration 3: Calculate result(3) using result(2)
→ Each iteration needs the previous one! (See the Fortran sketch below.)
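
As a concrete Fortran sketch of the dependent pattern (hypothetical array names), each iteration reads the value written by the previous one, so the iterations cannot all run at the same time and the loop should stay sequential for now:

! Carried dependence: result(i) needs result(i-1) from the previous iteration
result(1) = input(1)
do i = 2, n
  result(i) = result(i-1) + input(i)
end do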

Key Concepts to Remember

  • Parallel Loop: All iterations run simultaneously on different GPU cores
  • Implicit Data Movement: OpenACC automatically handles memory transfers
  • Independence: Each loop iteration should work on different data
  • DO Loop: Fortran’s loop construct that gets parallelized
  • GPU Cores: Think of them as many workers doing tasks simultaneously

Example Code

Let us consider the following OpenACC code –

program vector_scaling
  ! Vector scaling: multiply every element by a constant
  ! Perfect example of independent parallel operations
  
  implicit none
  
  integer, parameter :: n = 100000
  real :: vector(n)
  real :: scaled_vector(n)
  real :: scale_factor = 3.5
  integer :: i
  real :: start_time, end_time
  
  ! Initialize the vector
  write(*,*) 'Initializing vector with sample data...'
  do i = 1, n
    vector(i) = real(i) * 0.1  ! Simple pattern: 0.1, 0.2, 0.3, ...
  end do
  
  ! Sequential version for comparison
  write(*,*) 'Running sequential vector scaling...'
  call cpu_time(start_time)
  do i = 1, n
    scaled_vector(i) = vector(i) * scale_factor
  end do
  call cpu_time(end_time)
  write(*,'(A,F8.4,A)') 'Sequential time: ', end_time - start_time, ' seconds'
  
  ! Reset for parallel version
  scaled_vector = 0.0
  
  ! Parallel version using OpenACC
  write(*,*) 'Running parallel vector scaling...'
  call cpu_time(start_time)
  !$acc parallel loop
  do i = 1, n
    scaled_vector(i) = vector(i) * scale_factor
  end do
  !$acc end parallel loop
  call cpu_time(end_time)
  write(*,'(A,F8.4,A)') 'Parallel time: ', end_time - start_time, ' seconds'
  
  ! Verify results
  write(*,*) ''
  write(*,*) 'Verification (first 10 elements):'
  write(*,'(A,F6.2)') 'Scale factor: ', scale_factor
  do i = 1, 10
    write(*,'(A,I0,A,F8.2,A,F8.2,A,F8.2)') 'Element ', i, ': ', &
           vector(i), ' × ', scale_factor, ' = ', scaled_vector(i)
  end do
  
  write(*,*) ''
  write(*,*) 'Perfect! Every element was scaled independently and in parallel!'
  
end program vector_scaling

To compile this code –

nvfortran -acc -o vector_scaling vector_scaling.f90
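
Optionally, you can ask the compiler to report how it handled the loop. With nvfortran, the -Minfo=accel flag prints which loops were offloaded to the GPU and what implicit data movement was generated:

nvfortran -acc -Minfo=accel -o vector_scaling vector_scaling.f90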

To execute this code –

./vector_scaling

Sample output –

 Initializing vector with sample data...
 Running sequential vector scaling...
Sequential time:   0.0003 seconds
 Running parallel vector scaling...
Parallel time:   0.0897 seconds
 
 Verification (first 10 elements):
Scale factor:   3.50
Element 1:     0.10 ×     3.50 =     0.35
Element 2:     0.20 ×     3.50 =     0.70
Element 3:     0.30 ×     3.50 =     1.05
Element 4:     0.40 ×     3.50 =     1.40
Element 5:     0.50 ×     3.50 =     1.75
Element 6:     0.60 ×     3.50 =     2.10
Element 7:     0.70 ×     3.50 =     2.45
Element 8:     0.80 ×     3.50 =     2.80
Element 9:     0.90 ×     3.50 =     3.15
Element 10:     1.00 ×     3.50 =     3.50
 
 Perfect! Every element was scaled independently and in parallel!

Notice that in this sample run the parallel time is actually larger than the sequential time. For a loop this small, the measurement is dominated by the implicit CPU-to-GPU and GPU-to-CPU data transfers and the kernel launch overhead rather than by the computation itself; the GPU pays off once the work done per transfer is much larger.

Let us consider another OpenACC code –

program implicit_data_movement
  ! This program demonstrates how OpenACC automatically handles data transfers
  
  implicit none
  
  integer, parameter :: n = 25
  real :: input_data(n)
  real :: output_data(n)
  integer :: i
  
  write(*,*) 'Demonstrating implicit data movement with OpenACC'
  write(*,*) '================================================='
  write(*,*) ''
  
  ! Step 1: Initialize data on CPU
  write(*,*) 'Step 1: Initializing data on CPU...'
  do i = 1, n
    input_data(i) = real(i) * 2.5
  end do
  write(*,*) 'CPU data ready!'
  
  write(*,*) ''
  write(*,*) 'Step 2: Running parallel loop...'
  write(*,*) '(OpenACC will automatically handle data movement)'
  
  ! Step 2: OpenACC automatically:
  ! - Copies input_data from CPU to GPU
  ! - Runs the parallel loop on GPU  
  ! - Copies output_data from GPU back to CPU
  !$acc parallel loop
  do i = 1, n
    output_data(i) = sqrt(input_data(i)) + 10.0
  end do
  !$acc end parallel loop
  
  write(*,*) 'Parallel computation complete!'
  write(*,*) ''
  
  ! Step 3: Results are now available on CPU automatically
  write(*,*) 'Step 3: Results automatically available on CPU:'
  write(*,*) ''
  write(*,*) 'What happened behind the scenes:'
  write(*,*) '1. input_data copied: CPU → GPU'
  write(*,*) '2. Parallel loop executed on GPU'  
  write(*,*) '3. output_data copied: GPU → CPU'
  write(*,*) ''
  
  ! Show some results
  write(*,*) 'Sample results:'
  do i = 1, 10, 2
    write(*,'(A,I0,A,F6.2,A,I0,A,F8.3)') 'input(', i, ')=', input_data(i), &
           ' → output(', i, ')=', output_data(i)
  end do
  
  write(*,*) ''
  write(*,*) 'Key point: You did not need to write ANY data movement code!'
  write(*,*) 'OpenACC handled all the GPU memory transfers automatically.'
  
end program implicit_data_movement

To compile this code –

nvfortran -acc -o implicit_data_movement implicit_data_movement.f90

To execute this code –

./implicit_data_movement

Sample output –

 Demonstrating implicit data movement with OpenACC
 =================================================
 
 Step 1: Initializing data on CPU...
 CPU data ready!
 
 Step 2: Running parallel loop...
 (OpenACC will automatically handle data movement)
 Parallel computation complete!
 
 Step 3: Results automatically available on CPU:
 
 What happened behind the scenes:
 1. input_data copied: CPU → GPU
 2. Parallel loop executed on GPU
 3. output_data copied: GPU → CPU
 
 Sample results:
input(1)=  2.50 → output(1)=  11.581
input(3)=  7.50 → output(3)=  12.739
input(5)= 12.50 → output(5)=  13.536
input(7)= 17.50 → output(7)=  14.183
input(9)= 22.50 → output(9)=  14.743
 
 Key point: You did not need to write ANY data movement code!
 OpenACC handled all the GPU memory transfers automatically.
