What You’ll Learn Today
Imagine a construction site where you need to organize workers efficiently. You have teams (gangs), workers within each team, and each worker can handle multiple small tasks (vectors). That’s exactly how GPU parallelism works with OpenACC!
The Three-Level Parallelism Hierarchy
GPUs organize work in three levels, like a well-structured company:
GANG Level (Teams):       [Team 1]    [Team 2]    [Team 3]    [Team 4]
                              ↓           ↓           ↓           ↓
WORKER Level (Employees): [W1 W2]     [W1 W2]     [W1 W2]     [W1 W2]
                              ↓           ↓           ↓           ↓
VECTOR Level (Tasks):     [T T T T]   [T T T T]   [T T T T]   [T T T T]
Each level handles a different amount of work:
- Gangs: Big chunks of work (like different building floors)
- Workers: Medium chunks within each gang (like rooms on a floor)
- Vectors: Small chunks within each worker (like tiles in a room)
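You can also spell the three levels out yourself by putting the gang, worker, and vector clauses on the loops of a nested loop nest. The fragment below is only a sketch: the array field, the sizes nx, ny, nz, and the loop indices are assumed to be declared elsewhere.

!$acc parallel loop gang        ! gangs split the outermost loop
do k = 1, nz
   !$acc loop worker            ! workers split the middle loop within each gang
   do j = 1, ny
      !$acc loop vector         ! vector lanes split the innermost loop
      do i = 1, nx
         field(i,j,k) = field(i,j,k) * 2.0
      end do
   end do
end do
!$acc end parallel loop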
Understanding GPU Hardware
Think of your GPU like a large office building:
GPU Building:
┌─────────────────────────────────────────┐
│   Gang 1      Gang 2      Gang 3        │  ← Each floor is a Gang
│ ┌─────────┐ ┌─────────┐ ┌─────────┐     │
│ │ W1 │ W2 │ │ W1 │ W2 │ │ W1 │ W2 │     │  ← Workers in each Gang
│ │████│████│ │████│████│ │████│████│     │  ← Vector units (cores)
│ └─────────┘ └─────────┘ └─────────┘     │
└─────────────────────────────────────────┘
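If you want to peek at the building itself, the OpenACC runtime library (the openacc module) can at least tell you how many accelerator devices are available. Treat this as a rough sketch; the device-type constants your compiler provides may vary.

program query_devices
   use openacc            ! OpenACC runtime library routines
   implicit none
   integer :: ndev

   ! How many devices of the current default device type are present?
   ndev = acc_get_num_devices(acc_get_device_type())
   write(*,'(A,I0)') 'Accelerator devices found: ', ndev
end program query_devices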
Controlling Parallelism
You can tell OpenACC exactly how to organize the work:
!$acc parallel loop num_gangs(4) num_workers(8) vector_length(32)
do i = 1, n
   array(i) = array(i) * 2
end do
!$acc end parallel loop
This creates:
- 4 gangs (teams working independently)
- 8 workers per gang (32 total workers)
- 32 vector elements per worker (1024 total parallel operations!)
Visual: How Work Gets Distributed
Original Loop: do i = 1, 1024
Gang Distribution (4 gangs):
Gang 1: i = 1 to 256
Gang 2: i = 257 to 512
Gang 3: i = 513 to 768
Gang 4: i = 769 to 1024
Worker Distribution (8 workers per gang):
Gang 1, Worker 1: i = 1 to 32
Gang 1, Worker 2: i = 33 to 64
...
Vector Distribution (32 elements per worker):
Gang 1, Worker 1: processes i=1,2,3,...,32 simultaneously!
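If you like, you can mimic this bookkeeping on the CPU to see the chunk boundaries for yourself. The small program below simply reproduces the block-style split pictured above for 1024 iterations; it is only an illustration, since an OpenACC compiler is free to schedule iterations differently (for example, in an interleaved fashion).

program show_distribution
   implicit none
   integer, parameter :: n = 1024, gangs = 4, workers = 8
   integer :: g, w, chunk, wchunk, gstart, wstart

   chunk  = n / gangs          ! 256 iterations per gang
   wchunk = chunk / workers    ! 32 iterations per worker
   do g = 1, gangs
      gstart = (g - 1) * chunk + 1
      write(*,'(A,I0,A,I0,A,I0)') 'Gang ', g, ': i = ', gstart, ' to ', gstart + chunk - 1
      do w = 1, min(workers, 2)   ! print only the first two workers per gang
         wstart = gstart + (w - 1) * wchunk
         write(*,'(A,I0,A,I0,A,I0)') '  Worker ', w, ': i = ', wstart, ' to ', wstart + wchunk - 1
      end do
   end do
end program show_distribution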
When to Use Each Level
Use more GANGS when:
- You have lots of independent work
- Your problem size is very large
- You want maximum parallelism
Use more WORKERS when:
- You need coordination within gangs
- You’re working with shared memory
- Your algorithm has some dependencies
Use larger VECTOR_LENGTH when:
- You’re doing simple mathematical operations
- Your data access is very regular
- You want fine-grained parallelism
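For example, a simple element-wise update with regular, unit-stride access is a natural fit for a longer vector length. A sketch (x, y, n, and i are assumed to be declared elsewhere, and 128 is only a starting point to experiment with):

!$acc parallel loop gang vector vector_length(128)
do i = 1, n
   y(i) = 2.0 * x(i) + y(i)   ! simple arithmetic, regular unit-stride access
end do
!$acc end parallel loop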
Default vs Custom Settings
Default (Let OpenACC decide):
!$acc parallel loop ! OpenACC chooses best settings
Custom (You decide):
!$acc parallel loop num_gangs(8) num_workers(4) vector_length(128)
Think of default as “automatic transmission” and custom as “manual transmission” – both work, but manual gives you more control!
Matrix Operations Example
For a 2D matrix, you might organize work like this:
Matrix(100,100):
Option 1: Gangs handle rows
!$acc parallel loop num_gangs(100)
do i = 1, 100          ! Each gang handles one row
   do j = 1, 100       ! Sequential within each row
      matrix(i,j) = ...
   end do
end do
Option 2: Nested parallelism
!$acc parallel loop gang num_gangs(10)
do i = 1, 100          ! 10 gangs, each handles 10 rows
   !$acc loop worker
   do j = 1, 100       ! Workers handle columns
      matrix(i,j) = ...
   end do
end do
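A third variant (shown here only as a sketch, not a recommendation) uses all three levels at once: gangs and workers share the rows while vector lanes sweep along each row. Whether it beats the simpler forms depends on the matrix size and the GPU, so measure it.

!$acc parallel loop gang worker num_gangs(10) num_workers(10)
do i = 1, 100          ! rows shared by 10 gangs × 10 workers
   !$acc loop vector
   do j = 1, 100       ! vector lanes sweep along each row
      matrix(i,j) = ...
   end do
end do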
Performance Tuning Tips
Start Simple:
!$acc parallel loop ! Let OpenACC decide first
Then Experiment:
!$acc parallel loop num_gangs(4) ! Try different gang counts
!$acc parallel loop num_gangs(8)
!$acc parallel loop num_gangs(16)
Measure Performance:
- Time your code with different settings
- Use what works best for YOUR problem
- Different GPUs prefer different settings
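For the timing itself, one simple option is the system_clock intrinsic, which measures wall-clock time. A minimal sketch around a generic loop (n, i, and array are assumed to be declared elsewhere):

integer :: count_start, count_end, count_rate
real    :: elapsed

call system_clock(count_start, count_rate)
!$acc parallel loop
do i = 1, n
   array(i) = array(i) * 2.0
end do
!$acc end parallel loop
call system_clock(count_end)

elapsed = real(count_end - count_start) / real(count_rate)
write(*,'(A,F8.4,A)') 'Elapsed: ', elapsed, ' seconds'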
Key Terms to Remember
- Gang: Independent team working on part of the problem
- Worker: Member of a gang, can share some resources
- Vector: Fine-grained parallelism within a worker
- num_gangs: How many independent teams to create
- num_workers: How many workers in each team
- vector_length: How many elements each worker processes simultaneously
Example Code
Let us consider the following OpenACC code –
program matrix_operation
   ! Demonstrates different parallelism levels with matrix operations
   implicit none
   integer, parameter :: rows = 1000, cols = 1000
   real :: matrix_a(rows, cols)
   real :: matrix_b(rows, cols)
   real :: result_default(rows, cols)
   real :: result_custom(rows, cols)
   integer :: i, j
   real :: start_time, end_time

   write(*,*) 'Matrix Operation with Different Parallelism Levels'
   write(*,*) '=================================================='
   write(*,'(A,I0,A,I0)') 'Matrix size: ', rows, ' × ', cols
   write(*,*) ''

   ! Initialize matrices
   write(*,*) 'Initializing matrices...'
   do i = 1, rows
      do j = 1, cols
         matrix_a(i,j) = real(i) + real(j) * 0.1
         matrix_b(i,j) = real(i) * 0.5 + real(j)
      end do
   end do

   ! Method 1: Default parallelism (let OpenACC decide)
   write(*,*) 'Running with DEFAULT parallelism...'
   call cpu_time(start_time)
   !$acc parallel loop
   do i = 1, rows
      do j = 1, cols
         result_default(i,j) = matrix_a(i,j) + matrix_b(i,j) * 2.0
      end do
   end do
   !$acc end parallel loop
   call cpu_time(end_time)
   write(*,'(A,F8.4,A)') 'Default time: ', end_time - start_time, ' seconds'

   ! Method 2: Custom parallelism (you decide)
   write(*,*) 'Running with CUSTOM parallelism...'
   call cpu_time(start_time)
   !$acc parallel loop num_gangs(8) num_workers(4) vector_length(32)
   do i = 1, rows
      do j = 1, cols
         result_custom(i,j) = matrix_a(i,j) + matrix_b(i,j) * 2.0
      end do
   end do
   !$acc end parallel loop
   call cpu_time(end_time)
   write(*,'(A,F8.4,A)') 'Custom time: ', end_time - start_time, ' seconds'

   ! Verify results are the same
   write(*,*) ''
   write(*,*) 'Verifying results match...'
   do i = 1, min(5, rows)
      do j = 1, min(5, cols)
         if (abs(result_default(i,j) - result_custom(i,j)) > 1e-6) then
            write(*,*) 'ERROR: Results do not match!'
            stop
         end if
      end do
   end do
   write(*,*) '✓ Results match perfectly!'

   ! Show sample results
   write(*,*) ''
   write(*,*) 'Sample results:'
   do i = 1, 3
      write(*,'(A,I0,A,F8.2)') 'result(', i, ',1) = ', result_default(i,1)
   end do

   write(*,*) ''
   write(*,*) 'Parallelism breakdown:'
   write(*,*) '• 8 gangs = 8 independent teams'
   write(*,*) '• 4 workers per gang = 32 total workers'
   write(*,*) '• 32 vector length = 1024 parallel operations!'
   write(*,*) ''
end program matrix_operation
To compile this code –
nvfortran -acc -o matrix_operation matrix_operation.f90
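If you also want to see how the compiler mapped each loop onto gangs, workers, and vectors, add the -Minfo=accel flag when compiling –
nvfortran -acc -Minfo=accel -o matrix_operation matrix_operation.f90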
To execute this code –
./matrix_operation
Sample output –
Matrix Operation with Different Parallelism Levels
==================================================
Matrix size: 1000 × 1000
Initializing matrices...
Running with DEFAULT parallelism...
Default time: 0.0971 seconds
Running with CUSTOM parallelism...
Custom time: 0.0050 seconds
Verifying results match...
✓ Results match perfectly!
Sample results:
result(1,1) = 4.10
result(2,1) = 6.10
result(3,1) = 8.10
Parallelism breakdown:
• 8 gangs = 8 independent teams
• 4 workers per gang = 32 total workers
• 32 vector length = 1024 parallel operations!
Let us consider another OpenACC code –
program parallelism_experiment
   ! Experiment with different parallelism settings
   implicit none
   integer, parameter :: n = 100000
   real :: data(n)
   real :: result1(n), result2(n), result3(n)
   integer :: i
   real :: start_time, end_time

   write(*,*) 'Parallelism Tuning Experiment'
   write(*,*) '============================='
   write(*,*) ''

   ! Initialize data
   do i = 1, n
      data(i) = real(i) * 3.14159 / 1000.0
   end do

   ! Experiment 1: Few gangs, many workers
   write(*,*) '1. Few gangs (2), many workers (32):'
   call cpu_time(start_time)
   !$acc parallel loop num_gangs(2) num_workers(32)
   do i = 1, n
      result1(i) = sin(data(i)) * cos(data(i))
   end do
   !$acc end parallel loop
   call cpu_time(end_time)
   write(*,'(A,F8.4,A)') ' Time: ', end_time - start_time, ' seconds'

   ! Experiment 2: Many gangs, few workers
   write(*,*) '2. Many gangs (32), few workers (2):'
   call cpu_time(start_time)
   !$acc parallel loop num_gangs(32) num_workers(2)
   do i = 1, n
      result2(i) = sin(data(i)) * cos(data(i))
   end do
   !$acc end parallel loop
   call cpu_time(end_time)
   write(*,'(A,F8.4,A)') ' Time: ', end_time - start_time, ' seconds'

   ! Experiment 3: Balanced approach
   write(*,*) '3. Balanced gangs (8), workers (8):'
   call cpu_time(start_time)
   !$acc parallel loop num_gangs(8) num_workers(8)
   do i = 1, n
      result3(i) = sin(data(i)) * cos(data(i))
   end do
   !$acc end parallel loop
   call cpu_time(end_time)
   write(*,'(A,F8.4,A)') ' Time: ', end_time - start_time, ' seconds'

   ! Verify all results are the same
   write(*,*) ''
   write(*,*) 'Verifying all methods give same results...'
   do i = 1, n, 1000   ! Check every 1000th element
      if (abs(result1(i) - result2(i)) > 1e-6 .or. &
          abs(result2(i) - result3(i)) > 1e-6) then
         write(*,*) 'ERROR: Results differ!'
         stop
      end if
   end do
   write(*,*) '✓ All results match!'

   ! Show the parallelism breakdown
   write(*,*) ''
   write(*,*) 'Understanding the parallelism:'
   write(*,*) ''
   write(*,*) 'Configuration 1: 2 gangs × 32 workers = 64 total workers'
   write(*,*) '• Each gang handles 50,000 elements'
   write(*,*) '• 32 workers per gang work together'
   write(*,*) ''
   write(*,*) 'Configuration 2: 32 gangs × 2 workers = 64 total workers'
   write(*,*) '• Each gang handles 3,125 elements'
   write(*,*) '• Only 2 workers per gang'
   write(*,*) ''
   write(*,*) 'Configuration 3: 8 gangs × 8 workers = 64 total workers'
   write(*,*) '• Each gang handles 12,500 elements'
   write(*,*) '• Balanced approach'
   write(*,*) ''
   write(*,*) 'Key insight: Same total workers, different organization!'
   write(*,*) 'Performance can vary based on your GPU architecture.'
end program parallelism_experiment
To compile this code –
nvfortran -acc -o parallelism_experiment parallelism_experiment.f90
To execute this code –
./parallelism_experiment
Sample output –
Parallelism Tuning Experiment
=============================
1. Few gangs (2), many workers (32):
Time: 0.0955 seconds
2. Many gangs (32), few workers (2):
Time: 0.0010 seconds
3. Balanced gangs (8), workers (8):
Time: 0.0010 seconds
Verifying all methods give same results...
✓ All results match!
Understanding the parallelism:
Configuration 1: 2 gangs × 32 workers = 64 total workers
• Each gang handles 50,000 elements
• 32 workers per gang work together
Configuration 2: 32 gangs × 2 workers = 64 total workers
• Each gang handles 3,125 elements
• Only 2 workers per gang
Configuration 3: 8 gangs × 8 workers = 64 total workers
• Each gang handles 12,500 elements
• Balanced approach
Key insight: Same total workers, different organization!
Performance can vary based on your GPU architecture.