What You’ll Learn Today
Imagine a construction site where you need to organize workers efficiently. You have teams (gangs), workers within each team, and each worker can handle multiple small tasks (vectors). That’s exactly how GPU parallelism works with OpenACC!
The Three-Level Parallelism Hierarchy
GPUs organize work in three levels, like a well-structured company:
GANG Level (Teams):       [Team 1]    [Team 2]    [Team 3]    [Team 4]
                              ↓           ↓           ↓           ↓
WORKER Level (Employees): [W1 W2]     [W1 W2]     [W1 W2]     [W1 W2]
                              ↓           ↓           ↓           ↓
VECTOR Level (Tasks):     [T T T T]   [T T T T]   [T T T T]   [T T T T]
Each level handles a different amount of work:
- Gangs: Big chunks of work (like different building floors)
- Workers: Medium chunks within each gang (like rooms on a floor)
- Vectors: Small chunks within each worker (like tiles in a room)
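You can also spell the three levels out yourself by putting the gang, worker, and vector clauses on the loops of a nested loop nest. The fragment below is only a sketch: the array field, the sizes nx, ny, nz, and the loop indices are assumed to be declared elsewhere.

!$acc parallel loop gang        ! gangs split the outermost loop
do k = 1, nz
   !$acc loop worker            ! workers split the middle loop within each gang
   do j = 1, ny
      !$acc loop vector         ! vector lanes split the innermost loop
      do i = 1, nx
         field(i,j,k) = field(i,j,k) * 2.0
      end do
   end do
end do
!$acc end parallel loop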
Understanding GPU Hardware
Think of your GPU like a large office building:
GPU Building:
┌─────────────────────────────────────────┐
│   Gang 1      Gang 2      Gang 3        │  ← Each floor is a Gang
│ ┌─────────┐ ┌─────────┐ ┌─────────┐     │
│ │ W1 │ W2 │ │ W1 │ W2 │ │ W1 │ W2 │     │  ← Workers in each Gang
│ │████│████│ │████│████│ │████│████│     │  ← Vector units (cores)
│ └─────────┘ └─────────┘ └─────────┘     │
└─────────────────────────────────────────┘
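If you want to peek at the building itself, the OpenACC runtime library (the openacc module) can at least tell you how many accelerator devices are available. Treat this as a rough sketch; the device-type constants your compiler provides may vary.

program query_devices
   use openacc            ! OpenACC runtime library routines
   implicit none
   integer :: ndev

   ! How many devices of the current default device type are present?
   ndev = acc_get_num_devices(acc_get_device_type())
   write(*,'(A,I0)') 'Accelerator devices found: ', ndev
end program query_devices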
Controlling Parallelism
You can tell OpenACC exactly how to organize the work:
!$acc parallel loop num_gangs(4) num_workers(8) vector_length(32)
do i = 1, n
   array(i) = array(i) * 2
end do
!$acc end parallel loop
This creates:
- 4 gangs (teams working independently)
- 8 workers per gang (32 total workers)
- 32 vector elements per worker (1024 total parallel operations!)
Visual: How Work Gets Distributed
Original Loop: do i = 1, 1024
Gang Distribution (4 gangs):
Gang 1: i = 1 to 256
Gang 2: i = 257 to 512
Gang 3: i = 513 to 768
Gang 4: i = 769 to 1024
Worker Distribution (8 workers per gang):
Gang 1, Worker 1: i = 1 to 32
Gang 1, Worker 2: i = 33 to 64
...
Vector Distribution (32 elements per worker):
Gang 1, Worker 1: processes i=1,2,3,...,32 simultaneously!
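If you like, you can mimic this bookkeeping on the CPU to see the chunk boundaries for yourself. The small program below simply reproduces the block-style split pictured above for 1024 iterations; it is only an illustration, since an OpenACC compiler is free to schedule iterations differently (for example, in an interleaved fashion).

program show_distribution
   implicit none
   integer, parameter :: n = 1024, gangs = 4, workers = 8
   integer :: g, w, chunk, wchunk, gstart, wstart

   chunk  = n / gangs          ! 256 iterations per gang
   wchunk = chunk / workers    ! 32 iterations per worker
   do g = 1, gangs
      gstart = (g - 1) * chunk + 1
      write(*,'(A,I0,A,I0,A,I0)') 'Gang ', g, ': i = ', gstart, ' to ', gstart + chunk - 1
      do w = 1, min(workers, 2)   ! print only the first two workers per gang
         wstart = gstart + (w - 1) * wchunk
         write(*,'(A,I0,A,I0,A,I0)') '  Worker ', w, ': i = ', wstart, ' to ', wstart + wchunk - 1
      end do
   end do
end program show_distribution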
When to Use Each Level
Use more GANGS when:
- You have lots of independent work
- Your problem size is very large
- You want maximum parallelism
Use more WORKERS when:
- You need coordination within gangs
- You’re working with shared memory
- Your algorithm has some dependencies
Use larger VECTOR_LENGTH when:
- You’re doing simple mathematical operations
- Your data access is very regular
- You want fine-grained parallelism
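For example, a simple element-wise update with regular, unit-stride access is a natural fit for a longer vector length. A sketch (x, y, n, and i are assumed to be declared elsewhere, and 128 is only a starting point to experiment with):

!$acc parallel loop gang vector vector_length(128)
do i = 1, n
   y(i) = 2.0 * x(i) + y(i)   ! simple arithmetic, regular unit-stride access
end do
!$acc end parallel loop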
Default vs Custom Settings
Default (Let OpenACC decide):
!$acc parallel loop ! OpenACC chooses best settings
Custom (You decide):
!$acc parallel loop num_gangs(8) num_workers(4) vector_length(128)
Think of default as “automatic transmission” and custom as “manual transmission” – both work, but manual gives you more control!
Matrix Operations Example
For a 2D matrix, you might organize work like this:
Matrix(100,100):
Option 1: Gangs handle rows
!$acc parallel loop num_gangs(100)
do i = 1, 100          ! Each gang handles one row
   do j = 1, 100       ! Sequential within each row
      matrix(i,j) = ...
   end do
end do
Option 2: Nested parallelism
!$acc parallel loop gang num_gangs(10)
do i = 1, 100          ! 10 gangs, each handles 10 rows
   !$acc loop worker
   do j = 1, 100       ! Workers handle columns
      matrix(i,j) = ...
   end do
end do
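A third variant (shown here only as a sketch, not a recommendation) uses all three levels at once: gangs and workers share the rows while vector lanes sweep along each row. Whether it beats the simpler forms depends on the matrix size and the GPU, so measure it.

!$acc parallel loop gang worker num_gangs(10) num_workers(10)
do i = 1, 100          ! rows shared by 10 gangs × 10 workers
   !$acc loop vector
   do j = 1, 100       ! vector lanes sweep along each row
      matrix(i,j) = ...
   end do
end do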
Performance Tuning Tips
Start Simple:
!$acc parallel loop ! Let OpenACC decide first
Then Experiment:
!$acc parallel loop num_gangs(4) ! Try different gang counts
!$acc parallel loop num_gangs(8)
!$acc parallel loop num_gangs(16)
Measure Performance:
- Time your code with different settings
- Use what works best for YOUR problem
- Different GPUs prefer different settings
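For the timing itself, one simple option is the system_clock intrinsic, which measures wall-clock time. A minimal sketch around a generic loop (n, i, and array are assumed to be declared elsewhere):

integer :: count_start, count_end, count_rate
real    :: elapsed

call system_clock(count_start, count_rate)
!$acc parallel loop
do i = 1, n
   array(i) = array(i) * 2.0
end do
!$acc end parallel loop
call system_clock(count_end)

elapsed = real(count_end - count_start) / real(count_rate)
write(*,'(A,F8.4,A)') 'Elapsed: ', elapsed, ' seconds'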
Key Terms to Remember
- Gang: Independent team working on part of the problem
- Worker: Member of a gang, can share some resources
- Vector: Fine-grained parallelism within a worker
- num_gangs: How many independent teams to create
- num_workers: How many workers in each team
- vector_length: How many elements each worker processes simultaneously
Example Code
Let us consider the following OpenACC code –
program matrix_operation
   ! Demonstrates different parallelism levels with matrix operations
   implicit none
   integer, parameter :: rows = 1000, cols = 1000
   real :: matrix_a(rows, cols)
   real :: matrix_b(rows, cols)
   real :: result_default(rows, cols)
   real :: result_custom(rows, cols)
   integer :: i, j
   real :: start_time, end_time

   write(*,*) 'Matrix Operation with Different Parallelism Levels'
   write(*,*) '=================================================='
   write(*,'(A,I0,A,I0)') 'Matrix size: ', rows, ' × ', cols
   write(*,*) ''

   ! Initialize matrices
   write(*,*) 'Initializing matrices...'
   do i = 1, rows
      do j = 1, cols
         matrix_a(i,j) = real(i) + real(j) * 0.1
         matrix_b(i,j) = real(i) * 0.5 + real(j)
      end do
   end do

   ! Method 1: Default parallelism (let OpenACC decide)
   write(*,*) 'Running with DEFAULT parallelism...'
   call cpu_time(start_time)
   !$acc parallel loop
   do i = 1, rows
      do j = 1, cols
         result_default(i,j) = matrix_a(i,j) + matrix_b(i,j) * 2.0
      end do
   end do
   !$acc end parallel loop
   call cpu_time(end_time)
   write(*,'(A,F8.4,A)') 'Default time: ', end_time - start_time, ' seconds'

   ! Method 2: Custom parallelism (you decide)
   write(*,*) 'Running with CUSTOM parallelism...'
   call cpu_time(start_time)
   !$acc parallel loop num_gangs(8) num_workers(4) vector_length(32)
   do i = 1, rows
      do j = 1, cols
         result_custom(i,j) = matrix_a(i,j) + matrix_b(i,j) * 2.0
      end do
   end do
   !$acc end parallel loop
   call cpu_time(end_time)
   write(*,'(A,F8.4,A)') 'Custom time: ', end_time - start_time, ' seconds'

   ! Verify results are the same
   write(*,*) ''
   write(*,*) 'Verifying results match...'
   do i = 1, min(5, rows)
      do j = 1, min(5, cols)
         if (abs(result_default(i,j) - result_custom(i,j)) > 1e-6) then
            write(*,*) 'ERROR: Results do not match!'
            stop
         end if
      end do
   end do
   write(*,*) '✓ Results match perfectly!'

   ! Show sample results
   write(*,*) ''
   write(*,*) 'Sample results:'
   do i = 1, 3
      write(*,'(A,I0,A,F8.2)') 'result(', i, ',1) = ', result_default(i,1)
   end do

   write(*,*) ''
   write(*,*) 'Parallelism breakdown:'
   write(*,*) '• 8 gangs = 8 independent teams'
   write(*,*) '• 4 workers per gang = 32 total workers'
   write(*,*) '• 32 vector length = 1024 parallel operations!'
   write(*,*) ''
end program matrix_operation
To compile this code –
nvfortran -acc -o matrix_operation matrix_operation.f90
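If you also want to see how the compiler mapped each loop onto gangs, workers, and vectors, add the -Minfo=accel flag when compiling –
nvfortran -acc -Minfo=accel -o matrix_operation matrix_operation.f90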
To execute this code –
./matrix_operation
Sample output –
Matrix Operation with Different Parallelism Levels
==================================================
Matrix size: 1000 × 1000
Initializing matrices...
Running with DEFAULT parallelism...
Default time: 0.0971 seconds
Running with CUSTOM parallelism...
Custom time: 0.0050 seconds
Verifying results match...
✓ Results match perfectly!
Sample results:
result(1,1) = 4.10
result(2,1) = 6.10
result(3,1) = 8.10
Parallelism breakdown:
• 8 gangs = 8 independent teams
• 4 workers per gang = 32 total workers
• 32 vector length = 1024 parallel operations!
Let us consider another OpenACC code –
program parallelism_experiment
   ! Experiment with different parallelism settings
   implicit none
   integer, parameter :: n = 100000
   real :: data(n)
   real :: result1(n), result2(n), result3(n)
   integer :: i
   real :: start_time, end_time

   write(*,*) 'Parallelism Tuning Experiment'
   write(*,*) '============================='
   write(*,*) ''

   ! Initialize data
   do i = 1, n
      data(i) = real(i) * 3.14159 / 1000.0
   end do

   ! Experiment 1: Few gangs, many workers
   write(*,*) '1. Few gangs (2), many workers (32):'
   call cpu_time(start_time)
   !$acc parallel loop num_gangs(2) num_workers(32)
   do i = 1, n
      result1(i) = sin(data(i)) * cos(data(i))
   end do
   !$acc end parallel loop
   call cpu_time(end_time)
   write(*,'(A,F8.4,A)') ' Time: ', end_time - start_time, ' seconds'

   ! Experiment 2: Many gangs, few workers
   write(*,*) '2. Many gangs (32), few workers (2):'
   call cpu_time(start_time)
   !$acc parallel loop num_gangs(32) num_workers(2)
   do i = 1, n
      result2(i) = sin(data(i)) * cos(data(i))
   end do
   !$acc end parallel loop
   call cpu_time(end_time)
   write(*,'(A,F8.4,A)') ' Time: ', end_time - start_time, ' seconds'

   ! Experiment 3: Balanced approach
   write(*,*) '3. Balanced gangs (8), workers (8):'
   call cpu_time(start_time)
   !$acc parallel loop num_gangs(8) num_workers(8)
   do i = 1, n
      result3(i) = sin(data(i)) * cos(data(i))
   end do
   !$acc end parallel loop
   call cpu_time(end_time)
   write(*,'(A,F8.4,A)') ' Time: ', end_time - start_time, ' seconds'

   ! Verify all results are the same
   write(*,*) ''
   write(*,*) 'Verifying all methods give same results...'
   do i = 1, n, 1000   ! Check every 1000th element
      if (abs(result1(i) - result2(i)) > 1e-6 .or. &
          abs(result2(i) - result3(i)) > 1e-6) then
         write(*,*) 'ERROR: Results differ!'
         stop
      end if
   end do
   write(*,*) '✓ All results match!'

   ! Show the parallelism breakdown
   write(*,*) ''
   write(*,*) 'Understanding the parallelism:'
   write(*,*) ''
   write(*,*) 'Configuration 1: 2 gangs × 32 workers = 64 total workers'
   write(*,*) '• Each gang handles 50,000 elements'
   write(*,*) '• 32 workers per gang work together'
   write(*,*) ''
   write(*,*) 'Configuration 2: 32 gangs × 2 workers = 64 total workers'
   write(*,*) '• Each gang handles 3,125 elements'
   write(*,*) '• Only 2 workers per gang'
   write(*,*) ''
   write(*,*) 'Configuration 3: 8 gangs × 8 workers = 64 total workers'
   write(*,*) '• Each gang handles 12,500 elements'
   write(*,*) '• Balanced approach'
   write(*,*) ''
   write(*,*) 'Key insight: Same total workers, different organization!'
   write(*,*) 'Performance can vary based on your GPU architecture.'
end program parallelism_experiment
To compile this code –
nvfortran -acc -o parallelism_experiment parallelism_experiment.f90
To execute this code –
./parallelism_experiment
Sample output –
Parallelism Tuning Experiment
=============================
1. Few gangs (2), many workers (32):
Time: 0.0955 seconds
2. Many gangs (32), few workers (2):
Time: 0.0010 seconds
3. Balanced gangs (8), workers (8):
Time: 0.0010 seconds
Verifying all methods give same results...
✓ All results match!
Understanding the parallelism:
Configuration 1: 2 gangs × 32 workers = 64 total workers
• Each gang handles 50,000 elements
• 32 workers per gang work together
Configuration 2: 32 gangs × 2 workers = 64 total workers
• Each gang handles 3,125 elements
• Only 2 workers per gang
Configuration 3: 8 gangs × 8 workers = 64 total workers
• Each gang handles 12,500 elements
• Balanced approach
Key insight: Same total workers, different organization!
Performance can vary based on your GPU architecture.