Think of GPU programming like moving to a new house. You need to pack your stuff (data), move it to the new place (GPU), do your work there, and then decide what to bring back. OpenACC data clauses are like moving instructions!
The Three Essential Data Movements
COPYIN: “Pack this and take it TO the new house (GPU)”
CPU Memory → GPU Memory
[Input Data] → [Input Data]
COPYOUT: “Pack this and bring it BACK from the new house (GPU)”
GPU Memory → CPU Memory
[Results] → [Results]
COPY: “Pack this, take it there, AND bring it back”
CPU ↔ GPU
[Data] ↔ [Modified Data]
Understanding Array Section Notation
Fortran lets you specify exactly which parts of an array to transfer:
! Transfer entire array
copyin(array(1:n))
! Transfer first half only
copyin(array(1:n/2))
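! Every other element: array(1:n:2) is valid Fortran but not contiguous.
! Data clauses generally expect contiguous sections, so transfer the
! enclosing range and apply the stride inside the loop
copyin(array(1:n))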
! Transfer a 2D section
copyin(matrix(1:10, 1:20))
It’s like telling the movers: “Take only the books from shelf 1 to 5, not the whole library!”
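As a minimal sketch of the notation in use, the program below names only the first half of an array in a data clause and then checks that the second half was left alone. The program name, the array size, and the use of `copy` are illustrative choices, not taken from the original examples.

```fortran
program section_demo
  ! Sketch: only a(1:n/2) is named in the data clause.
  implicit none
  integer, parameter :: n = 16        ! hypothetical size for illustration
  real :: a(n)
  integer :: i

  a = 1.0

  ! Only the first half is copied to the device and back, so the loop
  ! must stay inside that index range.
  !$acc parallel loop copy(a(1:n/2))
  do i = 1, n/2
     a(i) = a(i) * 2.0
  end do

  ! The first half was updated; the second half was never touched.
  if (any(a(1:n/2) /= 2.0) .or. any(a(n/2+1:n) /= 1.0)) error stop 'unexpected result'
  print '(A,16F4.1)', 'a = ', a
end program section_demo
```

It should build the same way as the other examples here (for instance with `nvfortran -acc`). Accessing an index outside the named section from inside the loop would be an error on the device, so the loop bounds and the section bounds are kept identical.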
Visual: Data Movement with Array Sections
CPU Array: [1][2][3][4][5][6][7][8][9][10]
└─────────┘ └─────────┘
Transfer(1:4) Transfer(7:10)
GPU Memory: [1][2][3][4] [7][8][9][10]
Only the selected sections!
When to Use Each Data Clause
Use COPYIN when:
- Array contains input data that won’t change
- You only read from the array in your loop
- No need to bring the data back
Use COPYOUT when:
- Array stores results from GPU computation
- Array's initial CPU values are not needed (the GPU copy starts uninitialized)
- You only write to the array in your loop
Use COPY when:
- Array contains data that gets modified
- You both read from AND write to the array
- Need the modified results back on CPU
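The rules above can be sketched in one small loop that uses all three clauses, with each array chosen according to how the loop touches it. The array names and size are hypothetical and chosen only for illustration.

```fortran
program clause_choice
  ! Sketch: pick the clause from how each array is accessed in the loop.
  implicit none
  integer, parameter :: n = 1000      ! hypothetical size
  real :: x(n), y(n), total(n)
  integer :: i

  x = 3.0          ! read only in the loop   -> copyin
  total = 1.0      ! read and written        -> copy
                   ! y is written only       -> copyout

  !$acc parallel loop copyin(x(1:n)) copyout(y(1:n)) copy(total(1:n))
  do i = 1, n
     y(i)     = 2.0 * x(i)
     total(i) = total(i) + x(i)
  end do

  ! Both results must be visible on the CPU after the region.
  if (any(y /= 6.0) .or. any(total /= 4.0)) error stop 'unexpected result'
  print *, 'y(1) =', y(1), ' total(1) =', total(1)
end program clause_choice
```

Using `copy` for every array would also produce correct results; the narrower clauses only remove transfers that the loop does not need.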
Memory Efficiency with Array Sections
Instead of transferring huge arrays entirely, transfer only what you need:
! INEFFICIENT: Transfer entire 1 million element array
copyin(huge_array(1:1000000))
! EFFICIENT: Transfer only the part you use
copyin(huge_array(1:1000)) ! Only first 1000 elements
This is like packing only summer clothes when moving to a beach house – why take winter coats?
Combining Multiple Data Clauses
You can use multiple clauses in one directive:
!$acc parallel loop copyin(input_a(1:n), input_b(1:n)) copyout(result(1:n))
do i = 1, n
result(i) = input_a(i) + input_b(i)
end do
!$acc end parallel loop
Think of it as giving different instructions to different movers:
- Mover 1: “Take input_a and input_b TO the new house”
- Mover 2: “Bring result BACK from the new house”
2D Array Sections
For matrices, you can transfer rectangular sections:
! Transfer top-left 50×50 block
copyin(matrix(1:50, 1:50))
! Transfer entire rows 10-20
copyin(matrix(10:20, 1:cols))
! Transfer entire columns 5-15
copyin(matrix(1:rows, 5:15))
Visual for 2D sections:
Original Matrix (100×100):
┌─────────────────────────┐
│ ████████ │ ← Transfer this block
│ ████████ │ (rows 1:8, cols 1:8)
│ ████████ │
│ │
│ │
└─────────────────────────┘
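A short sketch of a 2D section in a complete program follows. It transfers a block of whole columns, which is contiguous in memory because Fortran stores arrays column by column. How a compiler handles non-contiguous blocks (for example, a few rows of every column) may vary, so that case is deliberately avoided here. The program name and matrix size are assumptions for illustration.

```fortran
program column_block
  ! Sketch: transfer and update only columns 5..15 of a matrix.
  implicit none
  integer, parameter :: rows = 100, cols = 100   ! hypothetical sizes
  real :: matrix(rows, cols)
  integer :: i, j

  matrix = 1.0

  ! Whole columns 5..15 form one contiguous block in column-major layout.
  !$acc parallel loop copy(matrix(1:rows, 5:15))
  do j = 5, 15
     do i = 1, rows
        matrix(i, j) = matrix(i, j) * 2.0
     end do
  end do

  ! Inside the block the values doubled; outside it nothing changed.
  if (any(matrix(:, 5:15) /= 2.0)) error stop 'block not updated'
  if (any(matrix(:, 1:4) /= 1.0) .or. any(matrix(:, 16:cols) /= 1.0)) error stop 'outside block changed'
  print *, 'columns 5:15 doubled, remaining columns unchanged'
end program column_block
```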
Performance Impact
Good Practice: Transfer only needed data
! Working with first 1000 elements only
copyin(data(1:1000)) ! Fast: small transfer
Bad Practice: Transfer everything unnecessarily
! Working with first 1000 elements but transferring all
copyin(data(1:1000000)) ! Slow: huge unnecessary transfer
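One way to see the difference on a given machine is to time both variants. The sketch below uses the standard `system_clock` intrinsic and a warm-up region so that device start-up cost is not charged to the first measurement. The names and sizes are hypothetical, and the timings depend entirely on the hardware and compiler; when the code is built without GPU offload, no transfer happens and the two numbers say little.

```fortran
program transfer_timing
  ! Sketch: compare a whole-array copyin with a section-only copyin.
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: total = 1000000   ! full array size
  integer, parameter :: used  = 1000      ! elements the loop touches
  real, allocatable :: samples(:)
  real :: out(used)
  integer :: i
  integer(int64) :: t0, t1, rate

  allocate(samples(total))
  samples = 1.0

  ! Warm-up region: absorbs one-time device initialization.
  !$acc parallel loop copyout(out(1:used))
  do i = 1, used
     out(i) = 0.0
  end do

  ! Variant 1: transfer the whole array although only 'used' elements are read.
  call system_clock(t0, rate)
  !$acc parallel loop copyin(samples(1:total)) copyout(out(1:used))
  do i = 1, used
     out(i) = samples(i) + 1.0
  end do
  call system_clock(t1)
  print '(A,ES10.3,A)', 'whole array  : ', real(t1 - t0) / real(rate), ' s'

  ! Variant 2: transfer only the section the loop reads.
  call system_clock(t0)
  !$acc parallel loop copyin(samples(1:used)) copyout(out(1:used))
  do i = 1, used
     out(i) = samples(i) + 1.0
  end do
  call system_clock(t1)
  print '(A,ES10.3,A)', 'section only : ', real(t1 - t0) / real(rate), ' s'
end program transfer_timing
```

A single run is noisy, so repeating each variant several times and comparing the averages gives a more reliable picture.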
Key Concepts to Remember
- Copyin: Input data only (CPU → GPU)
- Copyout: Output data only (GPU → CPU)
- Copy: Input/output data (CPU ↔ GPU)
- Array Sections: Transfer only the parts you need
- Memory Efficiency: Smaller transfers = faster programs
Example Codes
Let us consider the following OpenACC code –
program vector_addition
! Perfect example of copyin and copyout usage
! Two input arrays (copyin) + one result array (copyout)
implicit none
integer, parameter :: n = 50000
real :: vector_a(n) ! Input array 1
real :: vector_b(n) ! Input array 2
real :: vector_sum(n) ! Result array
integer :: i
write(*,*) 'Vector Addition with Explicit Data Clauses'
write(*,*) '=========================================='
write(*,*) ''
! Initialize input vectors on CPU
write(*,*) 'Initializing input vectors on CPU...'
do i = 1, n
vector_a(i) = real(i) * 2.5
vector_b(i) = real(i) * 1.8 + 10.0
end do
write(*,*) '✓ Input data ready on CPU'
! Clear result array (optional, but good practice)
vector_sum = 0.0
write(*,*) ''
write(*,*) 'Performing vector addition on GPU...'
write(*,*) 'Data movement plan:'
write(*,*) '• vector_a(1:n) → GPU (copyin)'
write(*,*) '• vector_b(1:n) → GPU (copyin)'
write(*,*) '• vector_sum(1:n) ← GPU (copyout)'
! Vector addition with explicit data movement
!$acc parallel loop copyin(vector_a(1:n), vector_b(1:n)) copyout(vector_sum(1:n))
do i = 1, n
vector_sum(i) = vector_a(i) + vector_b(i)
end do
!$acc end parallel loop
write(*,*) '✓ GPU computation complete!'
write(*,*) '✓ Results copied back to CPU automatically!'
! Verify results
write(*,*) ''
write(*,*) 'Verification (first 10 elements):'
do i = 1, 10
write(*,'(A,I0,A,F8.2,A,F8.2,A,F8.2)') 'Element ', i, ': ', &
vector_a(i), ' + ', vector_b(i), ' = ', vector_sum(i)
end do
! Manual verification of a few elements
write(*,*) ''
write(*,*) 'Manual check:'
write(*,'(A,F8.2,A,F8.2,A,F8.2)') 'vector_a(1) + vector_b(1) = ', &
vector_a(1), ' + ', vector_b(1), ' = ', vector_a(1) + vector_b(1)
write(*,'(A,F8.2)') 'GPU result vector_sum(1) = ', vector_sum(1)
if (abs(vector_sum(1) - (vector_a(1) + vector_b(1))) < 1e-6) then
write(*,*) '✓ Results are correct!'
else
write(*,*) '✗ Something went wrong!'
end if
write(*,*) ''
write(*,*) 'Key points:'
write(*,*) '• Input arrays were copied TO GPU (copyin)'
write(*,*) '• Result array was copied FROM GPU (copyout)'
write(*,*) '• No unnecessary data transfers!'
end program vector_addition
To compile this code –
nvfortran -acc -o vector_addition vector_addition.f90
To execute this code –
./vector_addition
Sample output –
Vector Addition with Explicit Data Clauses
==========================================
Initializing input vectors on CPU...
✓ Input data ready on CPU
Performing vector addition on GPU...
Data movement plan:
• vector_a(1:n) → GPU (copyin)
• vector_b(1:n) → GPU (copyin)
• vector_sum(1:n) ← GPU (copyout)
✓ GPU computation complete!
✓ Results copied back to CPU automatically!
Verification (first 10 elements):
Element 1: 2.50 + 11.80 = 14.30
Element 2: 5.00 + 13.60 = 18.60
Element 3: 7.50 + 15.40 = 22.90
Element 4: 10.00 + 17.20 = 27.20
Element 5: 12.50 + 19.00 = 31.50
Element 6: 15.00 + 20.80 = 35.80
Element 7: 17.50 + 22.60 = 40.10
Element 8: 20.00 + 24.40 = 44.40
Element 9: 22.50 + 26.20 = 48.70
Element 10: 25.00 + 28.00 = 53.00
Manual check:
vector_a(1) + vector_b(1) = 2.50 + 11.80 = 14.30
GPU result vector_sum(1) = 14.30
✓ Results are correct!
Key points:
• Input arrays were copied TO GPU (copyin)
• Result array was copied FROM GPU (copyout)
• No unnecessary data transfers!
Let us consider another OpenACC code –
program array_sections
! Demonstrates efficient data transfer using array sections
implicit none
integer, parameter :: total_size = 100000
integer, parameter :: work_size = 10000 ! We only work with first 10,000 elements
real :: large_array(total_size) ! Large array
real :: input_section(work_size) ! Section we actually use
real :: output_section(work_size) ! Results
integer :: i
write(*,*) 'Efficient Data Transfer with Array Sections'
write(*,*) '==========================================='
write(*,'(A,I0)') 'Total array size: ', total_size
write(*,'(A,I0,A)') 'Working with only: ', work_size, ' elements'
write(*,*) ''
! Initialize the large array
write(*,*) 'Initializing large array...'
do i = 1, total_size
large_array(i) = real(i) * 3.14159 / 1000.0
end do
! Copy only the section we need for input
write(*,*) 'Copying working section from large array...'
do i = 1, work_size
input_section(i) = large_array(i)
end do
write(*,*) ''
write(*,*) 'Method 1: INEFFICIENT - Transfer entire large array'
write(*,*) '(This is what NOT to do!)'
! INEFFICIENT: Transfer the entire large array
!$acc parallel loop copyin(large_array(1:total_size)) copyout(output_section(1:work_size))
do i = 1, work_size ! Only use first 10,000 elements!
output_section(i) = sin(large_array(i)) * 2.0
end do
!$acc end parallel loop
write(*,'(A,I0,A)') 'Transferred ', total_size, ' elements (wasteful!)'
! Reset output for fair comparison
output_section = 0.0
write(*,*) ''
write(*,*) 'Method 2: EFFICIENT - Transfer only needed section'
write(*,*) '(This is the right way!)'
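! EFFICIENT: Transfer only the section we use.
! Naming the section directly, copyin(large_array(1:work_size)),
! would have the same effect without the extra CPU-side copy above.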
!$acc parallel loop copyin(input_section(1:work_size)) copyout(output_section(1:work_size))
do i = 1, work_size
output_section(i) = sin(input_section(i)) * 2.0
end do
!$acc end parallel loop
write(*,'(A,I0,A)') 'Transferred ', work_size, ' elements (efficient!)'
! Show results
write(*,*) ''
write(*,*) 'Sample results:'
do i = 1, 5
write(*,'(A,I0,A,F10.6,A,F10.6)') 'Element ', i, ': ', &
input_section(i), ' → ', output_section(i)
end do
! Calculate efficiency improvement
write(*,*) ''
write(*,*) 'Efficiency Analysis:'
write(*,'(A,F6.2,A)') 'Inefficient method transferred ', &
real(total_size) / real(work_size), 'x more data than needed!'
write(*,'(A,I0,A)') 'Data transfer reduction: ', &
total_size - work_size, ' fewer elements transferred'
write(*,*) ''
write(*,*) 'Key lesson: Always transfer only the data you actually use!'
write(*,*) 'Array sections like array(1:n) are your friend for efficiency.'
end program array_sections
To compile this code –
nvfortran -acc -o array_sections array_sections.f90
To execute this code –
./array_sections
Sample output –
Efficient Data Transfer with Array Sections
===========================================
Total array size: 100000
Working with only: 10000 elements
Initializing large array...
Copying working section from large array...
Method 1: INEFFICIENT - Transfer entire large array
(This is what NOT to do!)
Transferred 100000 elements (wasteful!)
Method 2: EFFICIENT - Transfer only needed section
(This is the right way!)
Transferred 10000 elements (efficient!)
Sample results:
Element 1: 0.003142 → 0.006283
Element 2: 0.006283 → 0.012566
Element 3: 0.009425 → 0.018849
Element 4: 0.012566 → 0.025132
Element 5: 0.015708 → 0.031415
Efficiency Analysis:
Inefficient method transferred 10.00x more data than needed!
Data transfer reduction: 90000 fewer elements transferred
Key lesson: Always transfer only the data you actually use!
Array sections like array(1:n) are your friend for efficiency.
Let us consider one more OpenACC code –
program copy_clause_example
! Demonstrates the 'copy' clause for data that's both input and output
implicit none
integer, parameter :: n = 20
real :: data_array(n) ! Array that will be both read and modified
integer :: i
write(*,*) 'Copy Clause Example - Modify Array In-Place'
write(*,*) '=========================================='
write(*,*) ''
! Initialize array with some data
write(*,*) 'Initial array values:'
do i = 1, n
data_array(i) = real(i) * 5.0
write(*,'(A,I0,A,F6.1)') 'data_array(', i, ') = ', data_array(i)
end do
write(*,*) ''
write(*,*) 'Modifying array on GPU using COPY clause...'
write(*,*) 'COPY means: send TO GPU, then bring BACK after modification'
! Use COPY clause because we both READ and WRITE the same array
!$acc parallel loop copy(data_array(1:n))
do i = 1, n
! Read the current value AND write a new value
data_array(i) = data_array(i) * 2.0 + 10.0
end do
!$acc end parallel loop
write(*,*) '✓ Array modified on GPU and copied back!'
write(*,*) ''
write(*,*) 'Modified array values:'
do i = 1, n
write(*,'(A,I0,A,F6.1)') 'data_array(', i, ') = ', data_array(i)
end do
write(*,*) ''
write(*,*) 'What happened step by step:'
write(*,*) '1. data_array copied FROM CPU TO GPU (input)'
write(*,*) '2. GPU reads old values and computes new values'
write(*,*) '3. data_array copied FROM GPU TO CPU (output)'
write(*,*) ''
write(*,*) 'This is why we used COPY instead of COPYIN or COPYOUT:'
write(*,*) '• COPYIN only: array would not come back (lose results!)'
write(*,*) '• COPYOUT only: array starts empty on GPU (wrong input!)'
write(*,*) '• COPY: perfect for read-modify-write operations!'
! Demonstrate the transformation
write(*,*) ''
write(*,*) 'Transformation applied: new_value = old_value × 2.0 + 10.0'
write(*,*) 'Examples:'
write(*,*) '• 5.0 → 5.0 × 2.0 + 10.0 = 20.0'
write(*,*) '• 10.0 → 10.0 × 2.0 + 10.0 = 30.0'
write(*,*) '• 15.0 → 15.0 × 2.0 + 10.0 = 40.0'
end program copy_clause_example
To compile this code –
nvfortran -acc -o copy_clause_example copy_clause_example.f90
To execute this code –
./copy_clause_example
Sample output –
Copy Clause Example - Modify Array In-Place
==========================================
Initial array values:
data_array(1) = 5.0
data_array(2) = 10.0
data_array(3) = 15.0
data_array(4) = 20.0
data_array(5) = 25.0
data_array(6) = 30.0
data_array(7) = 35.0
data_array(8) = 40.0
data_array(9) = 45.0
data_array(10) = 50.0
data_array(11) = 55.0
data_array(12) = 60.0
data_array(13) = 65.0
data_array(14) = 70.0
data_array(15) = 75.0
data_array(16) = 80.0
data_array(17) = 85.0
data_array(18) = 90.0
data_array(19) = 95.0
data_array(20) = 100.0
Modifying array on GPU using COPY clause...
COPY means: send TO GPU, then bring BACK after modification
✓ Array modified on GPU and copied back!
Modified array values:
data_array(1) = 20.0
data_array(2) = 30.0
data_array(3) = 40.0
data_array(4) = 50.0
data_array(5) = 60.0
data_array(6) = 70.0
data_array(7) = 80.0
data_array(8) = 90.0
data_array(9) = 100.0
data_array(10) = 110.0
data_array(11) = 120.0
data_array(12) = 130.0
data_array(13) = 140.0
data_array(14) = 150.0
data_array(15) = 160.0
data_array(16) = 170.0
data_array(17) = 180.0
data_array(18) = 190.0
data_array(19) = 200.0
data_array(20) = 210.0
What happened step by step:
1. data_array copied FROM CPU TO GPU (input)
2. GPU reads old values and computes new values
3. data_array copied FROM GPU TO CPU (output)
This is why we used COPY instead of COPYIN or COPYOUT:
• COPYIN only: array would not come back (lose results!)
• COPYOUT only: array starts empty on GPU (wrong input!)
• COPY: perfect for read-modify-write operations!
Transformation applied: new_value = old_value × 2.0 + 10.0
Examples:
• 5.0 → 5.0 × 2.0 + 10.0 = 20.0
• 10.0 → 10.0 × 2.0 + 10.0 = 30.0
• 15.0 → 15.0 × 2.0 + 10.0 = 40.0