Data Clauses - Copyin, Copyout, Copy with Fortran Arrays

Think of GPU programming like moving to a new house. You need to pack your stuff (data), move it to the new place (GPU), do your work there, and then decide what to bring back. OpenACC data clauses are like moving instructions!

The Three Essential Data Movements

COPYIN: “Pack this and take it TO the new house (GPU)”

CPU Memory      →     GPU Memory
[Input Data]    →    [Input Data]

COPYOUT: “Pack this and bring it BACK from the new house (GPU)”

GPU Memory    →      CPU Memory  
[Results]     →      [Results]

COPY: “Pack this, take it there, AND bring it back”

CPU       ↔          GPU
[Data]    ↔    [Modified Data]

Understanding Array Section Notation

Fortran lets you specify exactly which parts of an array to transfer:

! Transfer entire array
copyin(array(1:n))

! Transfer first half only
copyin(array(1:n/2))

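! Strided sections such as array(1:n:2) are NOT valid in data clauses:
! a section takes only lower:upper bounds and must be contiguous in memory

! Transfer a 2D section (contiguous only if matrix has exactly 10 rows)
copyin(matrix(1:10, 1:20))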

It’s like telling the movers: “Take only the books from shelf 1 to 5, not the whole library!”
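The notation above can be tried in a small complete program. This is a minimal sketch rather than a reference implementation: the names section_notation, values, doubled, and half are made up for illustration. Only the first half of values is named in the copyin clause, so the second half never leaves the CPU.

```fortran
program section_notation
  ! Sketch: send only the first half of an array to the GPU.
  ! The names values, doubled, and half are illustrative only.
  implicit none

  integer, parameter :: n = 8
  integer, parameter :: half = n / 2
  real :: values(n)        ! Full array on the CPU
  real :: doubled(half)    ! Results for the first half only
  integer :: i

  do i = 1, n
    values(i) = real(i)
  end do

  ! Only values(1:half) is copied in; values(half+1:n) stays on the CPU
  !$acc parallel loop copyin(values(1:half)) copyout(doubled(1:half))
  do i = 1, half
    doubled(i) = 2.0 * values(i)
  end do
  !$acc end parallel loop

  write(*,'(4F6.1)') doubled
end program section_notation
```

The program should print 2.0, 4.0, 6.0, and 8.0. Because OpenACC directives are comments to a compiler that does not enable them, the same source also builds and runs as plain Fortran on the CPU.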

Visual: Data Movement with Array Sections

CPU Array: [1][2][3][4][5][6][7][8][9][10]
           └─────────┘        └─────────┘
           Transfer(1:4)     Transfer(7:10)

GPU Memory:     [1][2][3][4]    [7][8][9][10]
                Only the selected sections!

When to Use Each Data Clause

Use COPYIN when:

Array contains input data that won’t change
You only read from the array in your loop
No need to bring the data back

Use COPYOUT when:

Array stores results from GPU computation
Array's initial contents do not matter (the GPU copy starts uninitialized)
You only write to the array in your loop

Use COPY when:

Array contains data that gets modified
You both read from AND write to the array
Need the modified results back on CPU
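The three rules above can be sketched in a single loop that uses one array of each kind. This is an illustrative example, not taken from the original tutorial code: the names choose_clauses, x, y, and z are assumptions chosen for brevity.

```fortran
program choose_clauses
  ! Sketch: pick a data clause from how each array is used in the loop.
  ! The names x, y, and z are illustrative only.
  implicit none

  integer, parameter :: n = 1000
  real :: x(n)    ! Only read in the loop        -> copyin
  real :: y(n)    ! Read AND written in the loop -> copy
  real :: z(n)    ! Only written in the loop     -> copyout
  integer :: i

  do i = 1, n
    x(i) = real(i)
    y(i) = 1.0
  end do

  !$acc parallel loop copyin(x(1:n)) copy(y(1:n)) copyout(z(1:n))
  do i = 1, n
    y(i) = y(i) + 2.0 * x(i)   ! Needs the old y, so y must be copied in
    z(i) = x(i) * x(i)         ! Old z is never read, so copyout is enough
  end do
  !$acc end parallel loop

  write(*,'(A,F8.1)')  'y(n) = ', y(n)
  write(*,'(A,F10.1)') 'z(n) = ', z(n)
end program choose_clauses
```

With n = 1000 the program should report y(n) = 2001.0 and z(n) = 1000000.0. Swapping copy(y(1:n)) for copyout(y(1:n)) would make the loop read an uninitialized GPU copy of y, which is the mistake the rules above are meant to prevent.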

Memory Efficiency with Array Sections

Instead of transferring huge arrays entirely, transfer only what you need:

! INEFFICIENT: Transfer entire 1 million element array
copyin(huge_array(1:1000000))

! EFFICIENT: Transfer only the part you use
copyin(huge_array(1:1000))  ! Only first 1000 elements

This is like packing only summer clothes when moving to a beach house – why take winter coats?
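To put rough numbers on the difference, the transfer size is the element count times the size of one element. The sketch below uses the standard storage_size intrinsic to get the element size; the program and variable names are illustrative, and the byte counts assume the compiler's default real kind (commonly 4 bytes).

```fortran
program transfer_size
  ! Sketch: estimate how many bytes each copyin from the example above moves.
  ! The names total_elems, used_elems, and sample are illustrative only.
  implicit none

  integer, parameter :: total_elems = 1000000   ! Whole array
  integer, parameter :: used_elems = 1000       ! Section actually used
  real :: sample                                ! Stands in for one array element
  integer :: bytes_per_elem

  ! storage_size returns the size in bits, so divide by 8 for bytes
  bytes_per_elem = storage_size(sample) / 8

  write(*,'(A,I0,A)') 'Whole array:  ', total_elems * bytes_per_elem, ' bytes'
  write(*,'(A,I0,A)') 'Section only: ', used_elems * bytes_per_elem, ' bytes'
end program transfer_size
```

With a 4-byte real this works out to 4000000 bytes for the whole array against 4000 bytes for the section.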

Combining Multiple Data Clauses

You can use multiple clauses in one directive:

!$acc parallel loop copyin(input_a(1:n), input_b(1:n)) copyout(result(1:n))
do i = 1, n
  result(i) = input_a(i) + input_b(i)
end do
!$acc end parallel loop

Think of it as giving different instructions to different movers:

Mover 1: “Take input_a and input_b TO the new house”
Mover 2: “Bring result BACK from the new house”

2D Array Sections

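For matrices, you can name rectangular sections. Fortran stores arrays column by column, so only a range of whole columns forms one contiguous block of memory. The OpenACC specification requires sections in data clauses to be contiguous, so a compiler may reject the non-contiguous forms below or transfer more data than the section names. Check the compiler feedback before relying on them.

! Transfer entire columns 5-15 (contiguous)
copyin(matrix(1:rows, 5:15))

! Transfer top-left 50×50 block (contiguous only if matrix has 50 rows)
copyin(matrix(1:50, 1:50))

! Transfer rows 10-20 of every column (not contiguous)
copyin(matrix(10:20, 1:cols))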

Visual for 2D sections:

Original Matrix (100×100):
┌─────────────────────────┐
│ ████████                │ ← Transfer this block
│ ████████                │   (rows 1:8, cols 1:8)
│ ████████                │
│                         │
│                         │
└─────────────────────────┘
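As a runnable counterpart to the diagram, the sketch below transfers a block of whole columns. Whole columns are chosen because they are stored next to each other in Fortran's column-major layout. The names column_block, scaled, first_col, and last_col are made up for this example.

```fortran
program column_block
  ! Sketch: transfer only columns first_col..last_col of a matrix.
  ! The names scaled, first_col, and last_col are illustrative only.
  implicit none

  integer, parameter :: rows = 100, cols = 100
  integer, parameter :: first_col = 5, last_col = 15
  real :: matrix(rows, cols)
  real :: scaled(rows, first_col:last_col)   ! Same column numbering as matrix
  integer :: i, j

  ! Fill each column with its own column number
  do j = 1, cols
    do i = 1, rows
      matrix(i, j) = real(j)
    end do
  end do

  ! Whole columns are adjacent in memory, so this section is contiguous
  !$acc parallel loop copyin(matrix(1:rows, first_col:last_col)) &
  !$acc   copyout(scaled(1:rows, first_col:last_col))
  do j = first_col, last_col
    do i = 1, rows
      scaled(i, j) = 2.0 * matrix(i, j)
    end do
  end do
  !$acc end parallel loop

  write(*,'(A,F6.1)') 'scaled(1, 5)    = ', scaled(1, first_col)
  write(*,'(A,F6.1)') 'scaled(100, 15) = ', scaled(rows, last_col)
end program column_block
```

The two printed values should be 10.0 and 30.0, twice the column numbers 5 and 15. Columns 1 to 4 and 16 to 100 of matrix are never named in a data clause.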

Performance Impact

Good Practice: Transfer only needed data

! Working with first 1000 elements only
copyin(data(1:1000))  ! Fast: small transfer

Bad Practice: Transfer everything unnecessarily

! Working with first 1000 elements but transferring all
copyin(data(1:1000000))  ! Slow: huge unnecessary transfer
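One way to see the difference on your own machine is to time both variants. The following is a rough sketch, assuming wall-clock timing with the standard system_clock intrinsic is accurate enough for a first look. The names transfer_timing, samples, and results are hypothetical, and the measured times depend entirely on the hardware, so no particular numbers are claimed here.

```fortran
program transfer_timing
  ! Sketch: time the same loop with a whole-array copyin and a section copyin.
  ! The names samples and results are illustrative only.
  implicit none

  integer, parameter :: total_size = 1000000
  integer, parameter :: work_size = 1000
  real :: samples(total_size)
  real :: results(work_size)
  integer(kind=8) :: t0, t1, t2, rate
  integer :: i

  samples = 1.0

  ! Warm-up: the first GPU region pays a one-time start-up cost,
  ! so run one region before starting the clock
  !$acc parallel loop copyout(results(1:work_size))
  do i = 1, work_size
    results(i) = 0.0
  end do
  !$acc end parallel loop

  call system_clock(t0, rate)

  ! Whole array transferred, but only the first work_size elements are used
  !$acc parallel loop copyin(samples(1:total_size)) copyout(results(1:work_size))
  do i = 1, work_size
    results(i) = samples(i) + 1.0
  end do
  !$acc end parallel loop

  call system_clock(t1)

  ! Only the needed section transferred
  !$acc parallel loop copyin(samples(1:work_size)) copyout(results(1:work_size))
  do i = 1, work_size
    results(i) = samples(i) + 1.0
  end do
  !$acc end parallel loop

  call system_clock(t2)

  write(*,'(A,ES10.3,A)') 'Whole array copyin: ', real(t1 - t0) / real(rate), ' s'
  write(*,'(A,ES10.3,A)') 'Section copyin:     ', real(t2 - t1) / real(rate), ' s'
end program transfer_timing
```

Both loops do identical arithmetic, so any gap between the two timings comes from the data movement.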

Key Concepts to Remember

Copyin: Input data only (CPU → GPU)
Copyout: Output data only (GPU → CPU)
Copy: Input/output data (CPU ↔ GPU)
Array Sections: Transfer only the parts you need
Memory Efficiency: Smaller transfers mean less time spent moving data between CPU and GPU

Example Codes

Let us consider the following OpenACC code –

program vector_addition
  ! Perfect example of copyin and copyout usage
  ! Two input arrays (copyin) + one result array (copyout)
  
  implicit none
  
  integer, parameter :: n = 50000
  real :: vector_a(n)      ! Input array 1
  real :: vector_b(n)      ! Input array 2  
  real :: vector_sum(n)    ! Result array
  integer :: i
  
  write(*,*) 'Vector Addition with Explicit Data Clauses'
  write(*,*) '=========================================='
  write(*,*) ''
  
  ! Initialize input vectors on CPU
  write(*,*) 'Initializing input vectors on CPU...'
  do i = 1, n
    vector_a(i) = real(i) * 2.5
    vector_b(i) = real(i) * 1.8 + 10.0
  end do
  write(*,*) '✓ Input data ready on CPU'
  
  ! Clear result array (optional, but good practice)
  vector_sum = 0.0
  
  write(*,*) ''
  write(*,*) 'Performing vector addition on GPU...'
  write(*,*) 'Data movement plan:'
  write(*,*) '• vector_a(1:n) → GPU (copyin)'
  write(*,*) '• vector_b(1:n) → GPU (copyin)'  
  write(*,*) '• vector_sum(1:n) ← GPU (copyout)'
  
  ! Vector addition with explicit data movement
  !$acc parallel loop copyin(vector_a(1:n), vector_b(1:n)) copyout(vector_sum(1:n))
  do i = 1, n
    vector_sum(i) = vector_a(i) + vector_b(i)
  end do
  !$acc end parallel loop
  
  write(*,*) '✓ GPU computation complete!'
  write(*,*) '✓ Results copied back to CPU automatically!'
  
  ! Verify results
  write(*,*) ''
  write(*,*) 'Verification (first 10 elements):'
  do i = 1, 10
    write(*,'(A,I0,A,F8.2,A,F8.2,A,F8.2)') 'Element ', i, ': ', &
           vector_a(i), ' + ', vector_b(i), ' = ', vector_sum(i)
  end do
  
  ! Manual verification of a few elements
  write(*,*) ''
  write(*,*) 'Manual check:'
  write(*,'(A,F8.2,A,F8.2,A,F8.2)') 'vector_a(1) + vector_b(1) = ', &
         vector_a(1), ' + ', vector_b(1), ' = ', vector_a(1) + vector_b(1)
  write(*,'(A,F8.2)') 'GPU result vector_sum(1) = ', vector_sum(1)
  
  if (abs(vector_sum(1) - (vector_a(1) + vector_b(1))) < 1e-6) then
    write(*,*) '✓ Results are correct!'
  else
    write(*,*) '✗ Something went wrong!'
  end if
  
  write(*,*) ''
  write(*,*) 'Key points:'
  write(*,*) '• Input arrays were copied TO GPU (copyin)'
  write(*,*) '• Result array was copied FROM GPU (copyout)'
  write(*,*) '• No unnecessary data transfers!'
  
end program vector_addition

To compile this code –

nvfortran -acc -o vector_addition vector_addition.f90

To execute this code –

./vector_addition

Sample output –

 Vector Addition with Explicit Data Clauses
 ==========================================
 
 Initializing input vectors on CPU...
 ✓ Input data ready on CPU
 
 Performing vector addition on GPU...
 Data movement plan:
 • vector_a(1:n) → GPU (copyin)
 • vector_b(1:n) → GPU (copyin)
 • vector_sum(1:n) ← GPU (copyout)
 ✓ GPU computation complete!
 ✓ Results copied back to CPU automatically!
 
 Verification (first 10 elements):
Element 1:     2.50 +    11.80 =    14.30
Element 2:     5.00 +    13.60 =    18.60
Element 3:     7.50 +    15.40 =    22.90
Element 4:    10.00 +    17.20 =    27.20
Element 5:    12.50 +    19.00 =    31.50
Element 6:    15.00 +    20.80 =    35.80
Element 7:    17.50 +    22.60 =    40.10
Element 8:    20.00 +    24.40 =    44.40
Element 9:    22.50 +    26.20 =    48.70
Element 10:    25.00 +    28.00 =    53.00
 
 Manual check:
vector_a(1) + vector_b(1) =     2.50 +    11.80 =    14.30
GPU result vector_sum(1) =    14.30
 ✓ Results are correct!
 
 Key points:
 • Input arrays were copied TO GPU (copyin)
 • Result array was copied FROM GPU (copyout)
 • No unnecessary data transfers!

Let us consider another OpenACC code –

program array_sections
  ! Demonstrates efficient data transfer using array sections
  
  implicit none
  
  integer, parameter :: total_size = 100000
  integer, parameter :: work_size = 10000  ! We only work with first 10,000 elements
  real :: large_array(total_size)          ! Large array
  real :: input_section(work_size)         ! Section we actually use
  real :: output_section(work_size)        ! Results
  integer :: i
  
  write(*,*) 'Efficient Data Transfer with Array Sections'
  write(*,*) '==========================================='
  write(*,'(A,I0)') 'Total array size: ', total_size
  write(*,'(A,I0,A)') 'Working with only: ', work_size, ' elements'
  write(*,*) ''
  
  ! Initialize the large array
  write(*,*) 'Initializing large array...'
  do i = 1, total_size
    large_array(i) = real(i) * 3.14159 / 1000.0
  end do
  
  ! Copy only the section we need for input
  write(*,*) 'Copying working section from large array...'
  do i = 1, work_size
    input_section(i) = large_array(i)
  end do
  
  write(*,*) ''
  write(*,*) 'Method 1: INEFFICIENT - Transfer entire large array'
  write(*,*) '(This is what NOT to do!)'
  
  ! INEFFICIENT: Transfer the entire large array
  !$acc parallel loop copyin(large_array(1:total_size)) copyout(output_section(1:work_size))
  do i = 1, work_size  ! Only use first 10,000 elements!
    output_section(i) = sin(large_array(i)) * 2.0
  end do
  !$acc end parallel loop
  
  write(*,'(A,I0,A)') 'Transferred ', total_size, ' elements (wasteful!)'
  
  ! Reset output for fair comparison
  output_section = 0.0
  
  write(*,*) ''
  write(*,*) 'Method 2: EFFICIENT - Transfer only needed section'
  write(*,*) '(This is the right way!)'
  
  ! EFFICIENT: Transfer only the section we use
  !$acc parallel loop copyin(input_section(1:work_size)) copyout(output_section(1:work_size))
  do i = 1, work_size
    output_section(i) = sin(input_section(i)) * 2.0
  end do
  !$acc end parallel loop
  
  write(*,'(A,I0,A)') 'Transferred ', work_size, ' elements (efficient!)'
  
  ! Show results
  write(*,*) ''
  write(*,*) 'Sample results:'
  do i = 1, 5
    write(*,'(A,I0,A,F10.6,A,F10.6)') 'Element ', i, ': ', &
           input_section(i), ' → ', output_section(i)
  end do
  
  ! Calculate efficiency improvement
  write(*,*) ''
  write(*,*) 'Efficiency Analysis:'
  write(*,'(A,F6.2,A)') 'Inefficient method transferred ', &
         real(total_size) / real(work_size), 'x more data than needed!'
  write(*,'(A,I0,A)') 'Data transfer reduction: ', &
         total_size - work_size, ' fewer elements transferred'
  
  write(*,*) ''
  write(*,*) 'Key lesson: Always transfer only the data you actually use!'
  write(*,*) 'Array sections like array(1:n) are your friend for efficiency.'
  
end program array_sections

To compile this code –

nvfortran -acc -o array_sections array_sections.f90

To execute this code –

./array_sections

Sample output –

 Efficient Data Transfer with Array Sections
 ===========================================
Total array size: 100000
Working with only: 10000 elements
 
 Initializing large array...
 Copying working section from large array...
 
 Method 1: INEFFICIENT - Transfer entire large array
 (This is what NOT to do!)
Transferred 100000 elements (wasteful!)
 
 Method 2: EFFICIENT - Transfer only needed section
 (This is the right way!)
Transferred 10000 elements (efficient!)
 
 Sample results:
Element 1:   0.003142 →   0.006283
Element 2:   0.006283 →   0.012566
Element 3:   0.009425 →   0.018849
Element 4:   0.012566 →   0.025132
Element 5:   0.015708 →   0.031415
 
 Efficiency Analysis:
Inefficient method transferred  10.00x more data than needed!
Data transfer reduction: 90000 fewer elements transferred
 
 Key lesson: Always transfer only the data you actually use!
 Array sections like array(1:n) are your friend for efficiency.
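The program above stages the working data in a separate input_section array on the CPU before transferring it. An array section in the data clause can name the needed part of large_array directly, with no staging copy. The sketch below reuses the array names and sizes from the program above. The program name direct_section is made up for this illustration.

```fortran
program direct_section
  ! Sketch: transfer a section of large_array directly, no staging array.
  implicit none

  integer, parameter :: total_size = 100000
  integer, parameter :: work_size = 10000
  real :: large_array(total_size)      ! Large array, mostly unused on the GPU
  real :: output_section(work_size)    ! Results
  integer :: i

  do i = 1, total_size
    large_array(i) = real(i) * 3.14159 / 1000.0
  end do

  ! The section in copyin names the part of large_array to transfer;
  ! elements work_size+1 .. total_size are never sent to the GPU
  !$acc parallel loop copyin(large_array(1:work_size)) copyout(output_section(1:work_size))
  do i = 1, work_size
    output_section(i) = sin(large_array(i)) * 2.0
  end do
  !$acc end parallel loop

  write(*,'(A,F10.6)') 'output_section(1) = ', output_section(1)
end program direct_section
```

The loop body still indexes large_array with its original subscripts, and the first result should agree with the Element 1 value in the sample output above.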

Let us consider one more OpenACC code –

program copy_clause_example
  ! Demonstrates the 'copy' clause for data that's both input and output
  
  implicit none
  
  integer, parameter :: n = 20
  real :: data_array(n)    ! Array that will be both read and modified
  integer :: i
  
  write(*,*) 'Copy Clause Example - Modify Array In-Place'
  write(*,*) '=========================================='
  write(*,*) ''
  
  ! Initialize array with some data
  write(*,*) 'Initial array values:'
  do i = 1, n
    data_array(i) = real(i) * 5.0
    write(*,'(A,I0,A,F6.1)') 'data_array(', i, ') = ', data_array(i)
  end do
  
  write(*,*) ''
  write(*,*) 'Modifying array on GPU using COPY clause...'
  write(*,*) 'COPY means: send TO GPU, then bring BACK after modification'
  
  ! Use COPY clause because we both READ and WRITE the same array
  !$acc parallel loop copy(data_array(1:n))
  do i = 1, n
    ! Read the current value AND write a new value
    data_array(i) = data_array(i) * 2.0 + 10.0
  end do
  !$acc end parallel loop
  
  write(*,*) '✓ Array modified on GPU and copied back!'
  
  write(*,*) ''
  write(*,*) 'Modified array values:'
  do i = 1, n
    write(*,'(A,I0,A,F6.1)') 'data_array(', i, ') = ', data_array(i)
  end do
  
  write(*,*) ''
  write(*,*) 'What happened step by step:'
  write(*,*) '1. data_array copied FROM CPU TO GPU (input)'
  write(*,*) '2. GPU reads old values and computes new values'
  write(*,*) '3. data_array copied FROM GPU TO CPU (output)'
  write(*,*) ''
  write(*,*) 'This is why we used COPY instead of COPYIN or COPYOUT:'
  write(*,*) '• COPYIN only: array would not come back (lose results!)'
  write(*,*) '• COPYOUT only: array starts empty on GPU (wrong input!)'
  write(*,*) '• COPY: perfect for read-modify-write operations!'
  
  ! Demonstrate the transformation
  write(*,*) ''
  write(*,*) 'Transformation applied: new_value = old_value × 2.0 + 10.0'
  write(*,*) 'Examples:'
  write(*,*) '• 5.0 → 5.0 × 2.0 + 10.0 = 20.0'
  write(*,*) '• 10.0 → 10.0 × 2.0 + 10.0 = 30.0'
  write(*,*) '• 15.0 → 15.0 × 2.0 + 10.0 = 40.0'
  
end program copy_clause_example

To compile this code –

nvfortran -acc -o copy_clause_example copy_clause_example.f90

To execute this code –

./copy_clause_example

Sample output –

 Copy Clause Example - Modify Array In-Place
 ==========================================
 
 Initial array values:
data_array(1) =    5.0
data_array(2) =   10.0
data_array(3) =   15.0
data_array(4) =   20.0
data_array(5) =   25.0
data_array(6) =   30.0
data_array(7) =   35.0
data_array(8) =   40.0
data_array(9) =   45.0
data_array(10) =   50.0
data_array(11) =   55.0
data_array(12) =   60.0
data_array(13) =   65.0
data_array(14) =   70.0
data_array(15) =   75.0
data_array(16) =   80.0
data_array(17) =   85.0
data_array(18) =   90.0
data_array(19) =   95.0
data_array(20) =  100.0
 
 Modifying array on GPU using COPY clause...
 COPY means: send TO GPU, then bring BACK after modification
 ✓ Array modified on GPU and copied back!
 
 Modified array values:
data_array(1) =   20.0
data_array(2) =   30.0
data_array(3) =   40.0
data_array(4) =   50.0
data_array(5) =   60.0
data_array(6) =   70.0
data_array(7) =   80.0
data_array(8) =   90.0
data_array(9) =  100.0
data_array(10) =  110.0
data_array(11) =  120.0
data_array(12) =  130.0
data_array(13) =  140.0
data_array(14) =  150.0
data_array(15) =  160.0
data_array(16) =  170.0
data_array(17) =  180.0
data_array(18) =  190.0
data_array(19) =  200.0
data_array(20) =  210.0
 
 What happened step by step:
 1. data_array copied FROM CPU TO GPU (input)
 2. GPU reads old values and computes new values
 3. data_array copied FROM GPU TO CPU (output)
 
 This is why we used COPY instead of COPYIN or COPYOUT:
 • COPYIN only: array would not come back (lose results!)
 • COPYOUT only: array starts empty on GPU (wrong input!)
 • COPY: perfect for read-modify-write operations!
 
 Transformation applied: new_value = old_value × 2.0 + 10.0
 Examples:
 • 5.0 → 5.0 × 2.0 + 10.0 = 20.0
 • 10.0 → 10.0 × 2.0 + 10.0 = 30.0
 • 15.0 → 15.0 × 2.0 + 10.0 = 40.0


