Data Clauses – Create, Present, Delete

Imagine you’re cooking in a kitchen. Sometimes you need temporary bowls for mixing ingredients – you don’t bring them from another room, and you don’t take them back when done. You just need them for the cooking process! That’s exactly what CREATE, PRESENT, and DELETE clauses do for GPU memory.

Learn how to allocate device memory for Fortran arrays without transferring data, manage temporary work arrays for intermediate calculations, and control device memory explicitly.

The Problem: Temporary Arrays

Many scientific calculations need temporary workspace:

Step 1: temp = sqrt(input_data)
Step 2: result = temp * coefficient

Problem: We only need temp on the GPU, not on the host!

Without proper memory management:

❌ Host → Device: Transfer temp (unnecessary!)
✅ GPU: Calculate with temp
❌ Device → Host: Transfer temp back (wasteful!)
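
As a rough sketch of how those wasted transfers can arise, the program below places temp in a copy() clause. The program name, the array size, and the coefficient value are assumptions made for illustration; they do not come from the original example.

```fortran
program copy_temp_wasteful
  implicit none
  integer, parameter :: n = 1000          ! illustrative size
  real, parameter :: coefficient = 2.0    ! illustrative value
  real :: input_data(n), temp(n), result(n)
  integer :: i

  ! Fill the input on the host
  do i = 1, n
    input_data(i) = real(i)
  end do

  ! copy(temp) sends the uninitialized host temp to the device at region
  ! entry and copies it back at region exit, although the host never
  ! reads it. Both transfers are pure overhead.
  !$acc parallel loop copyin(input_data) copy(temp) copyout(result)
  do i = 1, n
    temp(i) = sqrt(input_data(i))         ! Step 1
    result(i) = temp(i) * coefficient     ! Step 2
  end do

  print *, 'result(4) =', result(4)
end program copy_temp_wasteful
```

The calculation itself is correct. Only the data movement for temp is unnecessary.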

Visual: Device Memory Allocation

Host Memory          Device Memory
┌─────────────┐     ┌─────────────┐
│ input_data  │────▶│ input_data  │ copyin()
│             │     │             │
│ result      │◀────│ result      │ copyout() 
│             │     │             │
│ temp_array  │  ✗  │ temp_array  │ create() ← No transfer!
│             │     │             │
└─────────────┘     └─────────────┘

Device memory allocated without a copy!
No unnecessary transfers!

Key Data Clauses

CREATE Clause

!$acc parallel loop create(temp_array)
  • Purpose: Allocate device memory WITHOUT data transfer
  • Use: Temporary arrays, intermediate calculations
  • Transfers: None (the host array is still declared, but it is never copied)

PRESENT Clause

!$acc parallel loop present(temp_array)
  • Purpose: Use data that ALREADY exists on device
  • Use: Multi-kernel operations on same data
  • Requirement: Data must already be on device, otherwise the program stops with a runtime error

DELETE Clause

!$acc exit data delete(temp_array)
  • Purpose: Explicitly remove data from device (no copy back to host)
  • Use: Manual memory cleanup for data placed on the device with !$acc enter data
  • Timing: When finished with the data
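
The three clauses are often spread across different routines. The following is a minimal sketch of that situation, assuming a hypothetical module workspace_lifetime with routines setup, compute, and teardown (none of these names appear in the original examples). The device copy is created in one routine, used in another, and deleted in a third.

```fortran
module workspace_lifetime
  implicit none
contains
  subroutine setup(work)
    real, intent(inout) :: work(:)
    !$acc enter data create(work)      ! allocate on device, no transfer
  end subroutine setup

  subroutine compute(input, work, output)
    real, intent(in)    :: input(:)
    real, intent(inout) :: work(:)
    real, intent(out)   :: output(:)
    integer :: i
    ! present(work) asserts that setup has already been called
    !$acc parallel loop present(work) copyin(input) copyout(output)
    do i = 1, size(input)
      work(i) = sqrt(input(i))         ! initialize before reading
      output(i) = work(i) * 2.0
    end do
  end subroutine compute

  subroutine teardown(work)
    real, intent(inout) :: work(:)
    !$acc exit data delete(work)       ! free device copy, no transfer
  end subroutine teardown
end module workspace_lifetime

program lifetime_demo
  use workspace_lifetime
  implicit none
  integer, parameter :: n = 1000       ! illustrative size
  real :: input(n), work(n), output(n)
  integer :: i

  do i = 1, n
    input(i) = real(i)
  end do

  call setup(work)
  call compute(input, work, output)
  call teardown(work)
  print *, 'output(9) =', output(9)
end program lifetime_demo
```

Calling compute without a prior setup would be expected to fail at run time, because present(work) would find no device copy.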

Memory Management Workflow

CREATE Phase:
┌─────────────────┐
│ Allocate device │
│ memory for temp │
│ (no data copy)  │
└─────────────────┘
         │
         ▼
USE Phase (PRESENT):
┌─────────────────┐
│ Use temp in     │
│ calculations    │
│ (already there) │
└─────────────────┘
         │
         ▼
DELETE Phase:
┌─────────────────┐
│ Free device     │
│ memory for temp │
│ (cleanup)       │
└─────────────────┘

Fortran Examples

Basic CREATE Pattern

program simple_create
  implicit none
  integer :: i
  real :: input(1000), output(1000), temp(1000)

  ! Fill the input on the host (illustrative values)
  do i = 1, 1000
    input(i) = real(i)
  end do

  !$acc parallel loop copyin(input) copyout(output) create(temp)
  do i = 1, 1000
    temp(i) = sqrt(input(i))        ! Step 1: Initialize temp
    output(i) = temp(i) * 2.0       ! Step 2: Use temp
  end do
  !$acc end parallel loop
  ! device copy of temp automatically freed here
end program simple_create

Multi-Kernel with PRESENT

! Step 1: CREATE workspace with an unstructured lifetime
! (create on the parallel loop itself would free workspace when that loop ends)
!$acc enter data create(workspace)

! Step 2: PRESENT - fill workspace
!$acc parallel loop copyin(data) present(workspace)
do i = 1, n
  workspace(i) = sin(data(i))
end do
!$acc end parallel loop

! Step 3: PRESENT - reuse workspace
!$acc parallel loop present(workspace) copyout(result)
do i = 1, n
  result(i) = workspace(i) * workspace(i)
end do
!$acc end parallel loop

! Step 4: Manual cleanup
!$acc exit data delete(workspace)

Common Usage Patterns

Pattern 1: Single-Kernel Temporary

!$acc parallel loop create(temp_workspace)
do i = 1, n
  temp_workspace(i) = expensive_function(input(i))
  output(i) = temp_workspace(i) + offset
end do
!$acc end parallel loop
! temp_workspace automatically freed

Pattern 2: Data Region with Workspace

!$acc data copyin(input) copyout(result) create(workspace)

  ! First calculation
  !$acc parallel loop
  do i = 1, n
    workspace(i) = preprocess(input(i))
  end do
  !$acc end parallel loop

  ! Second calculation using workspace
  !$acc parallel loop  
  do i = 1, n
    result(i) = postprocess(workspace(i))
  end do
  !$acc end parallel loop

!$acc end data  ! workspace automatically freed

Memory Efficiency Benefits

Method            Host Memory Usage          Device Transfers of temp    Performance
Without CREATE    temp declared and copied   2 (in and out)              Slower
With CREATE       temp declared, unused      None                        Faster

Error Prevention

❌ Common Mistake: Uninitialized CREATE

!$acc parallel loop create(temp)
do i = 1, n
  result(i) = temp(i) + input(i)  ! ERROR: temp has garbage!
end do
!$acc end parallel loop

✅ Correct: Initialize Before Use

!$acc parallel loop copyin(input) create(temp)  
do i = 1, n
  temp(i) = initialize_value(i)   ! Initialize first!
  result(i) = temp(i) + input(i)  ! Then use
end do
!$acc end parallel loop

Advanced: Allocatable Arrays

program allocatable_create
  implicit none
  integer, parameter :: n = 50000
  real, allocatable :: workspace(:)
  real :: data(n), result(n)
  integer :: i

  allocate(workspace(n))  ! Allocate on host first so create() knows the shape

  ! Fill the input on the host (illustrative values)
  do i = 1, n
    data(i) = real(i)
  end do

  !$acc data copyin(data) copyout(result) create(workspace)
    !$acc parallel loop
    do i = 1, n
      workspace(i) = sqrt(data(i))      ! Initialize workspace
      result(i) = workspace(i) * 2.0    ! Use workspace
    end do
    !$acc end parallel loop
  !$acc end data  ! workspace freed on device

  deallocate(workspace)   ! Free host memory
end program allocatable_create

Performance Comparison

Traditional approach (inefficient):

Memory usage: Host temp + Device temp
Transfer cost: temp copied Host→Device and Device→Host = 2 unnecessary transfers

CREATE approach (efficient):

Memory usage: Host temp (declared but unused) + Device temp
Transfer cost: Only input and result are copied; temp is never transferred

Best Practices

DO:

  • Use CREATE for device-only temporary arrays
  • Initialize CREATE arrays before reading from them
  • Use PRESENT for multi-kernel data reuse
  • Combine CREATE with data regions for efficiency

DON’T:

  • Use CREATE for data you need on host
  • Read from uninitialized CREATE arrays
  • Forget that CREATE arrays contain garbage initially
  • Use CREATE for permanent data storage
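
To illustrate the first DON'T item, here is a small sketch that contrasts a CREATE array with a COPYOUT array. The names scratch and kept and the marker value -1.0 are assumptions chosen for this illustration.

```fortran
program create_vs_copyout
  implicit none
  integer, parameter :: n = 1000   ! illustrative size
  real :: input(n), scratch(n), kept(n)
  integer :: i

  do i = 1, n
    input(i) = real(i)
  end do
  scratch = -1.0                   ! host-side marker value

  ! scratch is device-only workspace; kept is a result the host needs.
  !$acc parallel loop copyin(input) create(scratch) copyout(kept)
  do i = 1, n
    scratch(i) = input(i) * input(i)
    kept(i) = scratch(i) + 1.0
  end do

  ! copyout brought kept back to the host. create did not update the
  ! host copy of scratch, so with separate host and device memory it
  ! should still hold the marker value.
  print *, 'kept(3)    =', kept(3)
  print *, 'scratch(3) =', scratch(3)
end program create_vs_copyout
```

When the code is built without OpenACC, or with managed or unified memory, the host scratch array will instead hold the computed values. The contrast is only visible when the host and the device keep separate copies.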

Example Code

Let us consider the following OpenACC code –

program device_memory_management
  ! Demonstrates CREATE, PRESENT, and DELETE clauses for device memory management
  
  implicit none
  
  integer, parameter :: n = 8000
  real :: input_data(n), final_result(n)
  real :: workspace(n)  ! Temporary array for intermediate calculations (host copy is never transferred)
  integer :: i
  
  write(*,*) 'Device Memory Management: CREATE, PRESENT, DELETE'
  write(*,*) '==============================================='
  write(*,*) ''
  
  ! Initialize input data
  do i = 1, n
    input_data(i) = real(i) * 0.001
  end do
  
  write(*,*) 'Demonstration 1: CREATE clause'
  write(*,*) '(Temporary workspace allocated only on device)'
  write(*,*) ''
  
  ! Method 1: CREATE clause for temporary workspace
  !$acc parallel loop copyin(input_data) copyout(final_result) create(workspace)
  do i = 1, n
    ! Initialize workspace (important: CREATE arrays contain garbage!)
    workspace(i) = sqrt(input_data(i)) + sin(input_data(i))
    
    ! Use workspace for final calculation
    final_result(i) = workspace(i) * 2.0 + 1.0
  end do
  !$acc end parallel loop
  ! workspace automatically freed here
  
  write(*,'(A,F8.4)') '   Sample result: ', final_result(1000)
  write(*,*) '   ✓ workspace created on device only (no host memory used)'
  write(*,*) '   ✓ No unnecessary data transfers'
  write(*,*) '   ✓ Automatic cleanup when parallel region ends'
  write(*,*) ''
  
  ! Reset for next demonstration
  final_result = 0.0
  
  write(*,*) 'Demonstration 2: PRESENT clause'
  write(*,*) '(Multi-kernel operations with persistent device data)'
  write(*,*) ''
  
  ! Method 2: Data region with PRESENT clause
  !$acc data copyin(input_data) copyout(final_result) create(workspace)
  
    write(*,*) '   Kernel 1: Initialize workspace with PRESENT'
    !$acc parallel loop present(input_data, workspace)
    do i = 1, n
      workspace(i) = input_data(i) * input_data(i)  ! Square the input
    end do
    !$acc end parallel loop
    
    write(*,*) '   Kernel 2: Process workspace with PRESENT'
    !$acc parallel loop present(workspace, final_result)
    do i = 1, n
      final_result(i) = sqrt(workspace(i)) + 0.5   ! Square root plus offset
    end do
    !$acc end parallel loop
    
    write(*,*) '   Kernel 3: Final processing with PRESENT'
    !$acc parallel loop present(workspace, final_result)
    do i = 1, n
      final_result(i) = final_result(i) + workspace(i) * 0.1
    end do
    !$acc end parallel loop
    
  !$acc end data  ! workspace automatically deleted here
  
  write(*,'(A,F8.4)') '   Sample result: ', final_result(2000)
  write(*,*) '   ✓ workspace reused across multiple kernels'
  write(*,*) '   ✓ No data transfers between kernels'
  write(*,*) '   ✓ Efficient multi-stage calculations'
  write(*,*) ''
  
  write(*,*) 'Demonstration 3: Manual DELETE clause'
  write(*,*) '(Explicit device memory management)'
  write(*,*) ''
  
  ! Method 3: Manual memory management with DELETE
  write(*,*) '   Creating data on device...'
  !$acc enter data copyin(input_data) create(workspace)
  
  write(*,*) '   Processing with manually managed memory...'
  !$acc parallel loop present(input_data, workspace)
  do i = 1, n
    workspace(i) = log(input_data(i) + 1.0)
  end do
  !$acc end parallel loop
  
  write(*,*) '   Manually freeing workspace with DELETE...'
  !$acc exit data delete(workspace)
  
  write(*,*) '   Cleaning up remaining data...'
  !$acc exit data delete(input_data)
  
  write(*,*) '   ✓ Explicit control over device memory lifetime'
  write(*,*) '   ✓ Manual cleanup with DELETE clause'
  write(*,*) ''
  
  write(*,*) 'Memory Management Summary:'
  write(*,*) '========================'
  write(*,*) 'CREATE:  Allocate device memory (no data transfer)'
  write(*,*) '         - Use for temporary workspace arrays'
  write(*,*) '         - Saves host memory'
  write(*,*) '         - Remember to initialize before use!'
  write(*,*) ''
  write(*,*) 'PRESENT: Use existing device data'
  write(*,*) '         - For multi-kernel operations'
  write(*,*) '         - Data must already be on device'
  write(*,*) '         - Avoids redundant transfers'
  write(*,*) ''
  write(*,*) 'DELETE:  Explicitly free device memory'
  write(*,*) '         - For manual memory control'
  write(*,*) '         - Use when data lifetime is complex'
  write(*,*) '         - Pairs with enter data create()'
  write(*,*) ''
  write(*,*) 'Performance Benefits:'
  write(*,*) '• Reduced memory transfers'
  write(*,*) '• Lower host memory usage'
  write(*,*) '• Better cache utilization'
  write(*,*) '• Optimal for scientific computing'
  
end program device_memory_management

To compile this code –

nvfortran -acc -Minfo=accel -O2 device_memory_management.f90 -o device_memory_management

To execute this code –

./device_memory_management

Sample output –

 Device Memory Management: CREATE, PRESENT, DELETE
 ===============================================
 
 Demonstration 1: CREATE clause
 (Temporary workspace allocated only on device)
 
   Sample result:   4.6829
    ✓ workspace created on device only (no host memory used)
    ✓ No unnecessary data transfers
    ✓ Automatic cleanup when parallel region ends
 
 Demonstration 2: PRESENT clause
 (Multi-kernel operations with persistent device data)
 
    Kernel 1: Initialize workspace with PRESENT
    Kernel 2: Process workspace with PRESENT
    Kernel 3: Final processing with PRESENT
   Sample result:   2.9000
    ✓ workspace reused across multiple kernels
    ✓ No data transfers between kernels
    ✓ Efficient multi-stage calculations
 
 Demonstration 3: Manual DELETE clause
 (Explicit device memory management)
 
    Creating data on device...
    Processing with manually managed memory...
    Manually freeing workspace with DELETE...
    Cleaning up remaining data...
    ✓ Explicit control over device memory lifetime
    ✓ Manual cleanup with DELETE clause
 
 Memory Management Summary:
 ========================
 CREATE:  Allocate device memory (no data transfer)
          - Use for temporary workspace arrays
          - Saves host memory
          - Remember to initialize before use!
 
 PRESENT: Use existing device data
          - For multi-kernel operations
          - Data must already be on device
          - Avoids redundant transfers
 
 DELETE:  Explicitly free device memory
          - For manual memory control
          - Use when data lifetime is complex
          - Pairs with enter data create()
 
 Performance Benefits:
 • Reduced memory transfers
 • Lower host memory usage
 • Better cache utilization
 • Optimal for scientific computing
