Imagine you’re cooking in a kitchen. Sometimes you need temporary bowls for mixing ingredients: you don’t bring them from another room, and you don’t take them back when done. You only need them while cooking! That’s the role of the CREATE, PRESENT, and DELETE clauses for GPU memory: CREATE sets out an empty bowl on the device, PRESENT reuses a bowl that is already there, and DELETE clears it away.
Learn how to allocate device memory for Fortran arrays without transferring data, manage temporary work arrays for intermediate calculations, and control device memory explicitly.
The Problem: Temporary Arrays
Many scientific calculations need temporary workspace:
Step 1: temp = sqrt(input_data)
Step 2: result = temp * coefficient
Problem: We only need temp on the GPU, not on the host!
Without proper memory management:
❌ Host → Device: Transfer temp (unnecessary!)
✅ GPU: Calculate with temp
❌ Device → Host: Transfer temp back (wasteful!)
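For contrast, the wasteful pattern can be sketched with a copy(temp) clause, which moves temp in both directions even though the host never reads it. This is a minimal sketch: the module name wasteful_copy_demo, the subroutine scale_sqrt_with_copy, and the coefficient argument are illustrative assumptions, not part of the tutorial. If the file is compiled without OpenACC enabled, the directives are plain comments and the loop simply runs on the host.

```fortran
module wasteful_copy_demo
  implicit none
contains
  ! Computes result = sqrt(input_data) * coefficient, staging the
  ! intermediate values in temp.
  subroutine scale_sqrt_with_copy(input_data, coefficient, temp, result)
    real, intent(in)    :: input_data(:)
    real, intent(in)    :: coefficient
    real, intent(inout) :: temp(:)      ! scratch space only
    real, intent(out)   :: result(:)
    integer :: i

    ! copy(temp) transfers temp host -> device on entry and
    ! device -> host on exit; neither transfer is needed here.
    !$acc parallel loop copyin(input_data) copy(temp) copyout(result)
    do i = 1, size(input_data)
      temp(i)   = sqrt(input_data(i))    ! Step 1
      result(i) = temp(i) * coefficient  ! Step 2
    end do
    !$acc end parallel loop
  end subroutine scale_sqrt_with_copy
end module wasteful_copy_demo
```

Replacing copy(temp) with create(temp) in this sketch leaves the result unchanged and removes both transfers of temp.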
Visual: Device Memory Allocation
 Host Memory          Device Memory
┌─────────────┐     ┌─────────────┐
│ input_data  │────▶│ input_data  │ copyin()
│             │     │             │
│ result      │◀────│ result      │ copyout()
│             │     │             │
│ temp_array  │  ✗  │ temp_array  │ create() ← Never transferred!
│ (untouched) │     │             │
└─────────────┘     └─────────────┘
Device copy allocated, nothing copied!
No unnecessary transfers!
Key Data Clauses
CREATE Clause
!$acc parallel loop create(temp_array)
- Purpose: Allocate device memory WITHOUT data transfer
- Use: Temporary arrays, intermediate calculations
- Memory: Allocated on the device without a copy; the host array still exists as declared but is never transferred
PRESENT Clause
!$acc parallel loop present(temp_array)
- Purpose: Use data that ALREADY exists on device
- Use: Multi-kernel operations on same data
- Requirement: Data must already be on device; if it is not, the program stops with a runtime error
DELETE Clause
!$acc exit data delete(temp_array)
- Purpose: Explicitly remove data from device without copying it back
- Use: Manual memory cleanup on an exit data directive
- Timing: When finished with the data that an earlier enter data placed on the device
Memory Management Workflow
CREATE Phase:
┌─────────────────┐
│ Allocate device │
│ memory for temp │
│ (no data copy) │
└─────────────────┘
│
▼
USE Phase (PRESENT):
┌─────────────────┐
│ Use temp in │
│ calculations │
│ (already there) │
└─────────────────┘
│
▼
DELETE Phase:
┌─────────────────┐
│ Free device │
│ memory for temp │
│ (cleanup) │
└─────────────────┘
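The three phases above can be sketched as three procedures, so that the device copy of the workspace outlives any single kernel. This is a sketch under assumptions: the module workspace_lifecycle and its procedure names are hypothetical, and the same array must be passed to all three calls for the PRESENT phase to find it.

```fortran
module workspace_lifecycle
  implicit none
contains
  ! CREATE phase: allocate the device copy of workspace; nothing is copied.
  subroutine workspace_begin(workspace)
    real, intent(inout) :: workspace(:)
    !$acc enter data create(workspace)
  end subroutine workspace_begin

  ! USE phase: the data region requires workspace to be on the device already.
  subroutine workspace_compute(input, workspace, output)
    real, intent(in)    :: input(:)
    real, intent(inout) :: workspace(:)
    real, intent(out)   :: output(:)
    integer :: i

    !$acc data copyin(input) copyout(output) present(workspace)
    !$acc parallel loop
    do i = 1, size(input)
      workspace(i) = input(i) * input(i)   ! initialize before reading
    end do
    !$acc end parallel loop

    !$acc parallel loop
    do i = 1, size(input)
      output(i) = workspace(i) + 1.0
    end do
    !$acc end parallel loop
    !$acc end data
  end subroutine workspace_compute

  ! DELETE phase: release the device copy without copying it back.
  subroutine workspace_end(workspace)
    real, intent(inout) :: workspace(:)
    !$acc exit data delete(workspace)
  end subroutine workspace_end
end module workspace_lifecycle
```

A caller would invoke workspace_begin, then workspace_compute as often as needed, then workspace_end, always with the same workspace array.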
Fortran Examples
Basic CREATE Pattern
program simple_create
  implicit none
  real :: input(1000), output(1000), temp(1000)
  integer :: i
  ! Give the kernel defined input values
  do i = 1, 1000
    input(i) = real(i)
  end do
  !$acc parallel loop copyin(input) copyout(output) create(temp)
  do i = 1, 1000
    temp(i) = sqrt(input(i))   ! Step 1: Initialize temp
    output(i) = temp(i) * 2.0  ! Step 2: Use temp
  end do
  !$acc end parallel loop
  ! device copy of temp automatically freed here
end program
Multi-Kernel with PRESENT
! Step 1: CREATE workspace with enter data so it stays on the device
! after the first kernel ends (create on the kernel itself would free it)
!$acc enter data create(workspace)
!$acc parallel loop copyin(data) present(workspace)
do i = 1, n
  workspace(i) = sin(data(i))
end do
!$acc end parallel loop
! Step 2: PRESENT - reuse workspace
!$acc parallel loop present(workspace) copyout(result)
do i = 1, n
  result(i) = workspace(i) * workspace(i)
end do
!$acc end parallel loop
! Step 3: Manual cleanup
!$acc exit data delete(workspace)
Common Usage Patterns
Pattern 1: Single-Kernel Temporary
!$acc parallel loop create(temp_workspace)
do i = 1, n
  temp_workspace(i) = expensive_function(input(i))
  output(i) = temp_workspace(i) + offset
end do
!$acc end parallel loop
! device copy of temp_workspace automatically freed
Pattern 2: Data Region with Workspace
!$acc data copyin(input) copyout(result) create(workspace)
! First calculation
!$acc parallel loop
do i = 1, n
  workspace(i) = preprocess(input(i))
end do
!$acc end parallel loop
! Second calculation using workspace
!$acc parallel loop
do i = 1, n
  result(i) = postprocess(workspace(i))
end do
!$acc end parallel loop
!$acc end data ! workspace automatically freed
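Pattern 2 calls preprocess and postprocess from inside device loops. For such calls to be compiled for the device, the functions generally need an acc routine directive. The sketch below shows one way to arrange this. The function bodies are hypothetical stand-ins, since the tutorial does not define them, and the module and subroutine names are likewise assumptions.

```fortran
module workspace_pipeline
  implicit none
contains
  ! Hypothetical stand-in for the tutorial's preprocess().
  pure real function preprocess(x)
    !$acc routine seq
    real, intent(in) :: x
    preprocess = x * x
  end function preprocess

  ! Hypothetical stand-in for the tutorial's postprocess().
  pure real function postprocess(w)
    !$acc routine seq
    real, intent(in) :: w
    postprocess = sqrt(w) + 0.5
  end function postprocess

  subroutine run_pipeline(input, result)
    real, intent(in)  :: input(:)
    real, intent(out) :: result(:)
    real, allocatable :: workspace(:)
    integer :: i

    allocate(workspace(size(input)))   ! host allocation, sized at run time
    !$acc data copyin(input) copyout(result) create(workspace)
    !$acc parallel loop
    do i = 1, size(input)
      workspace(i) = preprocess(input(i))
    end do
    !$acc end parallel loop

    !$acc parallel loop
    do i = 1, size(input)
      result(i) = postprocess(workspace(i))
    end do
    !$acc end parallel loop
    !$acc end data                     ! device copy of workspace freed
    deallocate(workspace)
  end subroutine run_pipeline
end module workspace_pipeline
```

The seq form marks each function as running sequentially within a single device thread, which fits simple element-wise helpers like these.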
Memory Efficiency Benefits
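Method | Host Memory Usage | Transfers of temp | Performance |
---|---|---|---|
Without CREATE (copy) | Host copy of temp required | 2 (host→device, device→host) | Slower: pays for both transfers |
With CREATE | Host copy of temp declared but untouched | 0 | Faster: no transfer cost for temp |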
Error Prevention
❌ Common Mistake: Uninitialized CREATE
!$acc parallel loop create(temp)
do i = 1, n
  result(i) = temp(i) + input(i) ! BUG: temp is uninitialized on the device!
end do
!$acc end parallel loop
✅ Correct: Initialize Before Use
!$acc parallel loop copyin(input) create(temp)
do i = 1, n
  temp(i) = initialize_value(i)  ! Initialize first!
  result(i) = temp(i) + input(i) ! Then use
end do
!$acc end parallel loop
Advanced: Allocatable Arrays
program allocatable_create
  implicit none
  integer, parameter :: n = 50000
  real, allocatable :: workspace(:)
  real :: data(n), result(n)
  integer :: i
  call random_number(data)           ! Give the kernel defined input values
  allocate(workspace(n))             ! Allocate on host
  !$acc data copyin(data) copyout(result) create(workspace)
  !$acc parallel loop
  do i = 1, n
    workspace(i) = sqrt(data(i))     ! Initialize workspace
    result(i) = workspace(i) * 2.0   ! Use workspace
  end do
  !$acc end parallel loop
  !$acc end data                     ! workspace freed on device
  deallocate(workspace)              ! Free host memory
end program
Performance Comparison
Traditional approach with copy(temp) (inefficient):
Memory usage: Host temp + Device temp
Transfer cost: temp moves Host→Device and Device→Host = 2 unnecessary transfers
CREATE approach (efficient):
Memory usage: Host temp (declared but untouched) + Device temp
Transfer cost: Only input and result move = 0 transfers for temp
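One way to check this comparison on a given machine is to time the same kernel with copy(temp) and with create(temp). The function below is a rough sketch, not a benchmark: timed_run and its arguments are hypothetical names, it uses only the standard system_clock intrinsic, and any difference shows up only when the code is built with OpenACC and run on an accelerator.

```fortran
module transfer_timing_demo
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
contains
  ! Computes output = sqrt(input) * 2 using temp as scratch space, with
  ! either create(temp) or copy(temp), and returns the elapsed seconds.
  function timed_run(input, temp, output, use_create) result(seconds)
    real, intent(in)    :: input(:)
    real, intent(inout) :: temp(:)
    real, intent(out)   :: output(:)
    logical, intent(in) :: use_create
    real :: seconds
    integer(int64) :: t0, t1, rate
    integer :: i

    call system_clock(t0, rate)
    if (use_create) then
      !$acc parallel loop copyin(input) copyout(output) create(temp)
      do i = 1, size(input)
        temp(i)   = sqrt(input(i))
        output(i) = temp(i) * 2.0
      end do
      !$acc end parallel loop
    else
      !$acc parallel loop copyin(input) copyout(output) copy(temp)
      do i = 1, size(input)
        temp(i)   = sqrt(input(i))
        output(i) = temp(i) * 2.0
      end do
      !$acc end parallel loop
    end if
    call system_clock(t1)
    seconds = real(t1 - t0) / real(rate)
  end function timed_run
end module transfer_timing_demo
```

The first call on a device typically includes start-up cost, so it is reasonable to call timed_run once as a warm-up and compare only later calls.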
Best Practices
✅ DO:
- Use CREATE for device-only temporary arrays
- Initialize CREATE arrays before reading from them
- Use PRESENT for multi-kernel data reuse
- Combine CREATE with data regions for efficiency
❌ DON’T:
- Use CREATE for data you need on host
- Read from CREATE arrays before writing to them (they contain garbage initially)
- Use CREATE for results that must outlive the data region
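If intermediate values in a CREATE array are needed on the host after all, for example while debugging, they have to be fetched explicitly, because create never copies anything back. The sketch below uses the update host directive for this. The module and subroutine names are illustrative assumptions.

```fortran
module peek_workspace_demo
  implicit none
contains
  ! Computes output = sqrt(input) * 2 via workspace, and refreshes the
  ! host copy of workspace midway so it can be inspected afterwards.
  subroutine compute_and_peek(input, workspace, output)
    real, intent(in)    :: input(:)
    real, intent(inout) :: workspace(:)
    real, intent(out)   :: output(:)
    integer :: i

    !$acc data copyin(input) copyout(output) create(workspace)
    !$acc parallel loop
    do i = 1, size(input)
      workspace(i) = sqrt(input(i))
    end do
    !$acc end parallel loop

    ! create() never copies back, so fetch the values explicitly.
    !$acc update host(workspace)

    !$acc parallel loop
    do i = 1, size(input)
      output(i) = workspace(i) * 2.0
    end do
    !$acc end parallel loop
    !$acc end data
  end subroutine compute_and_peek
end module peek_workspace_demo
```

If the host needs the values on every run rather than occasionally, a copyout clause on the data region is the simpler choice.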
Example Code
Let us consider the following OpenACC code –
program device_memory_management
  ! Demonstrates CREATE, PRESENT, and DELETE clauses for device memory management
  implicit none
  integer, parameter :: n = 8000
  real :: input_data(n), final_result(n)
  real :: workspace(n) ! Temporary array for intermediate calculations
  integer :: i

  write(*,*) 'Device Memory Management: CREATE, PRESENT, DELETE'
  write(*,*) '==============================================='
  write(*,*) ''

  ! Initialize input data
  do i = 1, n
    input_data(i) = real(i) * 0.001
  end do

  write(*,*) 'Demonstration 1: CREATE clause'
  write(*,*) '(Temporary workspace allocated only on device)'
  write(*,*) ''

  ! Method 1: CREATE clause for temporary workspace
  !$acc parallel loop copyin(input_data) copyout(final_result) create(workspace)
  do i = 1, n
    ! Initialize workspace (important: CREATE arrays contain garbage!)
    workspace(i) = sqrt(input_data(i)) + sin(input_data(i))
    ! Use workspace for final calculation
    final_result(i) = workspace(i) * 2.0 + 1.0
  end do
  !$acc end parallel loop
  ! workspace automatically freed here

  write(*,'(A,F8.4)') ' Sample result: ', final_result(1000)
  write(*,*) ' ✓ workspace created on device only (no host memory used)'
  write(*,*) ' ✓ No unnecessary data transfers'
  write(*,*) ' ✓ Automatic cleanup when parallel region ends'
  write(*,*) ''

  ! Reset for next demonstration
  final_result = 0.0

  write(*,*) 'Demonstration 2: PRESENT clause'
  write(*,*) '(Multi-kernel operations with persistent device data)'
  write(*,*) ''

  ! Method 2: Data region with PRESENT clause
  !$acc data copyin(input_data) copyout(final_result) create(workspace)
  write(*,*) ' Kernel 1: Initialize workspace with PRESENT'
  !$acc parallel loop present(input_data, workspace)
  do i = 1, n
    workspace(i) = input_data(i) * input_data(i) ! Square the input
  end do
  !$acc end parallel loop

  write(*,*) ' Kernel 2: Process workspace with PRESENT'
  !$acc parallel loop present(workspace, final_result)
  do i = 1, n
    final_result(i) = sqrt(workspace(i)) + 0.5 ! Square root plus offset
  end do
  !$acc end parallel loop

  write(*,*) ' Kernel 3: Final processing with PRESENT'
  !$acc parallel loop present(workspace, final_result)
  do i = 1, n
    final_result(i) = final_result(i) + workspace(i) * 0.1
  end do
  !$acc end parallel loop
  !$acc end data ! workspace automatically deleted here

  write(*,'(A,F8.4)') ' Sample result: ', final_result(2000)
  write(*,*) ' ✓ workspace reused across multiple kernels'
  write(*,*) ' ✓ No data transfers between kernels'
  write(*,*) ' ✓ Efficient multi-stage calculations'
  write(*,*) ''

  write(*,*) 'Demonstration 3: Manual DELETE clause'
  write(*,*) '(Explicit device memory management)'
  write(*,*) ''

  ! Method 3: Manual memory management with DELETE
  write(*,*) ' Creating data on device...'
  !$acc enter data copyin(input_data) create(workspace)

  write(*,*) ' Processing with manually managed memory...'
  !$acc parallel loop present(input_data, workspace)
  do i = 1, n
    workspace(i) = log(input_data(i) + 1.0)
  end do
  !$acc end parallel loop

  write(*,*) ' Manually freeing workspace with DELETE...'
  !$acc exit data delete(workspace)
  write(*,*) ' Cleaning up remaining data...'
  !$acc exit data delete(input_data)

  write(*,*) ' ✓ Explicit control over device memory lifetime'
  write(*,*) ' ✓ Manual cleanup with DELETE clause'
  write(*,*) ''

  write(*,*) 'Memory Management Summary:'
  write(*,*) '========================'
  write(*,*) 'CREATE: Allocate device memory (no data transfer)'
  write(*,*) ' - Use for temporary workspace arrays'
  write(*,*) ' - Saves host memory'
  write(*,*) ' - Remember to initialize before use!'
  write(*,*) ''
  write(*,*) 'PRESENT: Use existing device data'
  write(*,*) ' - For multi-kernel operations'
  write(*,*) ' - Data must already be on device'
  write(*,*) ' - Avoids redundant transfers'
  write(*,*) ''
  write(*,*) 'DELETE: Explicitly free device memory'
  write(*,*) ' - For manual memory control'
  write(*,*) ' - Use when data lifetime is complex'
  write(*,*) ' - Pairs with enter data create()'
  write(*,*) ''
  write(*,*) 'Performance Benefits:'
  write(*,*) '• Reduced memory transfers'
  write(*,*) '• Lower host memory usage'
  write(*,*) '• Better cache utilization'
  write(*,*) '• Optimal for scientific computing'
end program device_memory_management
To compile this code –
nvfortran -acc -Minfo=accel -O2 device_memory_management.f90 -o device_memory_management
To execute this code –
./device_memory_management
Sample output –
Device Memory Management: CREATE, PRESENT, DELETE
===============================================
Demonstration 1: CREATE clause
(Temporary workspace allocated only on device)
Sample result: 4.6829
✓ workspace created on device only (no host memory used)
✓ No unnecessary data transfers
✓ Automatic cleanup when parallel region ends
Demonstration 2: PRESENT clause
(Multi-kernel operations with persistent device data)
Kernel 1: Initialize workspace with PRESENT
Kernel 2: Process workspace with PRESENT
Kernel 3: Final processing with PRESENT
Sample result: 2.9000
✓ workspace reused across multiple kernels
✓ No data transfers between kernels
✓ Efficient multi-stage calculations
Demonstration 3: Manual DELETE clause
(Explicit device memory management)
Creating data on device...
Processing with manually managed memory...
Manually freeing workspace with DELETE...
Cleaning up remaining data...
✓ Explicit control over device memory lifetime
✓ Manual cleanup with DELETE clause
Memory Management Summary:
========================
CREATE: Allocate device memory (no data transfer)
- Use for temporary workspace arrays
- Saves host memory
- Remember to initialize before use!
PRESENT: Use existing device data
- For multi-kernel operations
- Data must already be on device
- Avoids redundant transfers
DELETE: Explicitly free device memory
- For manual memory control
- Use when data lifetime is complex
- Pairs with enter data create()
Performance Benefits:
• Reduced memory transfers
• Lower host memory usage
• Better cache utilization
• Optimal for scientific computing
References
- OpenACC Specification: https://www.openacc.org/specification