Think of multi-dimensional arrays like a bookshelf – in Fortran, books are arranged column by column (column-major), while in C they’re arranged row by row (row-major). Understanding this layout is crucial for GPU performance because accessing data in the wrong order is like reaching across the entire bookshelf instead of taking books from the same shelf.
Learn the implications of column-major order, master 2D/3D array transfers, understand Fortran's memory layout, and see the performance consequences illustrated with visual memory-access diagrams.
The Memory Layout Problem
Multi-dimensional arrays can be stored in memory in different ways:
2D Array A(3,4) with values:
1 4 7 10
2 5 8 11
3 6 9 12
Column-Major (Fortran): [1,2,3,4,5,6,7,8,9,10,11,12]
Row-Major (C/C++): [1,4,7,10,2,5,8,11,3,6,9,12]
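The two flattened orders above can be checked mechanically. The Python sketch below (illustrative only, not part of the tutorial's Fortran code) flattens the same 3×4 example in both storage orders:

```python
# The 3x4 example matrix; A[i][j] corresponds to Fortran A(i+1, j+1)
A = [[1, 4, 7, 10],
     [2, 5, 8, 11],
     [3, 6, 9, 12]]

rows, cols = 3, 4
# Column-major: first index varies fastest (Fortran)
column_major = [A[i][j] for j in range(cols) for i in range(rows)]
# Row-major: last index varies fastest (C/C++)
row_major = [A[i][j] for i in range(rows) for j in range(cols)]

print(column_major)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(row_major)     # [1, 4, 7, 10, 2, 5, 8, 11, 3, 6, 9, 12]
```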
Visual: Column-Major vs Row-Major Memory Layout
Fortran Column-Major Layout:
Matrix: A(1:3, 1:4)
┌─────┬─────┬─────┬─────┐
│ 1,1 │ 1,2 │ 1,3 │ 1,4 │
├─────┼─────┼─────┼─────┤
│ 2,1 │ 2,2 │ 2,3 │ 2,4 │
├─────┼─────┼─────┼─────┤
│ 3,1 │ 3,2 │ 3,3 │ 3,4 │
└─────┴─────┴─────┴─────┘
Memory Storage (Column by Column):
[A(1,1)][A(2,1)][A(3,1)][A(1,2)][A(2,2)][A(3,2)][A(1,3)][A(2,3)][A(3,3)][A(1,4)][A(2,4)][A(3,4)]
←---- Column 1 ----→ ←---- Column 2 ----→ ←---- Column 3 ----→ ←---- Column 4 ----→
Fast Access: A(:,j) - Entire columns (sequential memory)
Slow Access: A(i,:) - Entire rows (scattered memory, stride = rows)
Basic 2D Array Operations
program basic_2d_arrays
implicit none
integer, parameter :: rows = 1000, cols = 800
real :: matrix_a(rows, cols), matrix_b(rows, cols), result(rows, cols)
integer :: i, j
! Initialize matrices
do j = 1, cols ! Outer loop on columns (efficient)
do i = 1, rows ! Inner loop on rows (contiguous memory)
matrix_a(i, j) = real(i + j)
matrix_b(i, j) = real(i * j) * 0.001
end do
end do
! Efficient 2D processing with OpenACC
!$acc parallel loop collapse(2) copyin(matrix_a, matrix_b) copyout(result)
do j = 1, cols
do i = 1, rows
result(i, j) = matrix_a(i, j) + matrix_b(i, j) * 2.0
end do
end do
!$acc end parallel loop
write(*,*) 'Matrix processing complete'
write(*,*) 'Sample result:', result(500, 400)
end program
Visual: Memory Access Patterns
Efficient Column-Major Access Pattern:
do j = 1, cols ← Outer loop on columns
do i = 1, rows ← Inner loop on rows
array(i,j) = ...
end do
end do
Memory Access: [1,1][2,1][3,1]...[rows,1][1,2][2,2]...[rows,2]...
Sequential → Sequential → Sequential
✓ Fast ✓ Fast ✓ Fast
Inefficient Row-Major Access Pattern:
do i = 1, rows ← Outer loop on rows
do j = 1, cols ← Inner loop on columns
array(i,j) = ...
end do
end do
Memory Access: [1,1][1,2][1,3]...[1,cols][2,1][2,2]...[2,cols]...
Jump → Jump → Jump → Jump
✗ Slow ✗ Slow ✗ Slow ✗ Slow
3D Array Processing
program array_3d_processing
implicit none
integer, parameter :: nx = 100, ny = 120, nz = 80
real :: field_3d(nx, ny, nz), processed_3d(nx, ny, nz)
integer :: i, j, k
! Initialize 3D field with column-major friendly loops
do k = 1, nz
do j = 1, ny
do i = 1, nx ! Innermost loop on first dimension
field_3d(i, j, k) = real(i + j + k) * 0.01
end do
end do
end do
! 3D processing with collapse(3) for maximum parallelism
!$acc parallel loop collapse(3) copyin(field_3d) copyout(processed_3d)
do k = 1, nz
do j = 1, ny
do i = 1, nx
processed_3d(i, j, k) = field_3d(i, j, k) * field_3d(i, j, k) + &
sin(field_3d(i, j, k))
end do
end do
end do
!$acc end parallel loop
write(*,*) '3D processing complete'
end program
Column-wise vs Row-wise Operations
program access_pattern_comparison
implicit none
integer, parameter :: n = 2000, m = 1500
real :: matrix(n, m), column_result(n), row_result(m)
integer :: i, j
real :: start_time, end_time
! Initialize matrix
do j = 1, m
do i = 1, n
matrix(i, j) = real(i * j) * 0.001
end do
end do
write(*,*) 'Comparing column-wise vs row-wise access patterns'
! Column-wise processing (EFFICIENT - follows Fortran layout)
call cpu_time(start_time)
!$acc parallel loop copyin(matrix) copyout(column_result)
do i = 1, n
column_result(i) = 0.0
!$acc loop seq
do j = 1, m
column_result(i) = column_result(i) + matrix(i, j) ! Adjacent threads read adjacent memory (coalesced)
end do
end do
!$acc end parallel loop
call cpu_time(end_time)
write(*,*) 'Column-wise processing time:', end_time - start_time
! Row-wise processing (LESS EFFICIENT - against Fortran layout)
call cpu_time(start_time)
!$acc parallel loop copyin(matrix) copyout(row_result)
do j = 1, m
row_result(j) = 0.0
!$acc loop seq
do i = 1, n
row_result(j) = row_result(j) + matrix(i, j) ! Adjacent threads are a full column apart (strided)
end do
end do
!$acc end parallel loop
call cpu_time(end_time)
write(*,*) 'Row-wise processing time:', end_time - start_time
end program
Array Sections with Multi-dimensional Arrays
program multidim_array_sections
implicit none
integer, parameter :: n = 800, m = 600
real :: large_matrix(n, m), block_result(200, 150)
real :: column_section(n), row_section(m)
integer :: i, j
! Initialize large matrix
do j = 1, m
do i = 1, n
large_matrix(i, j) = sin(real(i) * 0.01) * cos(real(j) * 0.01)
end do
end do
! Process a block section (rectangular subregion)
!$acc parallel loop collapse(2) &
!$acc copyin(large_matrix(100:299, 200:349)) copyout(block_result)
do j = 1, 150
do i = 1, 200
block_result(i, j) = large_matrix(99 + i, 199 + j) * 2.0 + 1.0
end do
end do
!$acc end parallel loop
! Process entire column (EFFICIENT - contiguous memory)
!$acc parallel loop copyin(large_matrix(:, 300)) copyout(column_section)
do i = 1, n
column_section(i) = large_matrix(i, 300) + sqrt(abs(large_matrix(i, 300)))
end do
!$acc end parallel loop
! Process entire row (LESS EFFICIENT - strided memory)
!$acc parallel loop copyin(large_matrix(400, :)) copyout(row_section)
do j = 1, m
row_section(j) = large_matrix(400, j) * large_matrix(400, j)
end do
!$acc end parallel loop
write(*,*) 'Multi-dimensional array sections processed'
write(*,*) 'Block result sample:', block_result(100, 75)
write(*,*) 'Column section sample:', column_section(400)
write(*,*) 'Row section sample:', row_section(300)
end program
Performance Optimization Techniques
Technique 1: Loop Ordering for Column-Major
! ✓ GOOD: Column-major friendly (fast)
do k = 1, nz
do j = 1, ny
do i = 1, nx ! Innermost loop on first dimension
array(i, j, k) = computation()
end do
end do
end do
! ✗ BAD: Against column-major layout (slow)
do i = 1, nx
do j = 1, ny
do k = 1, nz ! Wrong loop ordering
array(i, j, k) = computation()
end do
end do
end do
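The cost difference between the two orderings can be made concrete by walking a small column-major index space in each order and counting how often consecutive accesses are not adjacent in memory. This Python sketch is illustrative only; the dimensions are arbitrary:

```python
# 0-based linear offset in a column-major nx x ny x nz array
nx, ny, nz = 4, 3, 2
off = lambda i, j, k: i + j * nx + k * nx * ny

# GOOD order: k outer, j middle, i inner (first index varies fastest)
good = [off(i, j, k) for k in range(nz) for j in range(ny) for i in range(nx)]
# BAD order: i outer, k inner (last index varies fastest)
bad = [off(i, j, k) for i in range(nx) for j in range(ny) for k in range(nz)]

# Count consecutive accesses that are NOT unit-stride in memory
jumps = lambda seq: sum(1 for a, b in zip(seq, seq[1:]) if b - a != 1)
print(jumps(good), jumps(bad))  # 0 23
```

The good ordering visits memory purely sequentially (zero non-unit jumps); the bad ordering makes every single step a jump.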
Technique 2: Using collapse() Effectively
! For 2D arrays - collapse(2)
!$acc parallel loop collapse(2)
do j = 1, cols
do i = 1, rows
matrix(i, j) = computation()
end do
end do
! For 3D arrays - collapse(3)
!$acc parallel loop collapse(3)
do k = 1, nz
do j = 1, ny
do i = 1, nx
field(i, j, k) = computation()
end do
end do
end do
Visual: 3D Array Memory Layout
3D Array: field(nx, ny, nz) = field(4, 3, 2)
Logical View:
┌───── nz=1 ────┬───── nz=2 ────┐
│ 1,1 1,2 1,3 │ 1,1 1,2 1,3 │
│ 2,1 2,2 2,3 │ 2,1 2,2 2,3 │
│ 3,1 3,2 3,3 │ 3,1 3,2 3,3 │
│ 4,1 4,2 4,3 │ 4,1 4,2 4,3 │
└───────────────┴───────────────┘
Column-Major Memory Layout:
[field(1,1,1)][field(2,1,1)][field(3,1,1)][field(4,1,1)] ← Column 1, Slice 1
[field(1,2,1)][field(2,2,1)][field(3,2,1)][field(4,2,1)] ← Column 2, Slice 1
[field(1,3,1)][field(2,3,1)][field(3,3,1)][field(4,3,1)] ← Column 3, Slice 1
[field(1,1,2)][field(2,1,2)][field(3,1,2)][field(4,1,2)] ← Column 1, Slice 2
[field(1,2,2)][field(2,2,2)][field(3,2,2)][field(4,2,2)] ← Column 2, Slice 2
[field(1,3,2)][field(2,3,2)][field(3,3,2)][field(4,3,2)] ← Column 3, Slice 2
Optimal Loop Order: k → j → i (follows memory layout)
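The layout diagram above follows directly from the column-major offset formula. A small sketch (the `offset` helper is hypothetical, not part of the tutorial code) verifies that the k → j → i loop order visits offsets 0, 1, 2, … in sequence for the 4×3×2 example:

```python
# 0-based linear offset of Fortran element field(i, j, k)
# in a column-major array dimensioned field(nx, ny, nz)
def offset(i, j, k, nx, ny):
    return (i - 1) + (j - 1) * nx + (k - 1) * nx * ny

nx, ny, nz = 4, 3, 2
# Walk k -> j -> i (i innermost), collecting the memory offsets visited
visits = [offset(i, j, k, nx, ny)
          for k in range(1, nz + 1)
          for j in range(1, ny + 1)
          for i in range(1, nx + 1)]
print(visits == list(range(nx * ny * nz)))  # True
print(offset(1, 1, 2, nx, ny))  # 12: slice 2 starts right after slice 1
```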
Data Transfer Optimization
program transfer_optimization
implicit none
integer, parameter :: n = 1000, m = 800, p = 600
real :: matrix_a(n, m), matrix_b(n, m), result_2d(n, m)
real :: tensor_3d(n, m, p), output_3d(n, m, p)
integer :: i, j, k
! Initialize arrays
do j = 1, m
do i = 1, n
matrix_a(i, j) = real(i + j) * 0.01
matrix_b(i, j) = real(i - j) * 0.02
end do
end do
do k = 1, p
do j = 1, m
do i = 1, n
tensor_3d(i, j, k) = real(i * j * k) * 0.001
end do
end do
end do
! Efficient data region for multiple operations
!$acc data copyin(matrix_a, matrix_b, tensor_3d) &
!$acc copyout(result_2d, output_3d)
! 2D matrix operations
!$acc parallel loop collapse(2)
do j = 1, m
do i = 1, n
result_2d(i, j) = matrix_a(i, j) + matrix_b(i, j) * 2.0
end do
end do
!$acc end parallel loop
! 3D tensor processing
!$acc parallel loop collapse(3)
do k = 1, p
do j = 1, m
do i = 1, n
output_3d(i, j, k) = tensor_3d(i, j, k) * result_2d(i, j) + &
sin(tensor_3d(i, j, k))
end do
end do
end do
!$acc end parallel loop
!$acc end data
write(*,*) 'Multi-dimensional processing complete'
end program
Memory Coalescing and Performance
Understanding Memory Coalescing
GPU Memory Coalescing with Column-Major Arrays:
Good Coalescing (Column-Major Friendly):
Thread 1: matrix(1, j) ← Sequential addresses
Thread 2: matrix(2, j) ← Next sequential address
Thread 3: matrix(3, j) ← Next sequential address
Thread 4: matrix(4, j) ← Next sequential address
Result: Single coalesced memory transaction ✓
Poor Coalescing (Row-Major Pattern):
Thread 1: matrix(i, 1) ← Address A
Thread 2: matrix(i, 2) ← Address A + stride
Thread 3: matrix(i, 3) ← Address A + 2*stride
Thread 4: matrix(i, 4) ← Address A + 3*stride
Result: Multiple scattered memory transactions ✗
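A rough model of this effect (an assumption for illustration, not a measurement of any particular GPU) counts how many 128-byte memory segments a warp of 32 threads touches under each pattern, with 4-byte reals in a column-major matrix(n, m):

```python
# Count distinct fixed-size memory segments touched by a set of
# element offsets (4-byte elements, 128-byte segments assumed)
def segments_touched(offsets, seg_bytes=128, elem_bytes=4):
    return len({(o * elem_bytes) // seg_bytes for o in offsets})

n = 1024               # leading dimension (arbitrary for the model)
threads = range(32)    # one warp

# Column-friendly: thread t reads matrix(t+1, j) -> consecutive offsets
coalesced = segments_touched([t for t in threads])
# Row-style: thread t reads matrix(i, t+1) -> offsets n elements apart
strided = segments_touched([t * n for t in threads])

print(coalesced, strided)  # 1 32
```

One transaction versus thirty-two for the same amount of useful data, which is the bandwidth gap the table below summarizes.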
Performance Comparison Table
┌───────────────────┬───────────────────┬────────────────┬──────────────────────┐
│ Access Pattern    │ Memory Layout Fit │ GPU Coalescing │ Relative Performance │
├───────────────────┼───────────────────┼────────────────┼──────────────────────┤
│ Column sections   │ Perfect           │ Excellent      │ 100% (baseline)      │
│ Contiguous blocks │ Good              │ Very Good      │ 95-98%               │
│ Row sections      │ Poor              │ Poor           │ 60-80%               │
│ Scattered access  │ Very Poor         │ Very Poor      │ 30-50%               │
└───────────────────┴───────────────────┴────────────────┴──────────────────────┘
Advanced Multi-dimensional Techniques
Technique 1: Tiled Processing
! Process matrix in tiles for better cache usage
! (it, jt are tile-start indices; a per-tile copyin is shown for
! clarity, but in practice wrap the tile loops in one !$acc data
! region so the matrix is transferred only once)
integer, parameter :: tile_size = 64
do jt = 1, m, tile_size
do it = 1, n, tile_size
!$acc parallel loop collapse(2) &
!$acc copyin(matrix(it:min(it+tile_size-1,n), jt:min(jt+tile_size-1,m)))
do j = jt, min(jt + tile_size - 1, m)
do i = it, min(it + tile_size - 1, n)
! Process tile
end do
end do
!$acc end parallel loop
end do
end do
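The tile-bound arithmetic above (`min(it+tile_size-1, n)`) clips edge tiles so they never run past the array bounds. A small sketch of that logic (names hypothetical):

```python
# Generate inclusive (start, end) tile bounds for indices 1..limit,
# clipping the final tile at the array bound
def tiles(limit, tile_size):
    return [(start, min(start + tile_size - 1, limit))
            for start in range(1, limit + 1, tile_size)]

print(tiles(10, 4))  # [(1, 4), (5, 8), (9, 10)] - last tile is clipped
print(tiles(8, 4))   # [(1, 4), (5, 8)]          - exact fit, no clipping
```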
Technique 2: Dimension-aware Processing
! Choose processing order based on array dimensions
if (nx > ny .and. nx > nz) then
! Process along x-dimension (first index) for best performance
!$acc parallel loop collapse(3)
do k = 1, nz
do j = 1, ny
do i = 1, nx ! Innermost on largest dimension
array(i, j, k) = computation()
end do
end do
end do
end if
Common Multi-dimensional Patterns
Pattern 1: Matrix-Matrix Operations
!$acc parallel loop collapse(2) copyin(A, B) copyout(C)
do j = 1, n
do i = 1, n
C(i, j) = A(i, j) + B(i, j)
end do
end do
Pattern 2: 3D Stencil Operations
! Read from field, write to field_new: an in-place update would race,
! with some threads reading neighbors that others have already changed
!$acc parallel loop collapse(3) copyin(field) copyout(field_new)
do k = 2, nz-1
do j = 2, ny-1
do i = 2, nx-1
field_new(i, j, k) = 0.8 * field(i, j, k) + 0.2 * ( &
field(i-1, j, k) + field(i+1, j, k) + &
field(i, j-1, k) + field(i, j+1, k) + &
field(i, j, k-1) + field(i, j, k+1)) / 6.0
end do
end do
end do
Pattern 3: Dimension Reduction
! Sum along first dimension (efficient)
!$acc parallel loop copyin(matrix) copyout(column_sums)
do j = 1, cols
column_sums(j) = 0.0
do i = 1, rows
column_sums(j) = column_sums(j) + matrix(i, j)
end do
end do
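The reason this reduction direction is cheap is visible in the flat storage: each column sum reads one contiguous run of `rows` elements. A Python sketch over a stand-in flat buffer (0-based, illustrative only):

```python
# Column-major flat buffer for the 3x4 example matrix [1..12]
rows, cols = 3, 4
flat = list(range(1, rows * cols + 1))

# Column j occupies the contiguous slice flat[j*rows : (j+1)*rows]
column_sums = [sum(flat[j * rows:(j + 1) * rows]) for j in range(cols)]
print(column_sums)  # [6, 15, 24, 33]
```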
Best Practices
✅ DO:
- Use column-major friendly loop ordering (i innermost)
- Leverage collapse() for multi-dimensional arrays
- Process contiguous memory sections when possible
- Consider array sections for partial processing
- Use data regions for multiple multi-dimensional operations
❌ DON’T:
- Use row-major loop ordering (i outermost)
- Ignore memory layout when designing algorithms
- Access non-contiguous sections unnecessarily
- Forget about cache and coalescing effects
- Mix different access patterns in the same kernel
Quick Reference
! Optimal 2D loop structure:
!$acc parallel loop collapse(2)
do j = 1, cols
do i = 1, rows ! Inner loop on first dimension
matrix(i, j) = computation()
end do
end do
! Optimal 3D loop structure:
!$acc parallel loop collapse(3)
do k = 1, nz
do j = 1, ny
do i = 1, nx ! Inner loop on first dimension
field(i, j, k) = computation()
end do
end do
end do
! Array sections:
matrix(start_row:end_row, start_col:end_col) ! Rectangular block
matrix(:, column_num) ! Entire column (fast)
matrix(row_num, :) ! Entire row (slower)
Quick Summary
Fortran Memory Layout (Column-Major):
┌─────────────────────────┬────────────────────────────────────────────────┐
│ Array Type │ Memory Storage Pattern │
├─────────────────────────┼────────────────────────────────────────────────┤
│ 2D: matrix(i,j) │ [col1: (1,1)(2,1)...(n,1)] [col2: (1,2)...] │
│ 3D: field(i,j,k) │ [slice1: columns] [slice2: columns] ... │
│ First index varies fast │ Contiguous elements: A(1,j), A(2,j), A(3,j) │
│ Last index varies slow │ Strided elements: A(i,1), A(i,2), A(i,3) │
└─────────────────────────┴────────────────────────────────────────────────┘
Optimal Loop Ordering for Performance:
┌─────────────────────────┬────────────────────────────────────────────────┐
│ Array Dimensions │ Optimal Loop Structure │
├─────────────────────────┼────────────────────────────────────────────────┤
│ 2D: array(i,j) │ do j = 1, cols │
│ │ do i = 1, rows ! Inner loop on 1st dim │
│ 3D: array(i,j,k) │ do k = 1, nz │
│ │ do j = 1, ny │
│ │ do i = 1, nx ! Inner loop on 1st dim │
│ 4D: array(i,j,k,l) │ do l = 1, nl │
│ │ do k = 1, nk │
│ │ do j = 1, nj │
│ │ do i = 1, ni ! Inner loop on 1st dim │
└─────────────────────────┴────────────────────────────────────────────────┘
OpenACC Collapse Directive Usage:
┌─────────────────────────┬────────────────────────────────────────────────┐
│ Array Type │ Recommended OpenACC Directive │
├─────────────────────────┼────────────────────────────────────────────────┤
│ 2D Arrays │ !$acc parallel loop collapse(2) │
│ 3D Arrays │ !$acc parallel loop collapse(3) │
│ 4D+ Arrays │ !$acc parallel loop collapse(3 or 4) │
│ Large outer dimensions │ Collapse outer loops for more parallelism │
│ Small outer dimensions │ May collapse fewer dimensions │
└─────────────────────────┴────────────────────────────────────────────────┘
Memory Access Performance Comparison:
┌─────────────────────────┬──────────────────┬──────────────────────────────┐
│ Access Pattern │ Memory Pattern │ Relative Performance │
├─────────────────────────┼──────────────────┼──────────────────────────────┤
│ Column access A(:,j) │ Sequential │ 100% (optimal) │
│ Contiguous blocks │ Sequential │ 95-98% │
│ Small strides │ Regular pattern │ 80-90% │
│ Row access A(i,:) │ Strided │ 60-80% │
│ Large strides │ Scattered │ 40-60% │
│ Random access │ Unpredictable │ 30-50% │
└─────────────────────────┴──────────────────┴──────────────────────────────┘
Array Section Efficiency:
┌─────────────────────────────────────────────┬──────────────────────────────┐
│ Array Section Type │ Efficiency Notes │
├─────────────────────────────────────────────┼──────────────────────────────┤
│ matrix(start_row:end_row, start_col:end_col) │ Rectangular block - Good │
│ matrix(:, column_number) │ Full column - Excellent │
│ matrix(row_number, :) │ Full row - Moderate │
│ matrix(1:n:2, 1:m:2) │ Strided - Poor │
│ field(:, :, slice) │ 2D slice - Very Good │
│ field(x, :, :) │ Cross-section - Moderate │
└─────────────────────────────────────────────┴──────────────────────────────┘
GPU Memory Coalescing Impact:
• Column-major access → Coalesced memory transactions → High bandwidth
• Row-major access → Non-coalesced transactions → Reduced bandwidth
• Contiguous access → Fewer memory transactions → Better performance
• Strided access → More memory transactions → Lower performance
Common Multi-dimensional Applications:
┌─────────────────────────────────────────────┬──────────────────────────────┐
│ Application Domain │ Typical Array Usage │
├─────────────────────────────────────────────┼──────────────────────────────┤
│ Image Processing │ 2D pixel arrays │
│ Scientific Simulation │ 2D/3D field grids │
│ Finite Element Analysis │ 2D mesh, 3D stress tensors │
│ Computational Fluid Dynamics │ 3D velocity/pressure fields │
│ Climate Modeling │ 3D atmospheric grids │
│ Machine Learning │ Multi-dimensional tensors │
│ Signal Processing │ 2D/3D frequency domains │
└─────────────────────────────────────────────┴──────────────────────────────┘
Best Practices Summary:
✓ Always use column-major friendly loop ordering (i innermost)
✓ Leverage collapse() directive for maximum parallelism
✓ Process contiguous memory sections when possible
✓ Consider cache effects and memory coalescing
✓ Use appropriate array sections for partial processing
✓ Group multi-dimensional operations in data regions
Performance Debugging Tips:
• Profile different loop orderings to measure impact
• Use compiler feedback to verify vectorization/acceleration
• Monitor memory bandwidth utilization
• Test with different array sizes to identify bottlenecks
Example Code
Let us consider the following OpenACC code –
program multidim_demo
! Demonstrates OpenACC with multi-dimensional Fortran arrays
! Focuses on column-major memory layout and performance implications
implicit none
integer, parameter :: n2d = 800, m2d = 600
integer, parameter :: n3d = 80, m3d = 60, p3d = 50
real :: matrix_a(n2d, m2d), matrix_b(n2d, m2d), matrix_result(n2d, m2d)
real :: field_3d(n3d, m3d, p3d), processed_3d(n3d, m3d, p3d)
real :: column_section(n2d), row_section(m2d)
real :: block_result(200, 150)
real :: test_matrix(400, 500), sum_result
integer :: i, j, k, test_i, test_j
real :: start_time, end_time
write(*,*) 'Multi-dimensional Arrays in OpenACC'
write(*,*) '=================================='
write(*,*) ''
write(*,*) 'Demonstration 1: 2D Matrix Operations'
write(*,*) '(Column-major friendly loop ordering)'
write(*,*) ''
! Initialize 2D matrices with column-major friendly loops
write(*,*) 'Initializing 2D matrices...'
do j = 1, m2d ! Outer loop on second dimension (columns)
do i = 1, n2d ! Inner loop on first dimension (rows) - EFFICIENT
matrix_a(i, j) = sin(real(i) * 0.01) + cos(real(j) * 0.01)
matrix_b(i, j) = real(i * j) * 0.0001
end do
end do
write(*,'(A,I0,A,I0,A)') '✓ Initialized ', n2d, '×', m2d, ' matrices'
write(*,'(A,F8.4)') 'Sample A value: ', matrix_a(400, 300)
write(*,'(A,F8.4)') 'Sample B value: ', matrix_b(400, 300)
! 2D matrix processing with OpenACC
write(*,*) ''
write(*,*) 'Processing 2D matrices on GPU...'
!$acc parallel loop collapse(2) copyin(matrix_a, matrix_b) copyout(matrix_result)
do j = 1, m2d
do i = 1, n2d
matrix_result(i, j) = matrix_a(i, j) * matrix_b(i, j) + &
sqrt(abs(matrix_a(i, j))) * 0.5
end do
end do
!$acc end parallel loop
write(*,*) '✓ 2D matrix processing complete'
write(*,'(A,F8.4)') 'Sample result: ', matrix_result(400, 300)
write(*,*) '✓ Used collapse(2) for optimal 2D parallelization'
write(*,*) ''
write(*,*) 'Demonstration 2: 3D Array Processing'
write(*,*) '(3D field with collapse(3) parallelization)'
write(*,*) ''
! Initialize 3D field with optimal loop ordering
write(*,'(A,I0,A,I0,A,I0,A)') 'Initializing 3D field: ', n3d, '×', m3d, '×', p3d
do k = 1, p3d ! Outermost loop on third dimension
do j = 1, m3d ! Middle loop on second dimension
do i = 1, n3d ! Inner loop on first dimension - OPTIMAL
field_3d(i, j, k) = real(i + j + k) * 0.01 + &
sin(real(i) * 0.05) * cos(real(j) * 0.05) * sin(real(k) * 0.05)
end do
end do
end do
write(*,*) '✓ 3D field initialized with column-major friendly loops'
write(*,'(A,F8.4)') 'Sample 3D value: ', field_3d(40, 30, 25)
! 3D processing with collapse(3)
write(*,*) ''
write(*,*) 'Processing 3D field on GPU...'
!$acc parallel loop collapse(3) copyin(field_3d) copyout(processed_3d)
do k = 1, p3d
do j = 1, m3d
do i = 1, n3d
processed_3d(i, j, k) = field_3d(i, j, k) * field_3d(i, j, k) + &
exp(field_3d(i, j, k) * 0.1) * 0.01
end do
end do
end do
!$acc end parallel loop
write(*,*) '✓ 3D field processing complete'
write(*,'(A,F8.4)') 'Sample 3D result: ', processed_3d(40, 30, 25)
write(*,*) '✓ Used collapse(3) for maximum 3D parallelization'
write(*,*) ''
write(*,*) 'Demonstration 3: Array Sections with Multi-dimensional Arrays'
write(*,*) '(Comparing different access patterns)'
write(*,*) ''
! Process a rectangular block (efficient)
write(*,*) 'Processing rectangular matrix block...'
!$acc parallel loop collapse(2) &
!$acc copyin(matrix_a(200:399, 150:299)) copyout(block_result)
do j = 1, 150
do i = 1, 200
block_result(i, j) = matrix_a(199 + i, 149 + j) * 3.0 + 1.0
end do
end do
!$acc end parallel loop
write(*,*) '✓ Block processing complete (contiguous memory)'
write(*,'(A,F8.4)') 'Block result sample: ', block_result(100, 75)
write(*,*) ''
! Process entire column (EFFICIENT - sequential memory access)
write(*,*) 'Processing entire column (efficient access)...'
call cpu_time(start_time)
!$acc parallel loop copyin(matrix_a(:, 400)) copyout(column_section)
do i = 1, n2d
column_section(i) = matrix_a(i, 400) + sqrt(abs(matrix_a(i, 400)))
end do
!$acc end parallel loop
call cpu_time(end_time)
write(*,'(A,F6.4,A)') '✓ Column processing time: ', end_time - start_time, ' seconds'
write(*,'(A,F8.4)') 'Column result sample: ', column_section(400)
write(*,*) '✓ Sequential memory access (cache-friendly)'
write(*,*) ''
! Process entire row (LESS EFFICIENT - strided memory access)
write(*,*) 'Processing entire row (less efficient access)...'
call cpu_time(start_time)
!$acc parallel loop copyin(matrix_a(500, :)) copyout(row_section)
do j = 1, m2d
row_section(j) = matrix_a(500, j) * matrix_a(500, j)
end do
!$acc end parallel loop
call cpu_time(end_time)
write(*,'(A,F6.4,A)') '✓ Row processing time: ', end_time - start_time, ' seconds'
write(*,'(A,F8.4)') 'Row result sample: ', row_section(300)
write(*,*) '⚠ Strided memory access (less cache-friendly)'
write(*,*) ''
write(*,*) 'Demonstration 4: Memory Access Pattern Comparison'
write(*,*) '(Column-major vs Row-major loop ordering)'
write(*,*) ''
! Demonstrate optimal vs suboptimal loop ordering
! Initialize test matrix
do test_j = 1, 500
do test_i = 1, 400
test_matrix(test_i, test_j) = real(test_i + test_j) * 0.01
end do
end do
! OPTIMAL: Column-major friendly (j outer, i inner)
write(*,*) 'Testing OPTIMAL loop ordering (column-major friendly)...'
call cpu_time(start_time)
sum_result = 0.0
!$acc parallel loop collapse(2) copyin(test_matrix) reduction(+:sum_result)
do test_j = 1, 500 ! Outer loop on columns
do test_i = 1, 400 ! Inner loop on rows (follows memory layout)
sum_result = sum_result + test_matrix(test_i, test_j)
end do
end do
!$acc end parallel loop
call cpu_time(end_time)
write(*,'(A,F6.4,A)') '✓ Optimal ordering time: ', end_time - start_time, ' seconds'
write(*,'(A,F10.2)') 'Sum result: ', sum_result
write(*,*) '✓ Follows Fortran column-major memory layout'
write(*,*) ''
! SUBOPTIMAL: Row-major style (i outer, j inner)
! (note: with collapse(2) the compiler remaps the combined iteration
! space itself, so on the GPU both orderings may time almost the same;
! loop ordering matters most for CPU code and non-collapsed loops)
write(*,*) 'Testing SUBOPTIMAL loop ordering (row-major style)...'
call cpu_time(start_time)
sum_result = 0.0
!$acc parallel loop collapse(2) copyin(test_matrix) reduction(+:sum_result)
do test_i = 1, 400 ! Outer loop on rows
do test_j = 1, 500 ! Inner loop on columns (against memory layout)
sum_result = sum_result + test_matrix(test_i, test_j)
end do
end do
!$acc end parallel loop
call cpu_time(end_time)
write(*,'(A,F6.4,A)') '⚠ Suboptimal ordering time: ', end_time - start_time, ' seconds'
write(*,'(A,F10.2)') 'Sum result: ', sum_result
write(*,*) '⚠ Against Fortran column-major memory layout'
write(*,*) ''
write(*,*) 'Multi-dimensional Array Performance Summary:'
write(*,*) '==========================================='
write(*,*) 'Memory Layout: Fortran uses Column-Major order'
write(*,*) '• Elements stored column by column in memory'
write(*,*) '• First index varies fastest in memory'
write(*,*) ''
write(*,*) 'Optimal Access Patterns:'
write(*,*) '• Inner loop on first dimension (i)'
write(*,*) '• Outer loops on higher dimensions (j, k)'
write(*,*) '• Column sections faster than row sections'
write(*,*) '• Contiguous blocks more efficient than scattered'
write(*,*) ''
write(*,*) 'OpenACC Best Practices:'
write(*,*) '• Use collapse(2) for 2D arrays'
write(*,*) '• Use collapse(3) for 3D arrays'
write(*,*) '• Follow column-major loop ordering'
write(*,*) '• Process contiguous sections when possible'
write(*,*) '• Consider memory coalescing on GPU'
write(*,*) ''
write(*,*) 'Performance Impact:'
write(*,*) '• Column-major friendly: 100% performance'
write(*,*) '• Row-major style: 60-80% performance'
write(*,*) '• Random access: 30-50% performance'
write(*,*) ''
write(*,*) 'Common Use Cases:'
write(*,*) '• Scientific simulations (2D/3D grids)'
write(*,*) '• Image processing (2D pixel arrays)'
write(*,*) '• Finite element methods (mesh arrays)'
write(*,*) '• Computational fluid dynamics (field arrays)'
write(*,*) '• Matrix computations (linear algebra)'
end program multidim_demo
To compile this code –
nvfortran -acc -Minfo=accel -O2 multidim_demo.f90 -o multidim_demo
To execute this code –
./multidim_demo
Sample output –
Multi-dimensional Arrays in OpenACC
==================================
Demonstration 1: 2D Matrix Operations
(Column-major friendly loop ordering)
Initializing 2D matrices...
✓ Initialized 800×600 matrices
Sample A value: -1.7468
Sample B value: 12.0000
Processing 2D matrices on GPU...
✓ 2D matrix processing complete
Sample result: -20.3007
✓ Used collapse(2) for optimal 2D parallelization
Demonstration 2: 3D Array Processing
(3D field with collapse(3) parallelization)
Initializing 3D field: 80×60×50
✓ 3D field initialized with column-major friendly loops
Sample 3D value: 1.0110
Processing 3D field on GPU...
✓ 3D field processing complete
Sample 3D result: 1.0333
✓ Used collapse(3) for maximum 3D parallelization
Demonstration 3: Array Sections with Multi-dimensional Arrays
(Comparing different access patterns)
Processing rectangular matrix block...
✓ Block processing complete (contiguous memory)
Block result sample: -0.4080
Processing entire column (efficient access)...
✓ Column processing time: 0.0000 seconds
Column result sample: -0.2228
✓ Sequential memory access (cache-friendly)
Processing entire row (less efficient access)...
✓ Row processing time: 0.0000 seconds
Row result sample: 3.7983
⚠ Strided memory access (less cache-friendly)
Demonstration 4: Memory Access Pattern Comparison
(Column-major vs Row-major loop ordering)
Testing OPTIMAL loop ordering (column-major friendly)...
✓ Optimal ordering time: 0.0004 seconds
Sum result: 902000.00
✓ Follows Fortran column-major memory layout
Testing SUBOPTIMAL loop ordering (row-major style)...
⚠ Suboptimal ordering time: 0.0003 seconds
Sum result: 902000.00
⚠ Against Fortran column-major memory layout
Multi-dimensional Array Performance Summary:
===========================================
Memory Layout: Fortran uses Column-Major order
• Elements stored column by column in memory
• First index varies fastest in memory
Optimal Access Patterns:
• Inner loop on first dimension (i)
• Outer loops on higher dimensions (j, k)
• Column sections faster than row sections
• Contiguous blocks more efficient than scattered
OpenACC Best Practices:
• Use collapse(2) for 2D arrays
• Use collapse(3) for 3D arrays
• Follow column-major loop ordering
• Process contiguous sections when possible
• Consider memory coalescing on GPU
Performance Impact:
• Column-major friendly: 100% performance
• Row-major style: 60-80% performance
• Random access: 30-50% performance
Common Use Cases:
• Scientific simulations (2D/3D grids)
• Image processing (2D pixel arrays)
• Finite element methods (mesh arrays)
• Computational fluid dynamics (field arrays)
• Matrix computations (linear algebra)