Think of multi-dimensional arrays like a bookshelf – in Fortran, books are arranged column by column (column-major), while in C they’re arranged row by row (row-major). Understanding this layout is crucial for GPU performance because accessing data in the wrong order is like reaching across the entire bookshelf instead of taking books from the same shelf.
Learn the implications of column-major order, master 2D/3D array transfers, understand Fortran's memory layout, and see the performance consequences illustrated with visual memory-access diagrams.
The Memory Layout Problem
Multi-dimensional arrays can be stored in memory in different ways:
2D Array A(3,4) with values:
1 4 7 10
2 5 8 11
3 6 9 12
Column-Major (Fortran): [1,2,3,4,5,6,7,8,9,10,11,12]
Row-Major (C/C++): [1,4,7,10,2,5,8,11,3,6,9,12]
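The two flattened orders above can be checked mechanically. The Python sketch below (illustrative only, not part of the tutorial's Fortran code) flattens the same 3×4 example in both storage orders:

```python
# The 3x4 example matrix; A[i][j] corresponds to Fortran A(i+1, j+1)
A = [[1, 4, 7, 10],
     [2, 5, 8, 11],
     [3, 6, 9, 12]]

rows, cols = 3, 4
# Column-major: first index varies fastest (Fortran)
column_major = [A[i][j] for j in range(cols) for i in range(rows)]
# Row-major: last index varies fastest (C/C++)
row_major = [A[i][j] for i in range(rows) for j in range(cols)]

print(column_major)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(row_major)     # [1, 4, 7, 10, 2, 5, 8, 11, 3, 6, 9, 12]
```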
Visual: Column-Major vs Row-Major Memory Layout
Fortran Column-Major Layout:
Matrix: A(1:3, 1:4)
┌─────┬─────┬─────┬─────┐
│ 1,1 │ 1,2 │ 1,3 │ 1,4 │
├─────┼─────┼─────┼─────┤
│ 2,1 │ 2,2 │ 2,3 │ 2,4 │
├─────┼─────┼─────┼─────┤
│ 3,1 │ 3,2 │ 3,3 │ 3,4 │
└─────┴─────┴─────┴─────┘
Memory Storage (Column by Column):
[A(1,1)][A(2,1)][A(3,1)][A(1,2)][A(2,2)][A(3,2)][A(1,3)][A(2,3)][A(3,3)][A(1,4)][A(2,4)][A(3,4)]
←---- Column 1 ----→ ←---- Column 2 ----→ ←---- Column 3 ----→ ←---- Column 4 ----→
Fast Access: A(:,j) - Entire columns (sequential memory)
Slow Access: A(i,:) - Entire rows (scattered memory, stride = rows)
Basic 2D Array Operations
program basic_2d_arrays
implicit none
integer, parameter :: rows = 1000, cols = 800
real :: matrix_a(rows, cols), matrix_b(rows, cols), result(rows, cols)
integer :: i, j
! Initialize matrices
do j = 1, cols ! Outer loop on columns (efficient)
do i = 1, rows ! Inner loop on rows (contiguous memory)
matrix_a(i, j) = real(i + j)
matrix_b(i, j) = real(i * j) * 0.001
end do
end do
! Efficient 2D processing with OpenACC
!$acc parallel loop collapse(2) copyin(matrix_a, matrix_b) copyout(result)
do j = 1, cols
do i = 1, rows
result(i, j) = matrix_a(i, j) + matrix_b(i, j) * 2.0
end do
end do
!$acc end parallel loop
write(*,*) 'Matrix processing complete'
write(*,*) 'Sample result:', result(500, 400)
end program
Visual: Memory Access Patterns
Efficient Column-Major Access Pattern:
do j = 1, cols ← Outer loop on columns
do i = 1, rows ← Inner loop on rows
array(i,j) = ...
end do
end do
Memory Access: [1,1][2,1][3,1]...[rows,1][1,2][2,2]...[rows,2]...
Sequential → Sequential → Sequential
✓ Fast ✓ Fast ✓ Fast
Inefficient Row-Major Access Pattern:
do i = 1, rows ← Outer loop on rows
do j = 1, cols ← Inner loop on columns
array(i,j) = ...
end do
end do
Memory Access: [1,1][1,2][1,3]...[1,cols][2,1][2,2]...[2,cols]...
Jump → Jump → Jump → Jump
✗ Slow ✗ Slow ✗ Slow ✗ Slow
3D Array Processing
program array_3d_processing
implicit none
integer, parameter :: nx = 100, ny = 120, nz = 80
real :: field_3d(nx, ny, nz), processed_3d(nx, ny, nz)
integer :: i, j, k
! Initialize 3D field with column-major friendly loops
do k = 1, nz
do j = 1, ny
do i = 1, nx ! Innermost loop on first dimension
field_3d(i, j, k) = real(i + j + k) * 0.01
end do
end do
end do
! 3D processing with collapse(3) for maximum parallelism
!$acc parallel loop collapse(3) copyin(field_3d) copyout(processed_3d)
do k = 1, nz
do j = 1, ny
do i = 1, nx
processed_3d(i, j, k) = field_3d(i, j, k) * field_3d(i, j, k) + &
sin(field_3d(i, j, k))
end do
end do
end do
!$acc end parallel loop
write(*,*) '3D processing complete'
end program
Column-wise vs Row-wise Operations
program access_pattern_comparison
implicit none
integer, parameter :: n = 2000, m = 1500
real :: matrix(n, m), column_result(n), row_result(m)
integer :: i, j
real :: start_time, end_time
! Initialize matrix
do j = 1, m
do i = 1, n
matrix(i, j) = real(i * j) * 0.001
end do
end do
write(*,*) 'Comparing column-wise vs row-wise access patterns'
! Column-wise processing (EFFICIENT - follows Fortran layout)
call cpu_time(start_time)
!$acc parallel loop copyin(matrix) copyout(column_result)
do i = 1, n
column_result(i) = 0.0
!$acc loop seq
do j = 1, m
column_result(i) = column_result(i) + matrix(i, j) ! Adjacent threads read adjacent memory (coalesced)
end do
end do
!$acc end parallel loop
call cpu_time(end_time)
write(*,*) 'Column-wise processing time:', end_time - start_time
! Row-wise processing (LESS EFFICIENT - against Fortran layout)
call cpu_time(start_time)
!$acc parallel loop copyin(matrix) copyout(row_result)
do j = 1, m
row_result(j) = 0.0
!$acc loop seq
do i = 1, n
row_result(j) = row_result(j) + matrix(i, j) ! Adjacent threads are a full column apart (strided)
end do
end do
!$acc end parallel loop
call cpu_time(end_time)
write(*,*) 'Row-wise processing time:', end_time - start_time
end program
Array Sections with Multi-dimensional Arrays
program multidim_array_sections
implicit none
integer, parameter :: n = 800, m = 600
real :: large_matrix(n, m), block_result(200, 150)
real :: column_section(n), row_section(m)
integer :: i, j
! Initialize large matrix
do j = 1, m
do i = 1, n
large_matrix(i, j) = sin(real(i) * 0.01) * cos(real(j) * 0.01)
end do
end do
! Process a block section (rectangular subregion)
!$acc parallel loop collapse(2) &
!$acc copyin(large_matrix(100:299, 200:349)) copyout(block_result)
do j = 1, 150
do i = 1, 200
block_result(i, j) = large_matrix(99 + i, 199 + j) * 2.0 + 1.0
end do
end do
!$acc end parallel loop
! Process entire column (EFFICIENT - contiguous memory)
!$acc parallel loop copyin(large_matrix(:, 300)) copyout(column_section)
do i = 1, n
column_section(i) = large_matrix(i, 300) + sqrt(abs(large_matrix(i, 300)))
end do
!$acc end parallel loop
! Process entire row (LESS EFFICIENT - strided memory)
!$acc parallel loop copyin(large_matrix(400, :)) copyout(row_section)
do j = 1, m
row_section(j) = large_matrix(400, j) * large_matrix(400, j)
end do
!$acc end parallel loop
write(*,*) 'Multi-dimensional array sections processed'
write(*,*) 'Block result sample:', block_result(100, 75)
write(*,*) 'Column section sample:', column_section(400)
write(*,*) 'Row section sample:', row_section(300)
end program
Performance Optimization Techniques
Technique 1: Loop Ordering for Column-Major
! ✓ GOOD: Column-major friendly (fast)
do k = 1, nz
do j = 1, ny
do i = 1, nx ! Innermost loop on first dimension
array(i, j, k) = computation()
end do
end do
end do
! ✗ BAD: Against column-major layout (slow)
do i = 1, nx
do j = 1, ny
do k = 1, nz ! Wrong loop ordering
array(i, j, k) = computation()
end do
end do
end do
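The cost difference between the two orderings can be made concrete by walking a small column-major index space in each order and counting how often consecutive accesses are not adjacent in memory. This Python sketch is illustrative only; the dimensions are arbitrary:

```python
# 0-based linear offset in a column-major nx x ny x nz array
nx, ny, nz = 4, 3, 2
off = lambda i, j, k: i + j * nx + k * nx * ny

# GOOD order: k outer, j middle, i inner (first index varies fastest)
good = [off(i, j, k) for k in range(nz) for j in range(ny) for i in range(nx)]
# BAD order: i outer, k inner (last index varies fastest)
bad = [off(i, j, k) for i in range(nx) for j in range(ny) for k in range(nz)]

# Count consecutive accesses that are NOT unit-stride in memory
jumps = lambda seq: sum(1 for a, b in zip(seq, seq[1:]) if b - a != 1)
print(jumps(good), jumps(bad))  # 0 23
```

The good ordering visits memory purely sequentially (zero non-unit jumps); the bad ordering makes every single step a jump.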
Technique 2: Using collapse() Effectively
! For 2D arrays - collapse(2)
!$acc parallel loop collapse(2)
do j = 1, cols
do i = 1, rows
matrix(i, j) = computation()
end do
end do
! For 3D arrays - collapse(3)
!$acc parallel loop collapse(3)
do k = 1, nz
do j = 1, ny
do i = 1, nx
field(i, j, k) = computation()
end do
end do
end do
Visual: 3D Array Memory Layout
3D Array: field(nx, ny, nz) = field(4, 3, 2)
Logical View:
┌───── nz=1 ────┬───── nz=2 ────┐
│ 1,1 1,2 1,3 │ 1,1 1,2 1,3 │
│ 2,1 2,2 2,3 │ 2,1 2,2 2,3 │
│ 3,1 3,2 3,3 │ 3,1 3,2 3,3 │
│ 4,1 4,2 4,3 │ 4,1 4,2 4,3 │
└───────────────┴───────────────┘
Column-Major Memory Layout:
[field(1,1,1)][field(2,1,1)][field(3,1,1)][field(4,1,1)] ← Column 1, Slice 1
[field(1,2,1)][field(2,2,1)][field(3,2,1)][field(4,2,1)] ← Column 2, Slice 1
[field(1,3,1)][field(2,3,1)][field(3,3,1)][field(4,3,1)] ← Column 3, Slice 1
[field(1,1,2)][field(2,1,2)][field(3,1,2)][field(4,1,2)] ← Column 1, Slice 2
[field(1,2,2)][field(2,2,2)][field(3,2,2)][field(4,2,2)] ← Column 2, Slice 2
[field(1,3,2)][field(2,3,2)][field(3,3,2)][field(4,3,2)] ← Column 3, Slice 2
Optimal Loop Order: k → j → i (follows memory layout)
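The layout diagram above follows directly from the column-major offset formula. A small sketch (the `offset` helper is hypothetical, not part of the tutorial code) verifies that the k → j → i loop order visits offsets 0, 1, 2, … in sequence for the 4×3×2 example:

```python
# 0-based linear offset of Fortran element field(i, j, k)
# in a column-major array dimensioned field(nx, ny, nz)
def offset(i, j, k, nx, ny):
    return (i - 1) + (j - 1) * nx + (k - 1) * nx * ny

nx, ny, nz = 4, 3, 2
# Walk k -> j -> i (i innermost), collecting the memory offsets visited
visits = [offset(i, j, k, nx, ny)
          for k in range(1, nz + 1)
          for j in range(1, ny + 1)
          for i in range(1, nx + 1)]
print(visits == list(range(nx * ny * nz)))  # True
print(offset(1, 1, 2, nx, ny))  # 12: slice 2 starts right after slice 1
```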
Data Transfer Optimization
program transfer_optimization
implicit none
integer, parameter :: n = 1000, m = 800, p = 600
real :: matrix_a(n, m), matrix_b(n, m), result_2d(n, m)
real :: tensor_3d(n, m, p), output_3d(n, m, p)
integer :: i, j, k
! Initialize arrays
do j = 1, m
do i = 1, n
matrix_a(i, j) = real(i + j) * 0.01
matrix_b(i, j) = real(i - j) * 0.02
end do
end do
do k = 1, p
do j = 1, m
do i = 1, n
tensor_3d(i, j, k) = real(i * j * k) * 0.001
end do
end do
end do
! Efficient data region for multiple operations
!$acc data copyin(matrix_a, matrix_b, tensor_3d) &
!$acc copyout(result_2d, output_3d)
! 2D matrix operations
!$acc parallel loop collapse(2)
do j = 1, m
do i = 1, n
result_2d(i, j) = matrix_a(i, j) + matrix_b(i, j) * 2.0
end do
end do
!$acc end parallel loop
! 3D tensor processing
!$acc parallel loop collapse(3)
do k = 1, p
do j = 1, m
do i = 1, n
output_3d(i, j, k) = tensor_3d(i, j, k) * result_2d(i, j) + &
sin(tensor_3d(i, j, k))
end do
end do
end do
!$acc end parallel loop
!$acc end data
write(*,*) 'Multi-dimensional processing complete'
end program
Memory Coalescing and Performance
Understanding Memory Coalescing
GPU Memory Coalescing with Column-Major Arrays:
Good Coalescing (Column-Major Friendly):
Thread 1: matrix(1, j) ← Sequential addresses
Thread 2: matrix(2, j) ← Next sequential address
Thread 3: matrix(3, j) ← Next sequential address
Thread 4: matrix(4, j) ← Next sequential address
Result: Single coalesced memory transaction ✓
Poor Coalescing (Row-Major Pattern):
Thread 1: matrix(i, 1) ← Address A
Thread 2: matrix(i, 2) ← Address A + stride
Thread 3: matrix(i, 3) ← Address A + 2*stride
Thread 4: matrix(i, 4) ← Address A + 3*stride
Result: Multiple scattered memory transactions ✗
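A rough model of this effect (an assumption for illustration, not a measurement of any particular GPU) counts how many 128-byte memory segments a warp of 32 threads touches under each pattern, with 4-byte reals in a column-major matrix(n, m):

```python
# Count distinct fixed-size memory segments touched by a set of
# element offsets (4-byte elements, 128-byte segments assumed)
def segments_touched(offsets, seg_bytes=128, elem_bytes=4):
    return len({(o * elem_bytes) // seg_bytes for o in offsets})

n = 1024               # leading dimension (arbitrary for the model)
threads = range(32)    # one warp

# Column-friendly: thread t reads matrix(t+1, j) -> consecutive offsets
coalesced = segments_touched([t for t in threads])
# Row-style: thread t reads matrix(i, t+1) -> offsets n elements apart
strided = segments_touched([t * n for t in threads])

print(coalesced, strided)  # 1 32
```

One transaction versus thirty-two for the same amount of useful data, which is the bandwidth gap the table below summarizes.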
Performance Comparison Table
┌───────────────────┬───────────────────┬────────────────┬──────────────────────┐
│ Access Pattern    │ Memory Layout Fit │ GPU Coalescing │ Relative Performance │
├───────────────────┼───────────────────┼────────────────┼──────────────────────┤
│ Column sections   │ Perfect           │ Excellent      │ 100% (baseline)      │
│ Contiguous blocks │ Good              │ Very Good      │ 95-98%               │
│ Row sections      │ Poor              │ Poor           │ 60-80%               │
│ Scattered access  │ Very Poor         │ Very Poor      │ 30-50%               │
└───────────────────┴───────────────────┴────────────────┴──────────────────────┘
Advanced Multi-dimensional Techniques
Technique 1: Tiled Processing
! Process matrix in tiles for better cache usage
! (it, jt are tile-start indices; a per-tile copyin is shown for
! clarity, but in practice wrap the tile loops in one !$acc data
! region so the matrix is transferred only once)
integer, parameter :: tile_size = 64
do jt = 1, m, tile_size
do it = 1, n, tile_size
!$acc parallel loop collapse(2) &
!$acc copyin(matrix(it:min(it+tile_size-1,n), jt:min(jt+tile_size-1,m)))
do j = jt, min(jt + tile_size - 1, m)
do i = it, min(it + tile_size - 1, n)
! Process tile
end do
end do
!$acc end parallel loop
end do
end do
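The tile-bound arithmetic above (`min(it+tile_size-1, n)`) clips edge tiles so they never run past the array bounds. A small sketch of that logic (names hypothetical):

```python
# Generate inclusive (start, end) tile bounds for indices 1..limit,
# clipping the final tile at the array bound
def tiles(limit, tile_size):
    return [(start, min(start + tile_size - 1, limit))
            for start in range(1, limit + 1, tile_size)]

print(tiles(10, 4))  # [(1, 4), (5, 8), (9, 10)] - last tile is clipped
print(tiles(8, 4))   # [(1, 4), (5, 8)]          - exact fit, no clipping
```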
Technique 2: Dimension-aware Processing
! Choose processing order based on array dimensions
if (nx > ny .and. nx > nz) then
! Process along x-dimension (first index) for best performance
!$acc parallel loop collapse(3)
do k = 1, nz
do j = 1, ny
do i = 1, nx ! Innermost on largest dimension
array(i, j, k) = computation()
end do
end do
end do
end if
Common Multi-dimensional Patterns
Pattern 1: Matrix-Matrix Operations
!$acc parallel loop collapse(2) copyin(A, B) copyout(C)
do j = 1, n
do i = 1, n
C(i, j) = A(i, j) + B(i, j)
end do
end do
Pattern 2: 3D Stencil Operations
! Read from field, write to field_new: an in-place update would race,
! with some threads reading neighbors that others have already changed
!$acc parallel loop collapse(3) copyin(field) copyout(field_new)
do k = 2, nz-1
do j = 2, ny-1
do i = 2, nx-1
field_new(i, j, k) = 0.8 * field(i, j, k) + 0.2 * ( &
field(i-1, j, k) + field(i+1, j, k) + &
field(i, j-1, k) + field(i, j+1, k) + &
field(i, j, k-1) + field(i, j, k+1)) / 6.0
end do
end do
end do
Pattern 3: Dimension Reduction
! Sum along first dimension (efficient)
!$acc parallel loop copyin(matrix) copyout(column_sums)
do j = 1, cols
column_sums(j) = 0.0
do i = 1, rows
column_sums(j) = column_sums(j) + matrix(i, j)
end do
end do
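The reason this reduction direction is cheap is visible in the flat storage: each column sum reads one contiguous run of `rows` elements. A Python sketch over a stand-in flat buffer (0-based, illustrative only):

```python
# Column-major flat buffer for the 3x4 example matrix [1..12]
rows, cols = 3, 4
flat = list(range(1, rows * cols + 1))

# Column j occupies the contiguous slice flat[j*rows : (j+1)*rows]
column_sums = [sum(flat[j * rows:(j + 1) * rows]) for j in range(cols)]
print(column_sums)  # [6, 15, 24, 33]
```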
Best Practices
✅ DO:
- Use column-major friendly loop ordering (i innermost)
- Leverage collapse() for multi-dimensional arrays
- Process contiguous memory sections when possible
- Consider array sections for partial processing
- Use data regions for multiple multi-dimensional operations
❌ DON’T:
- Use row-major loop ordering (i outermost)
- Ignore memory layout when designing algorithms
- Access non-contiguous sections unnecessarily
- Forget about cache and coalescing effects
- Mix different access patterns in the same kernel
Quick Reference
! Optimal 2D loop structure:
!$acc parallel loop collapse(2)
do j = 1, cols
do i = 1, rows ! Inner loop on first dimension
matrix(i, j) = computation()
end do
end do
! Optimal 3D loop structure:
!$acc parallel loop collapse(3)
do k = 1, nz
do j = 1, ny
do i = 1, nx ! Inner loop on first dimension
field(i, j, k) = computation()
end do
end do
end do
! Array sections:
matrix(start_row:end_row, start_col:end_col) ! Rectangular block
matrix(:, column_num) ! Entire column (fast)
matrix(row_num, :) ! Entire row (slower)
Quick Summary
Fortran Memory Layout (Column-Major):
┌─────────────────────────┬────────────────────────────────────────────────┐
│ Array Type │ Memory Storage Pattern │
├─────────────────────────┼────────────────────────────────────────────────┤
│ 2D: matrix(i,j) │ [col1: (1,1)(2,1)...(n,1)] [col2: (1,2)...] │
│ 3D: field(i,j,k) │ [slice1: columns] [slice2: columns] ... │
│ First index varies fast │ Contiguous elements: A(1,j), A(2,j), A(3,j) │
│ Last index varies slow │ Strided elements: A(i,1), A(i,2), A(i,3) │
└─────────────────────────┴────────────────────────────────────────────────┘
Optimal Loop Ordering for Performance:
┌─────────────────────────┬────────────────────────────────────────────────┐
│ Array Dimensions │ Optimal Loop Structure │
├─────────────────────────┼────────────────────────────────────────────────┤
│ 2D: array(i,j) │ do j = 1, cols │
│ │ do i = 1, rows ! Inner loop on 1st dim │
│ 3D: array(i,j,k) │ do k = 1, nz │
│ │ do j = 1, ny │
│ │ do i = 1, nx ! Inner loop on 1st dim │
│ 4D: array(i,j,k,l) │ do l = 1, nl │
│ │ do k = 1, nk │
│ │ do j = 1, nj │
│ │ do i = 1, ni ! Inner loop on 1st dim │
└─────────────────────────┴────────────────────────────────────────────────┘
OpenACC Collapse Directive Usage:
┌─────────────────────────┬────────────────────────────────────────────────┐
│ Array Type │ Recommended OpenACC Directive │
├─────────────────────────┼────────────────────────────────────────────────┤
│ 2D Arrays │ !$acc parallel loop collapse(2) │
│ 3D Arrays │ !$acc parallel loop collapse(3) │
│ 4D+ Arrays │ !$acc parallel loop collapse(3 or 4) │
│ Large outer dimensions │ Collapse outer loops for more parallelism │
│ Small outer dimensions │ May collapse fewer dimensions │
└─────────────────────────┴────────────────────────────────────────────────┘
Memory Access Performance Comparison:
┌─────────────────────────┬──────────────────┬──────────────────────────────┐
│ Access Pattern │ Memory Pattern │ Relative Performance │
├─────────────────────────┼──────────────────┼──────────────────────────────┤
│ Column access A(:,j) │ Sequential │ 100% (optimal) │
│ Contiguous blocks │ Sequential │ 95-98% │
│ Small strides │ Regular pattern │ 80-90% │
│ Row access A(i,:) │ Strided │ 60-80% │
│ Large strides │ Scattered │ 40-60% │
│ Random access │ Unpredictable │ 30-50% │
└─────────────────────────┴──────────────────┴──────────────────────────────┘
Array Section Efficiency:
┌─────────────────────────────────────────────┬──────────────────────────────┐
│ Array Section Type │ Efficiency Notes │
├─────────────────────────────────────────────┼──────────────────────────────┤
│ matrix(start_row:end_row, start_col:end_col) │ Rectangular block - Good │
│ matrix(:, column_number) │ Full column - Excellent │
│ matrix(row_number, :) │ Full row - Moderate │
│ matrix(1:n:2, 1:m:2) │ Strided - Poor │
│ field(:, :, slice) │ 2D slice - Very Good │
│ field(x, :, :) │ Cross-section - Moderate │
└─────────────────────────────────────────────┴──────────────────────────────┘
GPU Memory Coalescing Impact:
• Column-major access → Coalesced memory transactions → High bandwidth
• Row-major access → Non-coalesced transactions → Reduced bandwidth
• Contiguous access → Fewer memory transactions → Better performance
• Strided access → More memory transactions → Lower performance
Common Multi-dimensional Applications:
┌─────────────────────────────────────────────┬──────────────────────────────┐
│ Application Domain │ Typical Array Usage │
├─────────────────────────────────────────────┼──────────────────────────────┤
│ Image Processing │ 2D pixel arrays │
│ Scientific Simulation │ 2D/3D field grids │
│ Finite Element Analysis │ 2D mesh, 3D stress tensors │
│ Computational Fluid Dynamics │ 3D velocity/pressure fields │
│ Climate Modeling │ 3D atmospheric grids │
│ Machine Learning │ Multi-dimensional tensors │
│ Signal Processing │ 2D/3D frequency domains │
└─────────────────────────────────────────────┴──────────────────────────────┘
Best Practices Summary:
✓ Always use column-major friendly loop ordering (i innermost)
✓ Leverage collapse() directive for maximum parallelism
✓ Process contiguous memory sections when possible
✓ Consider cache effects and memory coalescing
✓ Use appropriate array sections for partial processing
✓ Group multi-dimensional operations in data regions
Performance Debugging Tips:
• Profile different loop orderings to measure impact
• Use compiler feedback to verify vectorization/acceleration
• Monitor memory bandwidth utilization
• Test with different array sizes to identify bottlenecks
Example Code
Let us consider the following OpenACC code –
program multidim_demo
! Demonstrates OpenACC with multi-dimensional Fortran arrays
! Focuses on column-major memory layout and performance implications
implicit none
integer, parameter :: n2d = 800, m2d = 600
integer, parameter :: n3d = 80, m3d = 60, p3d = 50
real :: matrix_a(n2d, m2d), matrix_b(n2d, m2d), matrix_result(n2d, m2d)
real :: field_3d(n3d, m3d, p3d), processed_3d(n3d, m3d, p3d)
real :: column_section(n2d), row_section(m2d)
real :: block_result(200, 150)
real :: test_matrix(400, 500), sum_result
integer :: i, j, k, test_i, test_j
real :: start_time, end_time
write(*,*) 'Multi-dimensional Arrays in OpenACC'
write(*,*) '=================================='
write(*,*) ''
write(*,*) 'Demonstration 1: 2D Matrix Operations'
write(*,*) '(Column-major friendly loop ordering)'
write(*,*) ''
! Initialize 2D matrices with column-major friendly loops
write(*,*) 'Initializing 2D matrices...'
do j = 1, m2d ! Outer loop on second dimension (columns)
do i = 1, n2d ! Inner loop on first dimension (rows) - EFFICIENT
matrix_a(i, j) = sin(real(i) * 0.01) + cos(real(j) * 0.01)
matrix_b(i, j) = real(i * j) * 0.0001
end do
end do
write(*,'(A,I0,A,I0,A)') '✓ Initialized ', n2d, '×', m2d, ' matrices'
write(*,'(A,F8.4)') 'Sample A value: ', matrix_a(400, 300)
write(*,'(A,F8.4)') 'Sample B value: ', matrix_b(400, 300)
! 2D matrix processing with OpenACC
write(*,*) ''
write(*,*) 'Processing 2D matrices on GPU...'
!$acc parallel loop collapse(2) copyin(matrix_a, matrix_b) copyout(matrix_result)
do j = 1, m2d
do i = 1, n2d
matrix_result(i, j) = matrix_a(i, j) * matrix_b(i, j) + &
sqrt(abs(matrix_a(i, j))) * 0.5
end do
end do
!$acc end parallel loop
write(*,*) '✓ 2D matrix processing complete'
write(*,'(A,F8.4)') 'Sample result: ', matrix_result(400, 300)
write(*,*) '✓ Used collapse(2) for optimal 2D parallelization'
write(*,*) ''
write(*,*) 'Demonstration 2: 3D Array Processing'
write(*,*) '(3D field with collapse(3) parallelization)'
write(*,*) ''
! Initialize 3D field with optimal loop ordering
write(*,'(A,I0,A,I0,A,I0,A)') 'Initializing 3D field: ', n3d, '×', m3d, '×', p3d
do k = 1, p3d ! Outermost loop on third dimension
do j = 1, m3d ! Middle loop on second dimension
do i = 1, n3d ! Inner loop on first dimension - OPTIMAL
field_3d(i, j, k) = real(i + j + k) * 0.01 + &
sin(real(i) * 0.05) * cos(real(j) * 0.05) * sin(real(k) * 0.05)
end do
end do
end do
write(*,*) '✓ 3D field initialized with column-major friendly loops'
write(*,'(A,F8.4)') 'Sample 3D value: ', field_3d(40, 30, 25)
! 3D processing with collapse(3)
write(*,*) ''
write(*,*) 'Processing 3D field on GPU...'
!$acc parallel loop collapse(3) copyin(field_3d) copyout(processed_3d)
do k = 1, p3d
do j = 1, m3d
do i = 1, n3d
processed_3d(i, j, k) = field_3d(i, j, k) * field_3d(i, j, k) + &
exp(field_3d(i, j, k) * 0.1) * 0.01
end do
end do
end do
!$acc end parallel loop
write(*,*) '✓ 3D field processing complete'
write(*,'(A,F8.4)') 'Sample 3D result: ', processed_3d(40, 30, 25)
write(*,*) '✓ Used collapse(3) for maximum 3D parallelization'
write(*,*) ''
write(*,*) 'Demonstration 3: Array Sections with Multi-dimensional Arrays'
write(*,*) '(Comparing different access patterns)'
write(*,*) ''
! Process a rectangular block (efficient)
write(*,*) 'Processing rectangular matrix block...'
!$acc parallel loop collapse(2) &
!$acc copyin(matrix_a(200:399, 150:299)) copyout(block_result)
do j = 1, 150
do i = 1, 200
block_result(i, j) = matrix_a(199 + i, 149 + j) * 3.0 + 1.0
end do
end do
!$acc end parallel loop
write(*,*) '✓ Block processing complete (contiguous memory)'
write(*,'(A,F8.4)') 'Block result sample: ', block_result(100, 75)
write(*,*) ''
! Process entire column (EFFICIENT - sequential memory access)
write(*,*) 'Processing entire column (efficient access)...'
call cpu_time(start_time)
!$acc parallel loop copyin(matrix_a(:, 400)) copyout(column_section)
do i = 1, n2d
column_section(i) = matrix_a(i, 400) + sqrt(abs(matrix_a(i, 400)))
end do
!$acc end parallel loop
call cpu_time(end_time)
write(*,'(A,F6.4,A)') '✓ Column processing time: ', end_time - start_time, ' seconds'
write(*,'(A,F8.4)') 'Column result sample: ', column_section(400)
write(*,*) '✓ Sequential memory access (cache-friendly)'
write(*,*) ''
! Process entire row (LESS EFFICIENT - strided memory access)
write(*,*) 'Processing entire row (less efficient access)...'
call cpu_time(start_time)
!$acc parallel loop copyin(matrix_a(500, :)) copyout(row_section)
do j = 1, m2d
row_section(j) = matrix_a(500, j) * matrix_a(500, j)
end do
!$acc end parallel loop
call cpu_time(end_time)
write(*,'(A,F6.4,A)') '✓ Row processing time: ', end_time - start_time, ' seconds'
write(*,'(A,F8.4)') 'Row result sample: ', row_section(300)
write(*,*) '⚠ Strided memory access (less cache-friendly)'
write(*,*) ''
write(*,*) 'Demonstration 4: Memory Access Pattern Comparison'
write(*,*) '(Column-major vs Row-major loop ordering)'
write(*,*) ''
! Demonstrate optimal vs suboptimal loop ordering
! Initialize test matrix
do test_j = 1, 500
do test_i = 1, 400
test_matrix(test_i, test_j) = real(test_i + test_j) * 0.01
end do
end do
! OPTIMAL: Column-major friendly (j outer, i inner)
write(*,*) 'Testing OPTIMAL loop ordering (column-major friendly)...'
call cpu_time(start_time)
sum_result = 0.0
!$acc parallel loop collapse(2) copyin(test_matrix) reduction(+:sum_result)
do test_j = 1, 500 ! Outer loop on columns
do test_i = 1, 400 ! Inner loop on rows (follows memory layout)
sum_result = sum_result + test_matrix(test_i, test_j)
end do
end do
!$acc end parallel loop
call cpu_time(end_time)
write(*,'(A,F6.4,A)') '✓ Optimal ordering time: ', end_time - start_time, ' seconds'
write(*,'(A,F10.2)') 'Sum result: ', sum_result
write(*,*) '✓ Follows Fortran column-major memory layout'
write(*,*) ''
! SUBOPTIMAL: Row-major style (i outer, j inner)
! (note: with collapse(2) the compiler remaps the combined iteration
! space itself, so on the GPU both orderings may time almost the same;
! loop ordering matters most for CPU code and non-collapsed loops)
write(*,*) 'Testing SUBOPTIMAL loop ordering (row-major style)...'
call cpu_time(start_time)
sum_result = 0.0
!$acc parallel loop collapse(2) copyin(test_matrix) reduction(+:sum_result)
do test_i = 1, 400 ! Outer loop on rows
do test_j = 1, 500 ! Inner loop on columns (against memory layout)
sum_result = sum_result + test_matrix(test_i, test_j)
end do
end do
!$acc end parallel loop
call cpu_time(end_time)
write(*,'(A,F6.4,A)') '⚠ Suboptimal ordering time: ', end_time - start_time, ' seconds'
write(*,'(A,F10.2)') 'Sum result: ', sum_result
write(*,*) '⚠ Against Fortran column-major memory layout'
write(*,*) ''
write(*,*) 'Multi-dimensional Array Performance Summary:'
write(*,*) '==========================================='
write(*,*) 'Memory Layout: Fortran uses Column-Major order'
write(*,*) '• Elements stored column by column in memory'
write(*,*) '• First index varies fastest in memory'
write(*,*) ''
write(*,*) 'Optimal Access Patterns:'
write(*,*) '• Inner loop on first dimension (i)'
write(*,*) '• Outer loops on higher dimensions (j, k)'
write(*,*) '• Column sections faster than row sections'
write(*,*) '• Contiguous blocks more efficient than scattered'
write(*,*) ''
write(*,*) 'OpenACC Best Practices:'
write(*,*) '• Use collapse(2) for 2D arrays'
write(*,*) '• Use collapse(3) for 3D arrays'
write(*,*) '• Follow column-major loop ordering'
write(*,*) '• Process contiguous sections when possible'
write(*,*) '• Consider memory coalescing on GPU'
write(*,*) ''
write(*,*) 'Performance Impact:'
write(*,*) '• Column-major friendly: 100% performance'
write(*,*) '• Row-major style: 60-80% performance'
write(*,*) '• Random access: 30-50% performance'
write(*,*) ''
write(*,*) 'Common Use Cases:'
write(*,*) '• Scientific simulations (2D/3D grids)'
write(*,*) '• Image processing (2D pixel arrays)'
write(*,*) '• Finite element methods (mesh arrays)'
write(*,*) '• Computational fluid dynamics (field arrays)'
write(*,*) '• Matrix computations (linear algebra)'
end program multidim_demo
To compile this code –
nvfortran -acc -Minfo=accel -O2 multidim_demo.f90 -o multidim_demo
To execute this code –
./multidim_demo
Sample output –
Multi-dimensional Arrays in OpenACC
==================================
Demonstration 1: 2D Matrix Operations
(Column-major friendly loop ordering)
Initializing 2D matrices...
✓ Initialized 800×600 matrices
Sample A value: -1.7468
Sample B value: 12.0000
Processing 2D matrices on GPU...
✓ 2D matrix processing complete
Sample result: -20.3007
✓ Used collapse(2) for optimal 2D parallelization
Demonstration 2: 3D Array Processing
(3D field with collapse(3) parallelization)
Initializing 3D field: 80×60×50
✓ 3D field initialized with column-major friendly loops
Sample 3D value: 1.0110
Processing 3D field on GPU...
✓ 3D field processing complete
Sample 3D result: 1.0333
✓ Used collapse(3) for maximum 3D parallelization
Demonstration 3: Array Sections with Multi-dimensional Arrays
(Comparing different access patterns)
Processing rectangular matrix block...
✓ Block processing complete (contiguous memory)
Block result sample: -0.4080
Processing entire column (efficient access)...
✓ Column processing time: 0.0000 seconds
Column result sample: -0.2228
✓ Sequential memory access (cache-friendly)
Processing entire row (less efficient access)...
✓ Row processing time: 0.0000 seconds
Row result sample: 3.7983
⚠ Strided memory access (less cache-friendly)
Demonstration 4: Memory Access Pattern Comparison
(Column-major vs Row-major loop ordering)
Testing OPTIMAL loop ordering (column-major friendly)...
✓ Optimal ordering time: 0.0004 seconds
Sum result: 902000.00
✓ Follows Fortran column-major memory layout
Testing SUBOPTIMAL loop ordering (row-major style)...
⚠ Suboptimal ordering time: 0.0003 seconds
Sum result: 902000.00
⚠ Against Fortran column-major memory layout
Multi-dimensional Array Performance Summary:
===========================================
Memory Layout: Fortran uses Column-Major order
• Elements stored column by column in memory
• First index varies fastest in memory
Optimal Access Patterns:
• Inner loop on first dimension (i)
• Outer loops on higher dimensions (j, k)
• Column sections faster than row sections
• Contiguous blocks more efficient than scattered
OpenACC Best Practices:
• Use collapse(2) for 2D arrays
• Use collapse(3) for 3D arrays
• Follow column-major loop ordering
• Process contiguous sections when possible
• Consider memory coalescing on GPU
Performance Impact:
• Column-major friendly: 100% performance
• Row-major style: 60-80% performance
• Random access: 30-50% performance
Common Use Cases:
• Scientific simulations (2D/3D grids)
• Image processing (2D pixel arrays)
• Finite element methods (mesh arrays)
• Computational fluid dynamics (field arrays)
• Matrix computations (linear algebra)