CUDA: Thread Indexing and IDs

by Mandar Gurav

Thread indexing is how each parallel thread determines which data element to process. Computing a unique global thread ID from threadIdx, blockIdx, and blockDim enables thousands of threads to safely access different array elements without conflicts. This mapping from threads to data is fundamental to virtually every GPU workload, from simple element-wise operations such as doubling every value in an array to more complex tasks such as large matrix computations.

Refer to the following diagram for a visual guide to the global index calculation:

[Figure: cuda-thread-indexing-ids]

Core Concept

CUDA organizes threads in a hierarchy: a grid contains blocks, and each block contains threads. Data arrays, however, are flat or multi-dimensional, so each thread needs to know which element it should work on. The formula blockIdx.x * blockDim.x + threadIdx.x computes a unique global index for each thread.

Here’s how it works:

  • blockIdx.x tells us which block we are in.
  • blockDim.x tells us how many threads are in one block.
  • threadIdx.x tells us the position of the thread within its block.

So, multiplying the block index by the number of threads per block gives the total number of threads in all preceding blocks; adding the thread's own position within its block then yields a unique global index. This tells each thread exactly which element of the array it should process.
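
For instance, with a hypothetical launch of 256 threads per block, the thread with threadIdx.x = 10 inside block 2 computes:

int idx = 2 * 256 + 10;  // blockIdx.x * blockDim.x + threadIdx.x = 522
// This thread therefore processes data[522].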

Key Points

  • Global Index: Unique identifier mapping thread to data element
  • 1D Formula: idx = blockIdx.x * blockDim.x + threadIdx.x
  • 2D Formula: x = blockIdx.x * blockDim.x + threadIdx.x, y = blockIdx.y * blockDim.y + threadIdx.y
  • Bounds Checking: Always verify idx < N to prevent out-of-bounds access
  • Row-Major Order: For 2D arrays, compute idx = y * width + x

Code Example

Kernels demonstrating 1D and 2D thread indexing for array access:

CUDA Implementation:

// 1D indexing - processing an array
__global__ void vectorDouble(float *data, int N) {
    // Calculate global thread index
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Bounds check - crucial for safety
    if (idx < N) {
        data[idx] = data[idx] * 2.0f;
    }
}

// 2D indexing - processing a matrix
__global__ void matrixAdd(float *A, float *B, float *C, int width, int height) {
    // Calculate 2D indices
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // Bounds check for both dimensions
    if (col < width && row < height) {
        // Convert 2D index to 1D array index
        int idx = row * width + col;
        C[idx] = A[idx] + B[idx];
    }
}

Launch Configuration:

// For 1D: N=10000 elements
int threadsPerBlock = 256;
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
vectorDouble<<<numBlocks, threadsPerBlock>>>(d_data, N);

// For 2D: 1024x768 matrix
dim3 block(16, 16);  // 256 threads per block
dim3 grid((width + 15) / 16, (height + 15) / 16);
matrixAdd<<<grid, block>>>(d_A, d_B, d_C, width, height);
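
For completeness, here is a minimal host-side sketch (hypothetical sizes, error handling omitted for brevity) showing how the 1D launch above fits into a full allocate / copy / launch / copy-back sequence:

#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    const int N = 10000;
    size_t bytes = N * sizeof(float);

    // Allocate and initialize host data
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < N; i++) h_data[i] = (float)i;

    // Allocate device memory and copy input to the GPU
    float *d_data;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Launch enough blocks to cover all N elements
    int threadsPerBlock = 256;
    int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorDouble<<<numBlocks, threadsPerBlock>>>(d_data, N);

    // Copy the result back to the host
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    printf("data[522] = %f\n", h_data[522]);  // expect 1044.0

    cudaFree(d_data);
    free(h_data);
    return 0;
}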

Usage & Best Practices

When to Use Each Pattern

  • 1D indexing: Vectors, lists, simple array operations
  • 2D indexing: Matrices, images, grids
  • 3D indexing: Volumes, tensors, 3D simulations

Best Practices

  • Always include bounds checking (if (idx < N))
  • Calculate grid size to cover all elements: (N + blockSize - 1) / blockSize
  • Use row-major indexing for memory coalescing (will be covered later): row * width + col

Common Mistakes

  • Avoid: Forgetting bounds check (causes illegal memory access)
  • Avoid: Using column-major indexing (poor performance)
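
To illustrate the second point, here is a small sketch (using the col/row indices computed as in the matrixAdd kernel above) contrasting the two index orders. With row-major indexing, consecutive threads in a warp (consecutive col values) touch consecutive addresses, which the hardware can coalesce into fewer memory transactions:

// Indices computed as in matrixAdd above
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;

int coalesced = row * width + col;   // row-major: neighbouring threads access neighbouring addresses
int strided   = col * height + row;  // column-major: neighbouring threads are `height` elements apart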

Key Takeaways

Summary:

  • Global thread index maps threads to unique data elements
  • 1D formula: blockIdx.x * blockDim.x + threadIdx.x
  • 2D formula: compute x and y separately, then combine
  • Always bounds check: if (idx < N) prevents errors
  • Calculate grid size to launch enough threads for all data
  • Row-major indexing for efficient memory access

Quick Reference

1D Global Index:

int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
    // Safe to access data[idx]
}

2D Global Index:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x < width && y < height) {
    int idx = y * width + x;  // Row-major
    // Safe to access data[idx]
}

3D Global Index:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int z = blockIdx.z * blockDim.z + threadIdx.z;
if (x < width && y < height && z < depth) {
    int idx = z * (width * height) + y * width + x;
}
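
A matching 3D launch configuration might look like the following sketch (volumeKernel and the dimensions are hypothetical; an 8x8x8 block gives 512 threads):

dim3 block(8, 8, 8);  // 512 threads per block
dim3 grid((width + 7) / 8, (height + 7) / 8, (depth + 7) / 8);
volumeKernel<<<grid, block>>>(d_volume, width, height, depth);  // hypothetical 3D kernel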

Grid Size Calculation:

int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
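
Plugging in the 1D example values from above (N = 10000, threadsPerBlock = 256):

int numBlocks = (10000 + 256 - 1) / 256;  // = 40 blocks -> 40 * 256 = 10240 threads launched
// The 240 surplus threads fail the idx < N check and simply do nothing.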
