CUDA Programming Model

by Mandar Gurav

The CUDA programming model splits work between two parts: the CPU (host) and the GPU (device). The CPU controls what happens in the program and sends tasks called kernels to the GPU for processing. To write good CUDA programs, you need to understand how these two parts work together and how tasks are organized into groups called grids and blocks. This model helps match the parallel tasks of your problem with the hardware design of the GPU.

The following diagram provides a visual representation of the host-device execution model.

Figure: CUDA host-device execution model

Core Concept

CUDA programs run in two places: the host and the device.

  • Host: This is your computer’s CPU. It runs the sequential parts of the program, doing tasks one after another.
  • Device: This is the GPU. It runs many threads at the same time, which is why it is called massively parallel.

The host does three main jobs:

  1. It sets aside (allocates) memory.
  2. It moves data around.
  3. It starts (launches) kernels on the device (see the sketch after this list).
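
A minimal host-side sketch of these three steps might look like the following (the incrementKernel name and the tiny 8-element array are illustrative, not part of this tutorial's main example):

#include <stdio.h>

// Illustrative kernel: each thread increments its own element of the array
__global__ void incrementKernel(int *data) {
    data[threadIdx.x] += 1;
}

int main() {
    const int N = 8;
    int h_data[N] = {0, 1, 2, 3, 4, 5, 6, 7};

    // 1. Allocate memory on the device
    int *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(int));

    // 2. Move data from host to device
    cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);

    // 3. Launch the kernel on the device: 1 block of N threads
    incrementKernel<<<1, N>>>(d_data);

    // Data movement again: copy the results back to the host
    cudaMemcpy(h_data, d_data, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++) printf("%d ", h_data[i]);
    printf("\n");

    cudaFree(d_data);
    return 0;
}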

Each time you launch a kernel, it creates a grid of thread blocks. Each block contains many threads. These threads all run the same code but work on different data. This execution style is called SIMT (Single Instruction, Multiple Thread).

Key Points:

  • Host (CPU): Controls program flow, manages memory, launches kernels
  • Device (GPU): Executes parallel kernels with thousands of threads
  • Kernel: Function marked with __global__ that runs on device, callable from host
  • Thread Hierarchy: Grid → Blocks → Threads (3-level organization)
  • SIMT Execution: Single Instruction, Multiple Thread; all threads run the same kernel code, each on its own data

Code Example

Let’s look at a very simple kernel demonstrating host-device separation and thread hierarchy.

#include <stdio.h>

// Device code: kernel function
__global__ void helloFromGPU() {
    int blockId = blockIdx.x;
    int threadId = threadIdx.x;

    printf("Hello from block %d, thread %d\n", blockId, threadId);
}

// Host code: main function
int main() {
    printf("Host: Launching kernel...\n");

    // Kernel launch configuration
    int numBlocks = 2;
    int threadsPerBlock = 4;

    // Launch kernel: <<<numBlocks, threadsPerBlock>>>
    helloFromGPU<<<numBlocks, threadsPerBlock>>>();

    // Wait for GPU to finish
    cudaDeviceSynchronize();

    printf("Host: Kernel execution complete\n");

    return 0;
}

Code Highlights:

  • __global__ qualifier marks function as kernel (callable from host, runs on device)
  • Built-in variables blockIdx and threadIdx identify each thread uniquely
  • Triple angle brackets <<<...>>> launch kernel with specified configuration
  • cudaDeviceSynchronize() blocks host until all GPU threads complete

Example Output:

Host: Launching kernel...
Hello from block 0, thread 0
Hello from block 0, thread 1
Hello from block 0, thread 2
Hello from block 0, thread 3
Hello from block 1, thread 0
Hello from block 1, thread 1
Hello from block 1, thread 2
Hello from block 1, thread 3
Host: Kernel execution complete

Usage & Best Practices

When to Use This Model:

  • Data-parallel problems: apply same operation to many elements
  • Independent computations: minimal dependencies between threads
  • Regular memory access patterns: arrays, matrices, grids

Best Practices:

  • Choose a block size that is a multiple of 32 (the warp size) for efficiency
  • Typical block sizes: 128, 256, or 512 threads
  • Calculate the grid size so it covers the entire dataset: gridSize = (N + blockSize - 1) / blockSize (worked example below)
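
For example, with N = 1,000,000 elements and blockSize = 256, gridSize = (1,000,000 + 255) / 256 = 3,907 blocks. That launch creates 3,907 × 256 = 1,000,192 threads, so the last block has 192 threads with no element to work on; a bounds check such as if (i < N) inside the kernel keeps those extra threads from accessing memory out of range.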

Common Mistakes to Avoid:

  • Avoid: Launching kernels without checking for errors (see the error-checking sketch after this list)
  • Avoid: Forgetting cudaDeviceSynchronize() when kernel output is needed immediately
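
A minimal sketch of the error-checking pattern, using the runtime API's cudaGetLastError() and cudaGetErrorString(). The deliberately oversized 2048-thread block assumes the usual 1024-threads-per-block limit, so the launch fails and the error is reported:

#include <stdio.h>

__global__ void emptyKernel() { }

int main() {
    // Request more threads per block than the device allows, so the launch fails
    emptyKernel<<<1, 2048>>>();

    // Catch launch-configuration errors immediately after the launch...
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("Launch error: %s\n", cudaGetErrorString(err));

    // ...and catch errors that happen while the kernel executes
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("Execution error: %s\n", cudaGetErrorString(err));

    return 0;
}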

Summary

  • Host (CPU) and device (GPU) execute different code sections
  • Kernels (__global__ functions) run massively parallel on GPU
  • Threads organized hierarchically: Grid → Blocks → Threads
  • Launch kernels with <<<numBlocks, threadsPerBlock>>> syntax
  • Built-in variables identify threads: blockIdx, threadIdx, blockDim, gridDim

Next Steps: Tutorial 03 – Writing your first complete CUDA kernel with computation

Complete Code: See examples/ folder for working implementation

Quick Reference

Kernel Qualifiers:

  • __global__: Runs on device, called from host
  • __device__: Runs on device, called from device
  • __host__: Runs on host (default for regular functions); a short sketch using all three follows this list
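
A quick sketch of how the three qualifiers fit together (the square, squareKernel, and hostSquare names are made up for illustration):

#include <stdio.h>

// __device__: runs on the device, callable only from device code
__device__ float square(float x) {
    return x * x;
}

// __global__: runs on the device, launched from the host
__global__ void squareKernel(float value) {
    printf("Device: square(%f) = %f\n", value, square(value));
}

// Regular function: implicitly __host__, runs on the CPU
float hostSquare(float x) {
    return x * x;
}

int main() {
    printf("Host: square(3) = %f\n", hostSquare(3.0f));
    squareKernel<<<1, 1>>>(3.0f);
    cudaDeviceSynchronize();
    return 0;
}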

Compilation and execution:

nvcc -o programming_model programming_model.cu
./programming_model

Built-in Variables:

  • gridDim: Dimensions of grid (in blocks)
  • blockDim: Dimensions of block (in threads)
  • blockIdx: Block index within grid
  • threadIdx: Thread index within block (see the indexing sketch below)
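
The most common way these variables are combined is to compute one unique global index per thread. A minimal sketch (the vecAddOne kernel is illustrative):

#include <stdio.h>

// Each thread combines blockIdx, blockDim, and threadIdx into a global index
__global__ void vecAddOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
    if (i < n)                // guard: the last block may have extra threads
        data[i] += 1.0f;
}

int main() {
    const int N = 1000;
    float h_data[N];
    for (int i = 0; i < N; i++) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    int blockSize = 256;
    int gridSize = (N + blockSize - 1) / blockSize;   // 4 blocks for N = 1000
    vecAddOne<<<gridSize, blockSize>>>(d_data, N);

    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h_data[999] = %f\n", h_data[999]);        // expect 1000.000000
    cudaFree(d_data);
    return 0;
}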

References:

  1. CUDA Official Documentation
