CUDA Programming Model

by Mandar Gurav

The CUDA programming model splits work between two parts: the CPU (host) and the GPU (device). The CPU controls what happens in the program and sends tasks called kernels to the GPU for processing. To write good CUDA programs, you need to understand how these two parts work together and how tasks are organized into groups called grids and blocks. This model helps match the parallel tasks of your problem with the hardware design of the GPU.

The following diagram provides a visual representation of the host-device execution model.

Figure: CUDA host-device execution model

Core Concept

CUDA programs run in two places: the host and the device.

  • Host: This is your computer’s CPU. It runs the sequential parts of the program, doing tasks one after another.
  • Device: This is the GPU. It runs many threads at the same time, which is why it is called massively parallel.

The host does three main jobs:

  1. It sets aside (allocates) memory.
  2. It moves data around.
  3. It starts (launches) kernels on the device (see the sketch after this list).
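
A minimal host-side sketch of these three steps might look like the following (the incrementKernel name and the tiny 8-element array are illustrative, not part of this tutorial's main example):

#include <stdio.h>

// Illustrative kernel: each thread increments its own element of the array
__global__ void incrementKernel(int *data) {
    data[threadIdx.x] += 1;
}

int main() {
    const int N = 8;
    int h_data[N] = {0, 1, 2, 3, 4, 5, 6, 7};

    // 1. Allocate memory on the device
    int *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(int));

    // 2. Move data from host to device
    cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);

    // 3. Launch the kernel on the device: 1 block of N threads
    incrementKernel<<<1, N>>>(d_data);

    // Data movement again: copy the results back to the host
    cudaMemcpy(h_data, d_data, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++) printf("%d ", h_data[i]);
    printf("\n");

    cudaFree(d_data);
    return 0;
}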

Each time you launch a kernel, it creates a grid of thread blocks. Each block contains many threads. These threads all run the same code but work on different data. This execution style is called SIMT (Single Instruction, Multiple Thread).

Key Points:

  • Host (CPU): Controls program flow, manages memory, launches kernels
  • Device (GPU): Executes parallel kernels with thousands of threads
  • Kernel: Function marked with __global__ that runs on device, callable from host
  • Thread Hierarchy: Grid → Blocks → Threads (3-level organization)
  • SIMT Execution: Single Instruction, Multiple Thread; all threads run the same kernel code, each on its own data

Code Example

Let’s look at a very simple kernel demonstrating host-device separation and thread hierarchy.

#include <stdio.h>

// Device code: kernel function
__global__ void helloFromGPU() {
    int blockId = blockIdx.x;
    int threadId = threadIdx.x;

    printf("Hello from block %d, thread %d\n", blockId, threadId);
}

// Host code: main function
int main() {
    printf("Host: Launching kernel...\n");

    // Kernel launch configuration
    int numBlocks = 2;
    int threadsPerBlock = 4;

    // Launch kernel: <<<numBlocks, threadsPerBlock>>>
    helloFromGPU<<<numBlocks, threadsPerBlock>>>();

    // Wait for GPU to finish
    cudaDeviceSynchronize();

    printf("Host: Kernel execution complete\n");

    return 0;
}

Code Highlights:

  • __global__ qualifier marks function as kernel (callable from host, runs on device)
  • Built-in variables blockIdx and threadIdx identify each thread uniquely
  • Triple angle brackets <<<...>>> launch kernel with specified configuration
  • cudaDeviceSynchronize() blocks host until all GPU threads complete

Example Output:

Host: Launching kernel...
Hello from block 0, thread 0
Hello from block 0, thread 1
Hello from block 0, thread 2
Hello from block 0, thread 3
Hello from block 1, thread 0
Hello from block 1, thread 1
Hello from block 1, thread 2
Hello from block 1, thread 3
Host: Kernel execution complete

Usage & Best Practices

When to Use This Model:

  • Data-parallel problems: apply same operation to many elements
  • Independent computations: minimal dependencies between threads
  • Regular memory access patterns: arrays, matrices, grids

Best Practices:

  • Choose a block size that is a multiple of 32 (the warp size) for efficiency
  • Typical block sizes: 128, 256, or 512 threads
  • Calculate the grid size so it covers the entire dataset: gridSize = (N + blockSize - 1) / blockSize (worked example below)
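
For example, with N = 1,000,000 elements and blockSize = 256, gridSize = (1,000,000 + 255) / 256 = 3,907 blocks. That launch creates 3,907 × 256 = 1,000,192 threads, so the last block has 192 threads with no element to work on; a bounds check such as if (i < N) inside the kernel keeps those extra threads from accessing memory out of range.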

Common Mistakes to Avoid:

  • Avoid: Launching kernels without checking for errors (see the error-checking sketch after this list)
  • Avoid: Forgetting cudaDeviceSynchronize() when kernel output is needed immediately
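
A minimal sketch of the error-checking pattern, using the runtime API's cudaGetLastError() and cudaGetErrorString(). The deliberately oversized 2048-thread block assumes the usual 1024-threads-per-block limit, so the launch fails and the error is reported:

#include <stdio.h>

__global__ void emptyKernel() { }

int main() {
    // Request more threads per block than the device allows, so the launch fails
    emptyKernel<<<1, 2048>>>();

    // Catch launch-configuration errors immediately after the launch...
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("Launch error: %s\n", cudaGetErrorString(err));

    // ...and catch errors that happen while the kernel executes
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("Execution error: %s\n", cudaGetErrorString(err));

    return 0;
}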

Summary

  • Host (CPU) and device (GPU) execute different code sections
  • Kernels (__global__ functions) run massively parallel on GPU
  • Threads organized hierarchically: Grid → Blocks → Threads
  • Launch kernels with <<<numBlocks, threadsPerBlock>>> syntax
  • Built-in variables identify threads: blockIdx, threadIdx, blockDim, gridDim

Next Steps: Tutorial 03 – Writing your first complete CUDA kernel with computation

Complete Code: See examples/ folder for working implementation

Quick Reference

Kernel Qualifiers:

  • __global__: Runs on device, called from host
  • __device__: Runs on device, called from device
  • __host__: Runs on host (default for regular functions); a short sketch using all three follows this list
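
A quick sketch of how the three qualifiers fit together (the square, squareKernel, and hostSquare names are made up for illustration):

#include <stdio.h>

// __device__: runs on the device, callable only from device code
__device__ float square(float x) {
    return x * x;
}

// __global__: runs on the device, launched from the host
__global__ void squareKernel(float value) {
    printf("Device: square(%f) = %f\n", value, square(value));
}

// Regular function: implicitly __host__, runs on the CPU
float hostSquare(float x) {
    return x * x;
}

int main() {
    printf("Host: square(3) = %f\n", hostSquare(3.0f));
    squareKernel<<<1, 1>>>(3.0f);
    cudaDeviceSynchronize();
    return 0;
}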

Compilation and execution:

nvcc -o programming_model programming_model.cu
./programming_model

Built-in Variables:

  • gridDim: Dimensions of grid (in blocks)
  • blockDim: Dimensions of block (in threads)
  • blockIdx: Block index within grid
  • threadIdx: Thread index within block (see the indexing sketch below)
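
The most common way these variables are combined is to compute one unique global index per thread. A minimal sketch (the vecAddOne kernel is illustrative):

#include <stdio.h>

// Each thread combines blockIdx, blockDim, and threadIdx into a global index
__global__ void vecAddOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
    if (i < n)                // guard: the last block may have extra threads
        data[i] += 1.0f;
}

int main() {
    const int N = 1000;
    float h_data[N];
    for (int i = 0; i < N; i++) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    int blockSize = 256;
    int gridSize = (N + blockSize - 1) / blockSize;   // 4 blocks for N = 1000
    vecAddOne<<<gridSize, blockSize>>>(d_data, N);

    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h_data[999] = %f\n", h_data[999]);        // expect 1000.000000
    cudaFree(d_data);
    return 0;
}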

References:

  1. CUDA Official Documentation
