Vector addition (C[i] = A[i] + B[i]) is our first parallel CUDA program, integrating memory management, data transfer, kernel execution, and error handling. This complete example demonstrates the full CUDA workflow: allocate device memory with cudaMalloc(), copy data with cudaMemcpy(), launch the kernel, retrieve results, verify correctness, and free the allocated memory.

Core Concept
Vector addition assigns one output element to each GPU thread: thread i computes C[i] = A[i] + B[i], and all threads run in parallel.
The process has five steps:
- Allocate memory for the data on both the host (CPU) and the device (GPU).
- Initialize the input vectors on the host.
- Copy the inputs to the GPU (cudaMemcpy with cudaMemcpyHostToDevice).
- Launch a kernel on the GPU with a chosen number of blocks and threads per block.
- Copy the results back from the GPU (cudaMemcpy with cudaMemcpyDeviceToHost) and verify correctness.
This pattern (allocate memory, transfer data, compute, retrieve results) underlies virtually all GPU algorithms; a minimal CPU-vs-GPU comparison is sketched below.
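To make the mapping concrete, here is a minimal sketch (separate from the full program below) contrasting the serial CPU loop with the equivalent CUDA kernel; the function names are illustrative:
// Serial CPU version: a single loop walks over every element
void vectorAddCPU(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// CUDA version: the loop disappears; each thread handles one index i
__global__ void vectorAddGPU(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // guard: the grid may have extra threads
        c[i] = a[i] + b[i];
    }
}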
Key Points
- Memory Allocation: cudaMalloc() for device, malloc()/new for host
- Data Transfer: cudaMemcpy() with cudaMemcpyHostToDevice / cudaMemcpyDeviceToHost
- Kernel Launch: kernel<<<blocks, threads>>>() with the number of blocks and threads per block
- Thread Mapping: Global index i = blockIdx.x * blockDim.x + threadIdx.x (worked example after this list)
- Cleanup: cudaFree() and free() to prevent memory leaks
- Verification: Compare the GPU result against a CPU calculation
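For example, with 256 threads per block (the block size used in the program below), block 2 and thread 5 map to global element 517; the values here are purely illustrative:
// Worked index examples for blockDim.x = 256
// blockIdx.x = 0, threadIdx.x = 0    ->  i = 0 * 256 + 0   = 0
// blockIdx.x = 0, threadIdx.x = 255  ->  i = 0 * 256 + 255 = 255
// blockIdx.x = 2, threadIdx.x = 5    ->  i = 2 * 256 + 5   = 517
int i = blockIdx.x * blockDim.x + threadIdx.x;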
Code Example
Add two vectors of N elements in parallel
Complete CUDA Program:
#include <stdio.h>
#include <math.h>
#include <cuda_runtime.h>

#define N 1000000

// Abort with a message if a CUDA API call fails
#define CUDA_CHECK(call) { \
    cudaError_t err = call; \
    if (err != cudaSuccess) { \
        printf("CUDA error: %s\n", cudaGetErrorString(err)); \
        exit(1); \
    } \
}

// Each thread computes one element of the output vector
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    size_t bytes = N * sizeof(float);

    // Allocate host memory
    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);

    // Initialize input vectors
    for (int i = 0; i < N; i++) {
        h_a[i] = i * 1.0f;
        h_b[i] = i * 2.0f;
    }

    // Allocate device memory
    float *d_a, *d_b, *d_c;
    CUDA_CHECK(cudaMalloc(&d_a, bytes));
    CUDA_CHECK(cudaMalloc(&d_b, bytes));
    CUDA_CHECK(cudaMalloc(&d_c, bytes));

    // Copy inputs to device
    CUDA_CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice));

    // Launch kernel: enough blocks to cover all N elements
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    // Copy result back to host
    CUDA_CHECK(cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost));

    // Verify against the CPU result
    int ok = 1;
    for (int i = 0; i < N; i++) {
        if (fabs(h_c[i] - (h_a[i] + h_b[i])) > 1e-5) {
            printf("Error at %d\n", i);
            ok = 0;
            break;
        }
    }
    if (ok) printf("Success!\n");

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
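Assuming the program is saved as vector_add.cu (the filename is illustrative), it can be compiled with nvcc vector_add.cu -o vector_add and run with ./vector_add.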
Usage & Best Practices
Performance Considerations
- Block Size: Multiples of the warp size (32), typically 128/256/512 threads
- Grid Size: Ensure blocks * threads >= N for complete coverage
- Memory Transfer: Minimize cudaMemcpy() calls
- Asynchronous Operations: Use streams (will be covered later) for overlapping computation and transfer
Best Practices
- Always check bounds in the kernel (if (i < n))
- Verify results on small datasets first
- Use the CUDA_CHECK macro for all API calls
- Profile (will be covered later) with Nsight Systems or Nsight Compute
Common Mistakes
- Forgetting cudaDeviceSynchronize() before timing (see the timing sketch below)
- Memory leaks from missing cudaFree()/free()
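A minimal timing sketch using CUDA events, reusing the kernel and launch parameters from the program above (error checking omitted for brevity). Events are recorded on the GPU timeline, so cudaEventSynchronize() plays the role of the missing cudaDeviceSynchronize():
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);              // wait for the kernel and the stop event to complete

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("Kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);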
Key Takeaways
Summary:
- Vector addition demonstrates complete CUDA workflow
- Five-step pattern: allocate, initialize, transfer, compute, retrieve
- Each thread processes one array element independently
- Grid/block sizing ensures all elements are processed
- Error checking and verification are essential
- Memory management requires matching alloc/free pairs
- This pattern scales to complex algorithms
Quick Reference
Complete Workflow:
// 1. Allocate
float *d_data;
cudaMalloc(&d_data, bytes);
// 2. Transfer to device
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
// 3. Execute
kernel<<<blocks, threads>>>(d_data, n);
// 4. Transfer from device
cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
// 5. Free
cudaFree(d_data);
Grid/Block Calculation:
int threads = 256; // Power of 2, multiple of 32
int blocks = (N + threads - 1) / threads; // Ceiling division
Memory Copy Directions:
| Direction | Flag |
|---|---|
| Host → Device | cudaMemcpyHostToDevice |
| Device → Host | cudaMemcpyDeviceToHost |
| Device → Device | cudaMemcpyDeviceToDevice |
| Host → Host | cudaMemcpyHostToHost |
Verification Pattern:
bool correct = true;
for (int i = 0; i < N; i++) {
    float expected = h_a[i] + h_b[i];
    if (fabs(h_c[i] - expected) > 1e-5) {
        printf("Mismatch at index %d\n", i);
        correct = false;
        break;
    }
}
if (correct) printf("All results match\n");
Performance Metrics:
- Bandwidth: Compare actual vs. peak memory bandwidth (see the sketch after this list)
- Speedup: GPU time vs. CPU time
- Memory Efficiency: (Actual bandwidth / Peak bandwidth) × 100%
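As a concrete example of the bandwidth metric: vector addition reads A and B and writes C, so it moves roughly 3 * N * sizeof(float) bytes. A minimal sketch, assuming ms holds the kernel time in milliseconds from an event-based measurement like the one above:
// Effective-bandwidth estimate for vector addition
double seconds    = ms / 1000.0;
double bytesMoved = 3.0 * N * sizeof(float);    // read A, read B, write C
double gbPerSec   = bytesMoved / seconds / 1e9; // GB/s
printf("Effective bandwidth: %.1f GB/s\n", gbPerSec);
// Memory efficiency = (gbPerSec / peak bandwidth of the GPU) * 100%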