CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform for general-purpose computing on GPUs. GPUs were originally designed for graphics, but CUDA exposes their thousands of cores to other workloads, such as scientific simulations and data analysis. For problems with abundant parallelism, CUDA code running on a GPU can be 10 to 100 times faster than the equivalent code on a CPU alone.
Refer to the following diagram for a comparison of GPU and CPU architectures.

Core Concept
CPUs are optimized for sequential work: they have 4 to 32 powerful cores, each capable of handling complex control flow. GPUs take the opposite approach, packing thousands of simpler cores that execute many operations simultaneously. CUDA makes this hardware accessible by letting programmers write GPU code in C or C++ with a small set of language extensions.
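To make this concrete, below is a minimal sketch of a complete CUDA program that adds two vectors on the GPU. The kernel name, array size, and launch configuration are illustrative choices, not requirements:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Each thread computes one element of the output array.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against extra threads
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;                  // 1M elements
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %.1f\n", h_c[0]);        // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

Each of the roughly one million additions is handled by its own thread; the <<<blocks, threads>>> launch syntax is the CUDA extension that maps work onto the GPU.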
Key Points
- Massive Parallelism: Modern GPUs contain thousands of CUDA cores (e.g., RTX 5090 has 21760 cores)
- SIMT Architecture: Single Instruction, Multiple Threads—ideal for data-parallel workloads
- Memory Bandwidth: GPUs provide roughly 10-20x higher memory bandwidth than CPUs, around 1 TB/s or more on high-end cards (a sketch for estimating this follows the list)
- Specialized Hardware: Tensor cores for AI, ray tracing cores for graphics
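As a rough illustration of the bandwidth point, the theoretical peak can be estimated from device attributes. This sketch assumes double-data-rate memory (the factor of 2), which holds for GDDR; the exact multiplier varies by memory type:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    int clockKHz = 0, busWidthBits = 0;
    // Query raw attributes for device 0: memory clock in kHz,
    // memory bus width in bits.
    cudaDeviceGetAttribute(&clockKHz, cudaDevAttrMemoryClockRate, 0);
    cudaDeviceGetAttribute(&busWidthBits, cudaDevAttrGlobalMemoryBusWidth, 0);

    // Peak bandwidth = 2 (DDR) * clock * bus width in bytes.
    double gbPerSec = 2.0 * clockKHz * 1e3 * (busWidthBits / 8.0) / 1e9;
    printf("Theoretical peak memory bandwidth: ~%.0f GB/s\n", gbPerSec);
    return 0;
}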
Code Example
Let's look at a short program that checks CUDA availability and prints basic device information.
CUDA Implementation:
#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaError_t error = cudaGetDeviceCount(&deviceCount);
    if (error != cudaSuccess) {
        printf("CUDA Error: %s\n", cudaGetErrorString(error));
        return 1;
    }
    printf("Found %d CUDA device(s)\n", deviceCount);

    // Print the key properties of each device.
    for (int dev = 0; dev < deviceCount; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("\nDevice %d: %s\n", dev, prop.name);
        printf("  Compute Capability: %d.%d\n", prop.major, prop.minor);
        printf("  Total Global Memory: %.2f GB\n", prop.totalGlobalMem / 1e9);
    }
    return 0;
}
Code Highlights:
- cudaGetDeviceCount() queries the number of CUDA-capable GPUs in the system (a reusable error-checking idiom for such calls is sketched below)
- cudaGetDeviceProperties() retrieves detailed GPU specifications
- Compute Capability indicates the GPU generation (e.g., 7.5 for Turing, 8.6 for Ampere)
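The listing above checks only the first call by hand. In practice, a common idiom (sketched here; the macro is a convention, not part of the CUDA API) wraps every runtime call in an error check:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Abort with file/line context if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",           \
                    __FILE__, __LINE__, cudaGetErrorString(err));  \
            exit(1);                                               \
        }                                                          \
    } while (0)

int main()
{
    int deviceCount = 0;
    CUDA_CHECK(cudaGetDeviceCount(&deviceCount));  // any failure aborts here
    printf("Found %d CUDA device(s)\n", deviceCount);
    return 0;
}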
Compilation Command:
nvcc cuda_devices.cu -o cuda_devices
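If you omit the -o flag, nvcc produces a default executable named a.out. Recent CUDA toolkits also accept -arch=native to compile for the GPU installed in the build machine, which becomes useful once you start writing kernels:

nvcc -arch=native cuda_devices.cu -o cuda_devices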
Example Output:
Found 1 CUDA device(s)

Device 0: NVIDIA GeForce RTX 5060 Ti
  Compute Capability: 12.0
  Total Global Memory: 16.60 GB
Usage & Best Practices
When to Use CUDA?
- Data-parallel problems: processing arrays, matrices, images
- Computationally intensive simulations: physics, molecular dynamics, CFD
- Machine learning: training neural networks, inference
- Signal/image processing: filtering, transformations, FFT
Best Practices:
- Start by identifying the parallel portions of your code
- Use CUDA for computationally intensive tasks, not I/O-bound operations
- Consider existing libraries (cuBLAS, cuFFT) before writing custom kernels; a small cuBLAS sketch follows this list
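As an example of the library-first approach, here is a small sketch that computes y = 2x + y with cuBLAS's SAXPY routine instead of a hand-written kernel (the array size and values are arbitrary):

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int n = 1024;
    float alpha = 2.0f;
    float h_x[1024], h_y[1024];
    for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 3.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // y = 2*x + y
    cublasDestroy(handle);

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %.1f\n", h_y[0]);                 // expect 5.0

    cudaFree(d_x); cudaFree(d_y);
    return 0;
}

Compile with nvcc saxpy_cublas.cu -lcublas (the file name is arbitrary). The library call replaces the kernel, launch configuration, and index arithmetic you would otherwise write and tune yourself.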
Common Mistakes to Avoid:
- Using the GPU for small datasets, where memory transfer overhead dominates any compute savings (a timing sketch follows)
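To see this overhead directly, you can time a deliberately tiny host-to-device copy with CUDA events. The 4 KB payload below is an arbitrary choice, and absolute numbers will vary by system:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 4096;           // deliberately tiny payload
    char h_buf[4096] = {0};
    char *d_buf;
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // For tiny copies the fixed per-transfer latency dominates, so the
    // effective bandwidth is far below the bus peak.
    printf("Copied %zu bytes in %.3f ms\n", bytes, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    return 0;
}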
Summary:
- CUDA enables general-purpose computing on NVIDIA GPUs with C/C++ extensions
- GPUs excel at data-parallel tasks with thousands of simultaneous threads
- Memory bandwidth and core count are key GPU advantages
- CUDA toolkit provides compilers, libraries, and debugging tools
- Check GPU availability and properties before developing applications