CUDA: Introduction

by Mandar Gurav

CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform that allows scientists and engineers to use GPUs for general-purpose computing. GPUs were originally built for graphics, but CUDA lets them take on other kinds of work as well. With CUDA, the thousands of cores in a GPU can be applied to tasks such as scientific simulations and data analysis. For problems that involve many calculations happening at the same time, CUDA code running on a GPU can be 10 to 100 times faster than the same work on a regular CPU.

Refer to the following diagram for a comparison of GPU and CPU architectures.

GPU vs CPU architecture

Core Concept

CPUs are good at doing tasks one after another. They have 4 to 32 powerful cores that can handle complicated jobs. GPUs work differently: they have thousands of simple cores that can do many things at the same time. CUDA makes GPUs accessible to programmers by letting them write this parallel code in C or C++.
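
As a minimal sketch of what that looks like (illustrative, assuming the CUDA toolkit and nvcc are installed): a kernel is an ordinary C function marked __global__, and the <<<blocks, threads>>> syntax launches it on the GPU.

#include <stdio.h>

// Kernel: this function runs on the GPU; each launched thread executes it once
__global__ void hello()
{
    printf("Hello from GPU thread %d\n", (int)threadIdx.x);
}

int main()
{
    hello<<<1, 4>>>();        // launch 1 block of 4 threads
    cudaDeviceSynchronize();  // wait for the GPU to finish before exiting
    return 0;
}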

Key Points

  • Massive Parallelism: Modern GPUs contain thousands of CUDA cores (e.g., RTX 5090 has 21760 cores)
  • SIMT Architecture: Single Instruction, Multiple Threads, which is ideal for data-parallel workloads (see the sketch after this list)
  • Memory Bandwidth: GPUs provide roughly 10-20x higher memory bandwidth than CPUs (1 TB/s or more on high-end cards)
  • Specialized Hardware: Tensor cores for AI, ray tracing cores for graphics
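
To make the SIMT point concrete, here is a hedged vector-addition sketch (not from the original article; data initialization and error checks are omitted for brevity): each of the many launched threads computes one array element, identified by its block and thread IDs.

#include <stdio.h>
#include <cuda_runtime.h>

// Each thread adds one pair of elements: the essence of SIMT execution
__global__ void vectorAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global index
    if (i < n)                                      // guard threads past the end
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Allocate device arrays (host-side setup omitted for brevity)
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // One thread per element: enough 256-thread blocks to cover all n elements
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}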

Code Example

Let’s look at a short program that checks CUDA availability and prints basic information about each device.

CUDA Implementation:

#include <stdio.h>
#include <cuda_runtime.h>  // CUDA runtime API; nvcc includes this implicitly for .cu files

int main() 
{
    int deviceCount = 0;
    // Ask the runtime how many CUDA-capable GPUs are visible
    cudaError_t error = cudaGetDeviceCount(&deviceCount);

    if (error != cudaSuccess) {
        printf("CUDA Error: %s\n", cudaGetErrorString(error));
        return 1;
    }

    printf("Found %d CUDA device(s)\n", deviceCount);

    for (int dev = 0; dev < deviceCount; dev++) {
        cudaDeviceProp prop;
        // Retrieve this device's name, memory size, and compute capability
        cudaGetDeviceProperties(&prop, dev);

        printf("\nDevice %d: %s\n", dev, prop.name);
        printf("  Compute Capability: %d.%d\n", prop.major, prop.minor);
        printf("  Total Global Memory: %.2f GB\n", prop.totalGlobalMem / 1e9);
    }
    return 0;
}

Code Highlights:

  • cudaGetDeviceCount() queries available GPUs in the system
  • cudaGetDeviceProperties() retrieves detailed GPU specifications
  • Compute Capability indicates GPU generation (e.g., 7.5 for Turing, 8.6 for Ampere)

Compilation Command:

nvcc cuda_devices.cu -o cuda_devices
./cuda_devices

Example Output:

Found 1 CUDA device(s)

Device 0: NVIDIA GeForce RTX 5060 Ti
  Compute Capability: 12.0
  Total Global Memory: 16.60 GB

Usage & Best Practices

When to Use CUDA?

  • Data-parallel problems: processing arrays, matrices, images
  • Computationally intensive simulations: physics, molecular dynamics, CFD
  • Machine learning: training neural networks, inference
  • Signal/image processing: filtering, transformations, FFT

Best Practices:

  • Start by identifying parallel portions of your code
  • Use CUDA for computationally intensive tasks (not I/O bound operations)
  • Consider existing libraries (cuBLAS, cuFFT) before writing custom kernels, as in the sketch after this list
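
As a rough sketch of that library-first approach (illustrative; the vector length and variable names are arbitrary, initialization and error checks are omitted), the SAXPY operation y = alpha*x + y can be delegated to cuBLAS rather than a custom kernel. Compile with nvcc and link with -lcublas.

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int n = 1 << 20;
    const float alpha = 2.0f;

    // Device vectors (filling them via cudaMemcpy is omitted for brevity)
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    // y = alpha * x + y, computed by a vendor-tuned GPU kernel
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}

cuBLAS ships with the CUDA toolkit, so no separate installation is needed.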

Common Mistakes to Avoid:

  • Avoid: Using GPU for small datasets (memory transfer overhead dominates; see the timing sketch below)
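
To see that overhead directly, one option (a sketch; the array size is deliberately small and arbitrary) is to time the host-to-device copy itself with CUDA events:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main()
{
    const int n = 1024;  // a deliberately small array
    size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);

    // CUDA events give GPU-side timestamps around the copy
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // For arrays this small, the copy alone often outweighs any kernel speedup
    printf("Host-to-device transfer of %zu bytes took %.3f ms\n", bytes, ms);

    cudaFree(d);
    free(h);
    return 0;
}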

Summary:

  • CUDA enables general-purpose computing on NVIDIA GPUs with C/C++ extensions
  • GPUs excel at data-parallel tasks with thousands of simultaneous threads
  • Memory bandwidth and core count are key GPU advantages
  • CUDA toolkit provides compilers, libraries, and debugging tools
  • Check GPU availability and properties before developing applications

