CUDA: Device Query

Using cudaGetDeviceProperties() lets your program learn about the GPU it is running on: its compute capability, how much memory it has, how many multiprocessors it contains, and which optional features it supports. This information helps you write CUDA code that runs well on different types of GPUs, for example by choosing launch parameters that fit the device or enabling a faster code path only when the hardware supports it. Knowing these details also makes it easier to diagnose problems tied to specific hardware. Robust scientific programs need to work well on many generations of GPUs, from older architectures like Maxwell to newer ones like Blackwell.

Core Concept

The cudaDeviceProp structure holds complete information about a GPU, and you fill it by calling the cudaGetDeviceProperties() function. Key properties include the compute capability (which determines supported features), total global memory, maximum threads per block, multiprocessor count, and shared memory size. Programs use this information to verify that the hardware meets their requirements, choose efficient launch configurations, and adapt algorithms to the GPU they are running on.

Key Points

  • cudaGetDeviceProperties(): Retrieves GPU specifications into cudaDeviceProp struct
  • Compute Capability: Major.minor version (e.g., 7.5 for Turing)
  • Memory Properties: Total memory, shared memory per block, constant memory
  • Thread Limits: Max threads per block, max dimensions, warp size
  • Multiprocessor Info: SM count, CUDA core estimate, clock rates
  • Feature Support: Concurrent kernels, unified addressing, peer-to-peer

Code Example

Querying and displaying essential GPU properties

CUDA Implementation:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount == 0) {
        printf("No CUDA-capable device found\n");
        return 1;
    }

    for (int dev = 0; dev < deviceCount; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute Capability: %d.%d\n", prop.major, prop.minor);
        printf("  Total Memory: %.2f GB\n", prop.totalGlobalMem / 1e9);
        printf("  Multiprocessors: %d\n", prop.multiProcessorCount);
        printf("  Max Threads/Block: %d\n", prop.maxThreadsPerBlock);
        printf("  Shared Memory/Block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("  Warp Size: %d\n", prop.warpSize);

        // Check feature support from compute capability and feature flags
        if (prop.major >= 7) {
            printf("  Supports Tensor Cores\n");
        }
        if (prop.concurrentKernels) {
            printf("  Supports Concurrent Kernels\n");
        }
    }
    return 0;
}
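
The limit and feature fields listed under Key Points but not printed above (per-dimension thread and grid limits, unified addressing, peer-to-peer access) can be queried the same way. A minimal sketch, assuming device 0 exists and using device indices 0 and 1 only for illustration:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Per-dimension launch limits
    printf("Max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

    // Feature flags
    printf("Unified addressing: %s\n", prop.unifiedAddressing ? "yes" : "no");

    // Peer-to-peer access between device 0 and device 1 (only if both exist)
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount > 1) {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);
        printf("Device 0 can access device 1: %s\n", canAccess ? "yes" : "no");
    }
    return 0;
}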

Using Properties for Configuration:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);

// Adaptive block size
int threadsPerBlock = prop.maxThreadsPerBlock > 512 ? 512 : 256;

// Check shared memory availability
if (prop.sharedMemPerBlock >= 49152) {
    // Use 48KB shared memory configuration
}
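
If the kernel is known at compile time, the runtime's occupancy helper can suggest a block size instead of a hand-picked threshold. A minimal sketch of this alternative; the kernel scaleKernel and its arguments are placeholders for illustration:

#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void launchScaled(float *d_data, float factor, int n) {
    if (n <= 0) return;

    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for a block size that maximizes occupancy for this kernel
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scaleKernel, 0, 0);

    int gridSize = (n + blockSize - 1) / blockSize;  // enough blocks to cover n elements
    scaleKernel<<<gridSize, blockSize>>>(d_data, factor, n);
}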

Usage & Best Practices

When to Query Properties

  • Startup: Validate GPU meets minimum requirements
  • Configuration: Calculate optimal launch parameters
  • Feature Detection: Check for specific capabilities (e.g., double precision)
  • Multi-GPU: Select appropriate device for workload
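
For the multi-GPU case, one simple approach is to score each device from its properties and select the best one before any allocations or kernel launches. A minimal sketch that ranks devices by multiprocessor count (a crude heuristic; a real application might also weigh memory size or compute capability):

#include <cuda_runtime.h>

// Pick the device with the most multiprocessors; returns the chosen device id
int selectBestDevice() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    int bestDevice = 0;
    int bestSMs = -1;
    for (int dev = 0; dev < deviceCount; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        if (prop.multiProcessorCount > bestSMs) {
            bestSMs = prop.multiProcessorCount;
            bestDevice = dev;
        }
    }
    cudaSetDevice(bestDevice);  // make it the current device for this host thread
    return bestDevice;
}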

Best Practices

  • Query once at startup, cache results (see the sketch after this list)
  • Validate compute capability for required features
  • Use properties to size shared memory and registers
  • Check maxThreadsPerBlock before kernel launch
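
A minimal sketch combining the first and last practices above: query the properties once, cache them, and clamp a requested block size against maxThreadsPerBlock before launching. The helper names cachedDeviceProps and clampBlockSize are illustrative:

#include <algorithm>
#include <cuda_runtime.h>

// Query the properties of the current device only once and reuse the result
const cudaDeviceProp& cachedDeviceProps() {
    static const cudaDeviceProp prop = [] {
        cudaDeviceProp p;
        int dev = 0;
        cudaGetDevice(&dev);
        cudaGetDeviceProperties(&p, dev);
        return p;
    }();
    return prop;
}

// Clamp a requested block size to what the device actually allows
int clampBlockSize(int requested) {
    return std::min(requested, cachedDeviceProps().maxThreadsPerBlock);
}

// Usage: int threads = clampBlockSize(1024);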

Common Mistakes

  • Avoid: Hardcoding launch parameters without checking limits
  • Avoid: Assuming all GPUs support same features

Key Takeaways

Summary:

  • cudaGetDeviceProperties() provides comprehensive GPU information
  • Compute capability indicates architecture generation and features
  • Memory, thread, and block limits vary across GPU models
  • Use properties to check requirements and optimize configurations
  • Query multiprocessor count for performance estimation
  • Feature flags indicate capability support (concurrent kernels, etc.)

Quick Reference

Basic Query:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, deviceId);

Essential Properties:

  • name: GPU model name (string)
  • major, minor: Compute capability
  • totalGlobalMem: Total device memory (bytes)
  • sharedMemPerBlock: Shared memory per block (bytes)
  • maxThreadsPerBlock: Maximum threads per block
  • maxThreadsDim[3]: Max threads in each dimension
  • maxGridSize[3]: Max blocks in each dimension
  • multiProcessorCount: Number of SMs
  • warpSize: Threads per warp (always 32)
  • clockRate: GPU clock rate (kHz)
  • concurrentKernels: Concurrent kernel support (bool)

Compute Capabilities:

  • 5.x: Maxwell
  • 6.x: Pascal
  • 7.x: Volta / Turing
  • 8.x: Ampere / Ada Lovelace (8.9)
  • 9.x: Hopper
  • 10.x / 12.x: Blackwell

Validation Example:

if (prop.major < 6) {
    printf("Error: Requires Pascal or newer\n");
    exit(1);
}
