In this post, we are going to look at some basic CUDA code. Even though it doesn't actually print "Hello World!", it performs a very simple arithmetic operation, so we will treat it as the "Hello World!" program for CUDA.
Since discrete GPU cards have their own memory, in CUDA we need to manage two separate copies (there are exceptions) of the same arrays – one on the CPU and one on the GPU. Hence, in this code we will learn how to –
- Allocate arrays on the GPU
- Transfer the contents of a CPU array to a GPU array
- Define a GPU kernel
- Launch the GPU kernel
- Transfer the contents of a GPU array back to a CPU array
- Free the GPU arrays
Following is the CUDA code for adding two arrays element-wise and storing the result in a third array.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// GPU kernel function
__global__ void arradd(int *md, int *nd, int *pd, int size)
{
    // Get the unique thread ID within the block
    int myid = threadIdx.x;

    // Every thread adds one value from array "md" and one from array "nd",
    // and stores the result in array "pd"
    pd[myid] = md[myid] + nd[myid];
}

int main()
{
    int size = 200 * sizeof(int);
    int m[200], n[200], p[200], *md, *nd, *pd;
    int i = 0;

    for (i = 0; i < 200; i++)
    {
        m[i] = i;
        n[i] = i;
        p[i] = 0;
    }

    // Allocate memory for the "m" array on the GPU ("md")
    cudaMalloc((void **)&md, size);
    // Copy contents of array "m" from the CPU to array "md" on the GPU
    cudaMemcpy(md, m, size, cudaMemcpyHostToDevice);

    // Allocate memory for the "n" array on the GPU ("nd")
    cudaMalloc((void **)&nd, size);
    // Copy contents of array "n" from the CPU to array "nd" on the GPU
    cudaMemcpy(nd, n, size, cudaMemcpyHostToDevice);

    // Allocate memory for the "p" array on the GPU ("pd")
    cudaMalloc((void **)&pd, size);

    dim3 DimGrid(1, 1);    // Total number of blocks: 1 x 1 = 1
    dim3 DimBlock(200, 1); // Number of threads per block: 200 x 1 = 200

    // Launch the GPU kernel with "DimGrid" blocks of "DimBlock" threads each
    arradd<<< DimGrid, DimBlock >>>(md, nd, pd, size);

    // Copy contents of array "pd" from the GPU back to array "p" on the CPU
    cudaMemcpy(p, pd, size, cudaMemcpyDeviceToHost);

    // Free the arrays on the GPU
    cudaFree(md);
    cudaFree(nd);
    cudaFree(pd);

    for (i = 0; i < 200; i++)
    {
        printf("\t%d", p[i]);
    }
    printf("\n");

    return 0;
}
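Note that this example ignores the status codes returned by the CUDA runtime calls; a failed cudaMalloc or kernel launch would pass silently. Below is a minimal sketch of one way to check them. The CHECK macro is just an illustrative helper of our own (not part of the CUDA API), while cudaGetLastError, cudaDeviceSynchronize, and cudaGetErrorString are standard CUDA runtime functions.

// Illustrative error-checking helper (the macro name "CHECK" is our own, not a CUDA API)
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err = (call);                                        \
        if (err != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                 \
                    __FILE__, __LINE__, cudaGetErrorString(err));        \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

// Usage with the calls from the program above:
//   CHECK(cudaMalloc((void **)&md, size));
//   CHECK(cudaMemcpy(md, m, size, cudaMemcpyHostToDevice));
//   arradd<<< DimGrid, DimBlock >>>(md, nd, pd, size);
//   CHECK(cudaGetLastError());       // reports kernel launch errors
//   CHECK(cudaDeviceSynchronize());  // reports errors that occur during kernel execution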
On Linux, to compile this code (please refer to this post for frequently used CUDA/nvcc compiler options) –
nvcc vector_add_1D.cu
On Linux, to run this code –
./a.out
Output of this code (each element is p[i] = m[i] + n[i] = 2*i) –
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134 136 138 140 142 144 146 148 150 152 154 156 158 160 162 164 166 168 170 172 174 176 178 180 182 184 186 188 190 192 194 196 198 200 202 204 206 208 210 212 214 216 218 220 222 224 226 228 230 232 234 236 238 240 242 244 246 248 250 252 254 256 258 260 262 264 266 268 270 272 274 276 278 280 282 284 286 288 290 292 294 296 298 300 302 304 306 308 310 312 314 316 318 320 322 324 326 328 330 332 334 336 338 340 342 344 346 348 350 352 354 356 358 360 362 364 366 368 370 372 374 376 378 380 382 384 386 388 390 392 394 396 398
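One thing to note: this example uses a single block of 200 threads, and a single block can only hold a limited number of threads (at most 1024 on current GPUs). For larger arrays, the usual pattern is to launch many blocks and have each thread compute a global index from blockIdx, blockDim, and threadIdx, with a bounds check so that extra threads do nothing. The following is only a sketch; N here is a hypothetical element count (not the byte size used above).

// Kernel variant for arrays with more elements than one block can cover
__global__ void arradd_big(int *md, int *nd, int *pd, int N)
{
    // Global index of this thread across all blocks
    int myid = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads past the end of the array do nothing
    if (myid < N)
        pd[myid] = md[myid] + nd[myid];
}

// Launch enough blocks of 256 threads to cover all N elements:
//   int threadsPerBlock = 256;
//   int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
//   arradd_big<<< numBlocks, threadsPerBlock >>>(md, nd, pd, N);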