CUDA “Hello World!”: Array addition using a single block

In this post, we will look at a basic CUDA program. Even though it doesn’t actually print “Hello World!”, it performs a very simple arithmetic operation, so we will treat it as the “Hello World!” code for CUDA.

Discrete GPU cards have their own memory, so in CUDA we need to manage two different copies of the same arrays, one on the CPU and one on the GPU (there are exceptions). Hence, in this code, we will learn how to –

  • Allocate the arrays on the GPU
  • Transfer the contents of the CPU array to the GPU array
  • Define a GPU kernel
  • Launch the GPU kernel
  • Transfer the contents of the GPU array to the CPU array
  • Free the GPU array
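
To make the "two copies" point concrete, here is a minimal sketch (not part of the original post) showing that the CPU array and the GPU array are independent once the data has been transferred. The helper name device_copy_after_host_change and the 4-element array are hypothetical and chosen only for illustration; the sketch assumes a working CUDA device.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper for illustration: returns the first element of the GPU
// copy after the CPU copy has been changed, or -1 if a CUDA call fails.
int device_copy_after_host_change(void)
{
    int host[4] = {1, 2, 3, 4};   // copy that lives in CPU memory
    int back[4] = {0, 0, 0, 0};   // buffer used to read the GPU copy back
    int *dev = NULL;              // will point to the copy in GPU memory
    size_t bytes = sizeof(host);

    // Allocate the GPU array and fill it from the CPU array
    if (cudaMalloc(&dev, bytes) != cudaSuccess)
        return -1;
    if (cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(dev);
        return -1;
    }

    // Changing the CPU array does not touch the GPU array
    host[0] = 99;

    // Read the GPU array back: it still holds the value copied earlier
    cudaError_t err = cudaMemcpy(back, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return (err == cudaSuccess) ? back[0] : -1;
}

int main()
{
    int value = device_copy_after_host_change();
    if (value < 0) {
        printf("A CUDA call failed\n");
        return 1;
    }
    printf("GPU copy of element 0: %d\n", value);
    return 0;
}
```

If every CUDA call succeeds, the program reports 1 for the GPU copy even though the CPU array now holds 99 in that position, which is why the results must be transferred back explicitly.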

The following CUDA code adds two arrays and stores the result in a third array.


#include <stdio.h>

// GPU kernel function
// "size" is the array size in bytes; the kernel does not use it because the
// launch below creates exactly one thread per array element
__global__ void arradd(int* md, int* nd, int* pd, int size)
{
	// Get unique thread ID within a block
	int myid = threadIdx.x;
	// Every thread adds one value from arrays "m" and "n" and stores it in array "p"
	pd[myid] = md[myid] + nd[myid];
}

int main()
{
	int size = 200 * sizeof(int);	// Array size in bytes
	int m[200], n[200], p[200], *md, *nd, *pd;
	int i = 0;

	// Initialize the arrays on CPU
	for(i=0; i<200; i++ )
	{
		m[i] = i;
		n[i] = i;
		p[i] = 0;
	}

	// Allocate memory for "m" array on GPU ("md")
	cudaMalloc(&md, size);
	// Copy contents of array "m" from CPU to array "md" on GPU
	cudaMemcpy(md, m, size, cudaMemcpyHostToDevice);
	// Allocate memory for "n" array on GPU ("nd")
	cudaMalloc(&nd, size);
	// Copy contents of array "n" from CPU to array "nd" on GPU
	cudaMemcpy(nd, n, size, cudaMemcpyHostToDevice);
	// Allocate memory for "p" array on GPU ("pd")
	cudaMalloc(&pd, size);

	dim3   DimGrid(1, 1); 		// Total number of blocks: 1 x 1 = 1
	dim3   DimBlock(200, 1);   	// Number of threads per block: 200 x 1 = 200

	// Launch the GPU kernel using "DimGrid" (number of blocks) and "DimBlock" (number of threads per block)
	arradd<<< DimGrid,DimBlock >>>(md,nd,pd,size);
	// Copy contents of array "pd" from GPU to array "p" on CPU
	cudaMemcpy(p, pd, size, cudaMemcpyDeviceToHost);
	// Free arrays on GPU
	cudaFree(md);
	cudaFree(nd);
	cudaFree(pd);

	// Print the result array, one tab-separated value per element
	for(i=0; i<200; i++ )
	{
		printf("%d\t", p[i]);
	}
	printf("\n");
	return 0;
}
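
The listing above ignores the return values of the CUDA calls. As a hedged sketch (an assumption about how one might harden it, not the post's own method), the version below checks every step and also checks the kernel launch with cudaGetLastError. The helper names check and add_arrays_checked are hypothetical. Unlike the listing above, this kernel receives the element count rather than the size in bytes and guards against out-of-range thread IDs.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Variant of the kernel: "count" is the number of elements (not bytes)
__global__ void arradd(const int* md, const int* nd, int* pd, int count)
{
    int myid = threadIdx.x;
    if (myid < count)                 // ignore threads beyond the array end
        pd[myid] = md[myid] + nd[myid];
}

// Hypothetical helper: reports a failed CUDA call, returns 1 on success
static int check(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return 0;
    }
    return 1;
}

// Hypothetical helper: adds m and n into p on the GPU using a single block.
// Returns 0 on success, 1 on failure. Assumes count fits in one block.
int add_arrays_checked(const int* m, const int* n, int* p, int count)
{
    size_t bytes = count * sizeof(int);
    int *md = NULL, *nd = NULL, *pd = NULL;

    // && short-circuits, so later steps are skipped after the first failure
    int ok = check(cudaMalloc(&md, bytes), "cudaMalloc md")
          && check(cudaMalloc(&nd, bytes), "cudaMalloc nd")
          && check(cudaMalloc(&pd, bytes), "cudaMalloc pd")
          && check(cudaMemcpy(md, m, bytes, cudaMemcpyHostToDevice), "copy m")
          && check(cudaMemcpy(nd, n, bytes, cudaMemcpyHostToDevice), "copy n");
    if (ok) {
        arradd<<<1, count>>>(md, nd, pd, count);
        ok = check(cudaGetLastError(), "kernel launch")
          && check(cudaMemcpy(p, pd, bytes, cudaMemcpyDeviceToHost), "copy p");
    }

    // cudaFree on a NULL pointer performs no operation
    cudaFree(md);
    cudaFree(nd);
    cudaFree(pd);
    return ok ? 0 : 1;
}

int main()
{
    int m[200], n[200], p[200];
    for (int i = 0; i < 200; i++) { m[i] = i; n[i] = i; p[i] = 0; }
    if (add_arrays_checked(m, n, p, 200) != 0)
        return 1;
    printf("p[0] = %d, p[199] = %d\n", p[0], p[199]);
    return 0;
}
```

Checking the launch separately is useful because a kernel launch returns no status itself; an invalid configuration, such as too many threads per block, only shows up through cudaGetLastError or a later call.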

On Linux, to compile this code (please refer to this post for the frequently used CUDA/nvcc compiler options) –


On Linux, to run this code –


Output of this code –

0	2	4	6	8	10	12	14	16	18	20	22	24	26	28	30	32	34	36	38	40	42	44	46	48	50	52	54	56	58	60	62	64	66	68	70	72	74	76	78	80	82	84	86	88	90	92	94	96	98	100	102	104	106	108	110	112	114	116	118	120	122	124	126	128	130	132	134	136	138	140	142	144	146	148	150	152	154	156	158	160	162	164	166	168	170	172	174	176	178	180	182	184	186	188	190	192	194	196	198	200	202	204	206	208	210	212	214	216	218	220	222	224	226	228	230	232	234	236	238	240	242	244	246	248	250	252	254	256	258	260	262	264	266	268	270	272	274	276	278	280	282	284	286	288	290	292	294	296	298	300	302	304	306	308	310	312	314	316	318	320	322	324	326	328	330	332	334	336	338	340	342	344	346	348	350	352	354	356	358	360	362	364	366	368	370	372	374	376	378	380	382	384	386	388	390	392	394	396	398
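
The introduction mentions that there are exceptions to keeping two copies of each array. One such exception is unified (managed) memory. The sketch below is an illustration under stated assumptions rather than part of the original code: it assumes a GPU and CUDA version that support cudaMallocManaged, and the helper name add_with_managed_memory is hypothetical. A single pointer per array is used by both the CPU and the GPU, so the explicit cudaMemcpy calls disappear.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Same single-block addition; "count" is the number of elements
__global__ void arradd(const int* m, const int* n, int* p, int count)
{
    int myid = threadIdx.x;
    if (myid < count)
        p[myid] = m[myid] + n[myid];
}

// Hypothetical helper: adds two managed arrays of "count" elements and
// writes the last result to *last. Returns 0 on success, 1 on failure.
int add_with_managed_memory(int count, int* last)
{
    int *m = NULL, *n = NULL, *p = NULL;
    size_t bytes = count * sizeof(int);

    // One allocation per array, reachable from both CPU and GPU
    if (cudaMallocManaged(&m, bytes) != cudaSuccess ||
        cudaMallocManaged(&n, bytes) != cudaSuccess ||
        cudaMallocManaged(&p, bytes) != cudaSuccess) {
        cudaFree(m); cudaFree(n); cudaFree(p);
        return 1;
    }

    // Initialize directly on the CPU through the same pointers
    for (int i = 0; i < count; i++) { m[i] = i; n[i] = i; p[i] = 0; }

    arradd<<<1, count>>>(m, n, p, count);

    // Wait for the kernel to finish before the CPU reads the results
    int status = (cudaDeviceSynchronize() == cudaSuccess) ? 0 : 1;
    if (status == 0)
        *last = p[count - 1];

    cudaFree(m); cudaFree(n); cudaFree(p);
    return status;
}

int main()
{
    int last = 0;
    if (add_with_managed_memory(200, &last) != 0) {
        printf("A CUDA call failed\n");
        return 1;
    }
    printf("Last element: %d\n", last);
    return 0;
}
```

The explicit synchronization matters here: without a blocking cudaMemcpy, nothing else forces the CPU to wait for the kernel before it reads the result array.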

References –