Profiling OpenACC code on a remote system can sometimes be tricky. Often we have to profile code in a cluster environment, where jobs must be submitted through a job scheduler. In such scenarios, command-line profiling comes in handy.
This tutorial provides some usage examples for NVIDIA's command-line profiler, nvprof.
The basic usage of nvprof is as follows:
nvprof ./a.out
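On a cluster, this command is typically wrapped in a batch script for the job scheduler. As a minimal sketch, assuming a SLURM-based system, it could look like the following; the partition name, module name, and resource limits are placeholders that vary from cluster to cluster:
#!/bin/bash
#SBATCH --job-name=nvprof-run    # placeholder job name
#SBATCH --partition=gpu          # placeholder GPU partition name
#SBATCH --gres=gpu:1             # request one GPU
#SBATCH --time=00:10:00          # wall-clock limit for the profiling run

module load cuda                 # placeholder; load whatever module provides nvprof
nvprof ./a.out
Since nvprof prints its summary to stderr, the results will show up in the job's error/output file after the run completes.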
I have shared sample profiling output for one of my OpenACC codes below –
==6405== NVPROF is profiling process 6405, command: ./a.out
Value of pi is : 3.141593
Execution time : 0.3913819789886475 seconds
==6405== Profiling application: ./a.out
==6405== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 99.95% 112.21ms 1 112.21ms 112.21ms 112.21ms pi_calc_17_gpu
0.05% 50.912us 1 50.912us 50.912us 50.912us pi_calc_17_gpu__red
0.00% 3.2960us 1 3.2960us 3.2960us 3.2960us [CUDA memset]
0.00% 1.6960us 1 1.6960us 1.6960us 1.6960us [CUDA memcpy DtoH]
API calls: 50.56% 132.90ms 1 132.90ms 132.90ms 132.90ms cuDevicePrimaryCtxRetain
42.71% 112.28ms 3 37.425ms 5.1090us 112.26ms cuStreamSynchronize
6.40% 16.819ms 1 16.819ms 16.819ms 16.819ms cuMemHostAlloc
0.14% 374.98us 1 374.98us 374.98us 374.98us cuMemAllocHost
0.08% 222.65us 3 74.216us 6.7530us 111.74us cuMemAlloc
0.07% 173.70us 1 173.70us 173.70us 173.70us cuModuleLoadDataEx
0.01% 25.777us 2 12.888us 6.1510us 19.626us cuLaunchKernel
0.01% 22.514us 1 22.514us 22.514us 22.514us cuMemcpyDtoHAsync
0.00% 11.347us 1 11.347us 11.347us 11.347us cuStreamCreate
0.00% 8.9010us 1 8.9010us 8.9010us 8.9010us cuMemsetD32Async
0.00% 3.6610us 1 3.6610us 3.6610us 3.6610us cuDeviceGetPCIBusId
0.00% 3.3190us 1 3.3190us 3.3190us 3.3190us cuEventRecord
0.00% 2.6890us 2 1.3440us 413ns 2.2760us cuEventCreate
0.00% 2.2660us 3 755ns 298ns 1.2790us cuCtxSetCurrent
0.00% 2.1290us 3 709ns 269ns 1.2060us cuDeviceGetCount
0.00% 1.7620us 5 352ns 237ns 454ns cuDeviceGetAttribute
0.00% 1.6410us 1 1.6410us 1.6410us 1.6410us cuPointerGetAttributes
0.00% 1.1130us 1 1.1130us 1.1130us 1.1130us cuEventSynchronize
0.00% 1.0150us 2 507ns 248ns 767ns cuModuleGetFunction
0.00% 940ns 2 470ns 216ns 724ns cuDeviceGet
0.00% 297ns 1 297ns 297ns 297ns cuDeviceComputeCapability
0.00% 285ns 1 285ns 285ns 285ns cuCtxGetCurrent
0.00% 263ns 1 263ns 263ns 263ns cuDriverGetVersion
OpenACC (excl): 86.63% 112.28ms 2 56.139ms 9.2280us 112.27ms acc_wait@pi_calc.f90:17
13.01% 16.860ms 1 16.860ms 16.860ms 16.860ms acc_exit_data@pi_calc.f90:17
0.15% 198.14us 1 198.14us 198.14us 198.14us acc_device_init@pi_calc.f90:17
0.09% 121.69us 1 121.69us 121.69us 121.69us acc_compute_construct@pi_calc.f90:17
0.04% 50.844us 1 50.844us 50.844us 50.844us acc_enter_data@pi_calc.f90:17
0.03% 41.963us 1 41.963us 41.963us 41.963us acc_enqueue_download@pi_calc.f90:23
0.02% 23.683us 1 23.683us 23.683us 23.683us acc_enqueue_launch@pi_calc.f90:17 (pi_calc_17_gpu)
0.01% 11.535us 1 11.535us 11.535us 11.535us acc_wait@pi_calc.f90:23
0.01% 10.588us 1 10.588us 10.588us 10.588us acc_enqueue_upload@pi_calc.f90:17
0.01% 7.7160us 1 7.7160us 7.7160us 7.7160us acc_enqueue_launch@pi_calc.f90:17 (pi_calc_17_gpu__red)
0.00% 0ns 1 0ns 0ns 0ns acc_delete@pi_calc.f90:23
0.00% 0ns 1 0ns 0ns 0ns acc_alloc@pi_calc.f90:17
0.00% 0ns 1 0ns 0ns 0ns acc_create@pi_calc.f90:17
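In this summary, nvprof groups the results into GPU activities (kernels and memory operations), the CUDA driver API calls made on our behalf by the OpenACC runtime, and OpenACC events tagged with the source file and line number (pi_calc.f90:17 and so on). When running under a job scheduler, it can also be convenient to send this summary to a dedicated file instead of stderr:
nvprof --log-file profile_summary.log ./a.out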
If we want to visualize nvprof's profiling output using NVIDIA's Visual Profiler (nvvp), we can store the profiling output in a file.
nvprof -o profile_output.nvvp ./a.out
NVIDIA's Visual Profiler (nvvp) can then be used to open this profiler output file ('profile_output.nvvp') via File -> Open.
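Since nvvp is a GUI tool, a common workflow on remote systems is to copy the output file to a local machine with nvvp installed and open it there; the hostname and path below are placeholders:
scp user@cluster:/path/to/profile_output.nvvp .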
We can also perform a detailed analysis of a specific kernel using the following command, where 'pi_calc_17_gpu' is the name of the kernel we want to analyze. Please note that because nvprof has to replay the kernel many times to collect the full set of analysis metrics, the following command takes a considerable amount of time to run.
nvprof -o profile_output_detailed.nvvp --analysis-metrics --kernels pi_calc_17_gpu ./a.out
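If the full analysis pass takes too long, a lighter-weight alternative is to collect only a few metrics of interest directly on the command line. This is just a sketch; the set of available metric names depends on the GPU architecture and can be listed with nvprof --query-metrics:
nvprof --kernels pi_calc_17_gpu --metrics achieved_occupancy,gld_efficiency ./a.out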
As mentioned before, this 'profile_output_detailed.nvvp' file can be opened using NVIDIA's Visual Profiler (nvvp). To visualize all the additional details that we just collected, we have to follow these steps –
Step 1: Open the 'profile_output_detailed.nvvp' file in nvvp (File -> Open).
Step 2: Select the 'pi_calc_17_gpu' kernel in the timeline.
Step 3: Run the stages in the Analysis tab to view the collected metrics for the kernel.
Step 4 (optional): Export the analysis results as a PDF report (File -> Export PDF Report).
At the end of step 4, we will have a PDF report containing the detailed analysis of the given kernel.
Note: nvvp and nvprof are deprecated and will not be supported in future versions of the CUDA Toolkit. Users are recommended to migrate to the new NVIDIA Nsight tools: Nsight Systems for system-wide timeline profiling and Nsight Compute for detailed kernel analysis. These tools can be downloaded separately or as part of the NVIDIA HPC SDK.
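For reference, the rough nvprof equivalents in the newer tools are sketched below; Nsight Systems (nsys) covers timeline and summary profiling, while Nsight Compute (ncu) covers detailed per-kernel metrics. The output file names are placeholders:
nsys profile --stats=true -o timeline ./a.out
ncu -o kernel_details ./a.out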