Callgrind for a CUDA program: execution-time percentages do not add up to 100%


I have a CNN image classification program in C++ that uses LibTorch (PyTorch) with CUDA to speed up inference. I ran it under Callgrind:

valgrind --tool=callgrind ./my-bin

to profile it and find potential bottlenecks. My call graph now looks like this:

[image: Callgrind call graph]

My understanding was that Callgrind's call graph may omit minor execution branches, so the percentages of the branches shown almost never add up to exactly 100%, but they should come close (around 90%, say). In this call graph, however, the major branches account for less than 50% of the total execution time. Is such a large gap expected?

My unfounded guess was that, if the program delegates computation to the GPU, Callgrind should either

  • count that work as execution time; or
  • exclude it from both the numerator and the denominator of the call graph percentages.

But neither seems to be the case. Any thoughts?
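
To make the two guesses concrete, here is a minimal sketch of the arithmetic I have in mind. All the cost figures are made up for illustration (they are not taken from my profile), and the names `shown_cpu_cost`, `hidden_cpu_cost`, `gpu_cost` and `shown_share` are hypothetical:

```python
# Hypothetical costs in arbitrary units; NOT measured values from my program.
shown_cpu_cost = 45.0   # major branches that appear in the call graph
hidden_cpu_cost = 5.0   # minor branches assumed to be pruned from the graph
gpu_cost = 50.0         # work delegated to the GPU


def shown_share(numerator, denominator):
    """Percentage of the total attributed to the branches shown in the graph."""
    return 100.0 * numerator / denominator


# Guess 1: GPU work is counted as execution time of the branches that launch it,
# so it appears in both the shown cost and the total.
guess_1 = shown_share(shown_cpu_cost + gpu_cost,
                      shown_cpu_cost + hidden_cpu_cost + gpu_cost)

# Guess 2: GPU work is excluded from both the numerator and the denominator.
guess_2 = shown_share(shown_cpu_cost, shown_cpu_cost + hidden_cpu_cost)

print(f"guess 1: shown branches cover {guess_1:.0f}% of the total")
print(f"guess 2: shown branches cover {guess_2:.0f}% of the total")
```

With these assumed numbers, both guesses leave the visible branches covering 90% or more of the total, which is why a sum below 50% surprises me.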
