I have a C++ CNN image-classification program that uses LibTorch (PyTorch) with CUDA, and I want to improve its inference performance. To find potential bottlenecks, I tried profiling it with Callgrind:

valgrind --tool=callgrind ./my-bin

The resulting call graph looks like this:
My previous understanding was that Callgrind may exclude some minor execution branches, so the per-branch percentages would almost never add up to exactly 100%, but they should come close (something like 90%). In this call graph, however, the major branches account for less than 50% of total execution time. Is a gap this large expected?
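For reference, the hot path of the program is essentially the following (a simplified sketch, not my actual code; "model.pt" and the 1x3x224x224 input shape are placeholders):

#include <torch/script.h>
#include <torch/torch.h>

int main() {
    // Load a TorchScript model and move it to the GPU.
    torch::jit::script::Module module = torch::jit::load("model.pt");
    module.to(torch::kCUDA);
    module.eval();

    // Placeholder input batch living on the GPU.
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::randn({1, 3, 224, 224}, torch::kCUDA));

    // forward() enqueues CUDA kernels and returns; the GPU may still
    // be computing at this point.
    torch::Tensor output = module.forward(inputs).toTensor();
    return 0;
}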
My previous (unfounded) guess: if the program delegates computation to the GPU, Callgrind should either
- count that as execution time; or
- exclude that from both numerator and denominator of the call graph.
But somehow this does not seem to be the case. Any thoughts?
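One way I could check where the wall-clock time actually goes would be to time forward() on the CPU side with and without an explicit device synchronization. Since CUDA kernel launches are asynchronous, I'd expect the unsynchronized measurement to cover little more than launch overhead. A minimal sketch (time_forward_ms is an illustrative helper I made up, not part of my program):

#include <torch/script.h>
#include <torch/cuda.h>
#include <chrono>

// Measure one forward pass on the CPU side. Without the synchronize,
// forward() only enqueues GPU work and returns, so the measured time
// misses most of the GPU execution; with it, the CPU blocks until the
// GPU is done and the measurement includes that time.
double time_forward_ms(torch::jit::script::Module& module,
                       const std::vector<torch::jit::IValue>& inputs,
                       bool sync_gpu) {
    auto start = std::chrono::steady_clock::now();
    auto out = module.forward(inputs);
    if (sync_gpu) {
        torch::cuda::synchronize();  // wait for all queued CUDA work
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

If the synchronized time turned out to be much larger, that would at least confirm that most of the real work happens on the GPU, outside whatever Callgrind's CPU instruction simulation can see.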
