I have a C++ CNN image-classification program that uses LibTorch (PyTorch) with CUDA, and I want to improve its inference performance. To find potential bottlenecks, I tried profiling it with Callgrind:

valgrind --tool=callgrind ./my-bin

The resulting call graph looks like this:
My previous understanding was that Callgrind may exclude some minor execution branches, so the per-branch percentages would almost never add up to exactly 100%, but they should come close (something like 90%). In this call graph, however, the major branches account for less than 50% of total execution time. Is a gap this large expected?
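For reference, the hot path of the program is essentially the following (a simplified sketch, not my actual code; "model.pt" and the 1x3x224x224 input shape are placeholders):

#include <torch/script.h>
#include <torch/torch.h>

int main() {
    // Load a TorchScript model and move it to the GPU.
    torch::jit::script::Module module = torch::jit::load("model.pt");
    module.to(torch::kCUDA);
    module.eval();

    // Placeholder input batch living on the GPU.
    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(torch::randn({1, 3, 224, 224}, torch::kCUDA));

    // forward() enqueues CUDA kernels and returns; the GPU may still
    // be computing at this point.
    torch::Tensor output = module.forward(inputs).toTensor();
    return 0;
}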
My previous (unfounded) guess: if the program delegates computation to the GPU, Callgrind should either
- count that as execution time; or
- exclude that from both numerator and denominator of the call graph.
But somehow this does not seem to be the case. Any thoughts?
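One way I could check where the wall-clock time actually goes would be to time forward() on the CPU side with and without an explicit device synchronization. Since CUDA kernel launches are asynchronous, I'd expect the unsynchronized measurement to cover little more than launch overhead. A minimal sketch (time_forward_ms is an illustrative helper I made up, not part of my program):

#include <torch/script.h>
#include <torch/cuda.h>
#include <chrono>

// Measure one forward pass on the CPU side. Without the synchronize,
// forward() only enqueues GPU work and returns, so the measured time
// misses most of the GPU execution; with it, the CPU blocks until the
// GPU is done and the measurement includes that time.
double time_forward_ms(torch::jit::script::Module& module,
                       const std::vector<torch::jit::IValue>& inputs,
                       bool sync_gpu) {
    auto start = std::chrono::steady_clock::now();
    auto out = module.forward(inputs);
    if (sync_gpu) {
        torch::cuda::synchronize();  // wait for all queued CUDA work
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

If the synchronized time turned out to be much larger, that would at least confirm that most of the real work happens on the GPU, outside whatever Callgrind's CPU instruction simulation can see.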
