How would I go about monitoring a particular process's execution (namely, its branches, from the Branch Trace Store) using the Intel Performance Counter monitor, while filtering out other process's information?
Intel Performance Monitor -- any way to monitor per-process?
2.6k views Asked by user541686 At
3
There are 3 answers
4
Crashworks
On
We were forced to build our own instrumenting profiler that reads the MSRs directly to get this information. The Performance Counter Monitor's source code demonstrates how to build a kernel driver that reads them.
Previously we used VTune, but it crashes when run on our app. (When we tried OProfile on the Linux version, it actually crashed the entire kernel and forced us to power-cycle the machine, which was pretty funny.)
1
PhilKo
On
Check out https://github.com/andikleen/pmu-tools/blob/master/toplev.py
Examples: toplev.py -l2 program measure whole system in level 2 while program is running
Related Questions in C
- How to call a C language function from x86 assembly code?
- What does: "char *argv[]" mean?
- User input sanitization program, which takes a specific amount of arguments and passes the execution to a bash script
- How to crop a BMP image in half using C
- How can I get the difference in minutes between two dates and hours?
- Why will this code compile although it defines two variables with the same name?
- Compiling eBPF program in Docker fails due to missing '__u64' type
- Why can't I use the file pointer after the first read attempt fails?
- #include Header files in C with definition too
- OpenCV2 on CLion
- What is causing the store latency in this program?
- How to refer to the filepath of test data in test sourcecode?
- 9 Digit Addresses in Hexadecimal System in MacOS
- My server TCP doesn't receive messages from the client in C
- Printing the characters obtained from the array s using printf?
Related Questions in PERFORMANCE
- Upsert huge amount of data by EFCore.BulkExtensions
- How can I resolve this error and work smoothly in deep learning?
- Efficiently processing many small elements of a collection concurrently in Java
- Theme Preloader for speed optimization in WordPress
- I need help to understand the time wich my simple ''hello world'' is taking to execute
- Non-blocking state update
- Do conditional checks cause bottlenecks in Javascript?
- Performance of sketch drastically decreases outside of the P5 Web Editor
- sample query for review for improvement on big query
- Is there an indexing strategy in Postgres which will operate effectively for JOINs with ORs
- Performance difference between two JavaScript code snippets for comparing arrays of strings
- C++ : Is there an objective universal way to compare the speed of iterative algorithms?
- How to configure api http request with load testing
- the difference in terms of performance two types of update in opensearch
- Sveltekit : really long to send the first page and intense CPU computation
Related Questions in WINAPI
- How to immediately apply DISPLAYCONFIG_SCALING display scaling mode with SetDisplayConfig and DISPLAYCONFIG_PATH_TARGET_INFO
- Changing the theme of a #32768 (menu) window class at runtime
- Issue with GetOpenFileName while debugging
- How to populate a ListBox with SendMessage?
- Is there a function to end a child process?
- HDR video publishing
- Frameless Qt + WinAPI maximized window size is bigger than the availableGeometry()
- Mount .iso file with python
- What is Win32 x86-64 CONTEXT::VectorRegister for?
- WinAPI - right mouse drag & drop and IContextMenu
- Win32 per-filesystem cache tuning?
- Client connection timeout during Android & Windows PC communication via sockets
- MessageBoxEx sometimes shows as hollow window, border only, and only on Windows 11
- Win32api send message and Pydirectinput and Powertoy (Keyboard Manager ) Not working when open the application
- Would it be possible to run an application right after csrss.exe loads? (Windows)
Related Questions in PERFORMANCECOUNTER
- Issue with PDH Counter returning incorrect values after upgrading to Windows 11
- Begin tracing program after some number of instructions have executed
- What does the event `stall_slot_backend` represent?
- Which instructions are included in the hardware event `INST_SIMD_ALU`?
- Frequent Cache misses for loading data and accumulating Elements of std vector
- Perf: how to display cache miss percentage from raw counters
- Create Dummy CPU Performance counter register for Unit Testing of Driver
- CPU performance counters in C++ (Mac/PC, Intel)
- missing counters IDs in typeperf on non english windows
- No PAPI counters available on Ubuntu 20.04
- Profiling a Linux kernel code snippet in Intel VTune
- Can rdpmc be used to read the fixed-function counters on AMD?
- Cant get Jna to recognize and set registry
- Memory Leak in a Rust Program
- Task Manager and Hyper-V Logical Processor/%Total Run Time Comparison
Related Questions in INTEL-PMU
- What is the microcode scoreboard?
- How to get the number of instructions executed inside a virtual machine from the host?
- Why does it need to be divided by 9 when calculating UPI bandwidth on the Intel platform using UNC_UPI_TxL_FLITS.ALL_DATA event?
- How to count the number of data loaded into the cache but not used?
- How to count offcore PMU events on an old kernel?
- How to collect DMA events (like memory write and read) by perf tool?
- Getting count of TLB misses that resulted in memory access in x86-64
- Why don’t current tools support collecting memory bandwidth usage data at process granularity?
- Is mmap() faster than read() for perf_event_open
- How do different monitoring tools calculate memory bandwidth?
- Workload Memory Bandwidth Comparison Inconsistency
- Can rdpmc be used to read the fixed-function counters on AMD?
- Measuring ITLB_FLUSH on icelake processors
- Which instruction is recorded when an overflow of PMC occurs?
- Don't all loads result in an L1 cache hit (after the data arrives if it initially missed)?
Popular Questions
- How do I undo the most recent local commits in Git?
- How can I remove a specific item from an array in JavaScript?
- How do I delete a Git branch locally and remotely?
- Find all files containing a specific text (string) on Linux?
- How do I revert a Git repository to a previous commit?
- How do I create an HTML button that acts like a link?
- How do I check out a remote Git branch?
- How do I force "git pull" to overwrite local files?
- How do I list all files of a directory?
- How to check whether a string contains a substring in JavaScript?
- How do I redirect to another webpage?
- How can I iterate over rows in a Pandas DataFrame?
- How do I convert a String to an int in Java?
- Does Python have a string 'contains' substring method?
- How do I check if a string contains a specific word?
Trending Questions
- UIImageView Frame Doesn't Reflect Constraints
- Is it possible to use adb commands to click on a view by finding its ID?
- How to create a new web character symbol recognizable by html/javascript?
- Why isn't my CSS3 animation smooth in Google Chrome (but very smooth on other browsers)?
- Heap Gives Page Fault
- Connect ffmpeg to Visual Studio 2008
- Both Object- and ValueAnimator jumps when Duration is set above API LvL 24
- How to avoid default initialization of objects in std::vector?
- second argument of the command line arguments in a format other than char** argv or char* argv[]
- How to improve efficiency of algorithm which generates next lexicographic permutation?
- Navigating to the another actvity app getting crash in android
- How to read the particular message format in android and store in sqlite database?
- Resetting inventory status after order is cancelled
- Efficiently compute powers of X in SSE/AVX
- Insert into an external database using ajax and php : POST 500 (Internal Server Error)
You should know that BTS (Branch trace store) and Performance monitoring events/counters (inside CPU, its PMU block) are very different things.
The Branch Trace Store is function of CPU when it does record every taken branch (pairs of eip - first of branch instruction and second of branch target; there is also a word of flags added to each pair) in special area of memory. Result of it is very like to Single-stepping and recording order of executed code blocks (basic blocks). It is just like doing code coverage with assistance from compiler, when every branch is instrumented by compiler.
BTS is just a bit in the MSR_DEBUGCTLA MSR (it is intel x86 register); I'm almost sure that this register is thread-specific (as it is in Linux), so you need no to hook scheduler. There is some examples of working with this MSR in windows; but different bit is used. Also, don't forget to set DS_AREA correctly. So, if you really want BTS, take a copy of Intel Arch Manual (Volume 3b, Part "Debugging and Performance monitoring", section "19.7.8 Branch Trace Store (BTS)") and program BTS manually. Hardest part is to handle DS area overflow (you need custom interrupt handler).
If you want to know not a trace of executed code but statistics of you program (how much instructions executed; how well was branches predicted; how much indirect branches are here ...), you should use Performance monitoring events aka "Precise Event Based Sampling" (PEBS). Intel Vtune does this; there should be some other tools, even the Intel PBS your linked. The only problem (this is bit more difficult with free tools) is to find name of Events you want. Events based on instruction execution are always binded to some thread.
What does event-based sampling means: you can set some limit, e.g. 1000 for some event, eg. BR_INST_EXEC.COND ("number of conditional near branch instructions executed") or BR_INST_EXEC.DIRECT ("all unconditional near branch instructions excluding calls and indirect branches."), up to 2-4 events at once. Then CPU will count every situation which correspond to this event. When there will be 1000th situation, the Event (interrupt) will be generated for instrution EIP. With sampling it is easy to get detailed statistics of your code behaviour. If you will set limit to something very low and if you will not sum events for eip, you will get trace ;)
With PEBS you can know how bad is your code for the CPU, where mispredicted branches are located, which instructions wait data from cache, etc. There are 100s of events (appendix A of Volume 3b).
PS there is some code for BTS/win: http://blog.csdn.net/quincy_hu/article/details/4053163
PPS there is shorter overview of PMU programming, both PEBS and BTS. software.intel.com/file/30320 It is for Nehalem, but it can be actual even for Sandy.