I am working on a sorting algorithm that iterates over a bunch of integers and puts them into buckets.
The buckets are a custom data structure similar to std::vector. As you can imagine, the hot path contains a snippet like this one, for the case where the bucket already has enough memory allocated to hold the element I'm adding:
*_end = _new_value; // Line 1
++_end;             // Line 2
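For context, the surrounding insert path looks roughly like this (a sketch; push, _alloc_end, and grow() are approximations of my real code):

void Bucket::push(int new_value) {
    if (_end == _alloc_end) // slow path: not enough capacity
        grow();             // reallocate, not shown
    *_end = new_value;      // Line 1 from above
    ++_end;                 // Line 2
}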
I discovered in Intel VTune that line 1 accounts for about a third of my algorithm's runtime. I was curious whether I could do better, so I started experimenting.
My workstation is Linux and I usually compile with gcc. Our software has to support other compilers and systems, too, but Linux-only optimizations are considered OK since we "suggest" users use Linux.
First I simply added a look-ahead to the loop from which the above snippet is called. Looking ahead buffer_size iterations, it gets the destination pointer from:
int* Bucket::get_end() {
    __builtin_prefetch(_end, 1); // Line 3: prefetch for write
    return _end++;               // Line 4
}
It stored these results in a buffer like the following:
using delayed_write = std::pair<int, int*>; // Line 5
std::array<delayed_write, buffer_size> buffer; // Line 6
Then, buffer_size iterations later (treating buffer as a ring), I'd run the equivalent of:
*(buffer[i % buffer_size].second) = buffer[i % buffer_size].first;
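Putting it together, the loop body was roughly equivalent to this (a sketch; values, n, and bucket_for() stand in for my real input and bucket-selection logic):

for (std::size_t i = 0; i < n; ++i) {
    delayed_write& slot = buffer[i % buffer_size]; // ring-buffer slot
    if (i >= buffer_size)
        *slot.second = slot.first; // commit the write issued buffer_size iterations ago
    int v = values[i];
    slot = delayed_write(v, bucket_for(v).get_end()); // get_end() prefetches the destination
}
// Drain the writes still pending after the loop.
for (std::size_t i = (n < buffer_size ? 0 : n - buffer_size); i < n; ++i)
    *buffer[i % buffer_size].second = buffer[i % buffer_size].first;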
As far as I could see in VTune, this eliminated the bottleneck at line 1, but the algorithm was slower overall (I tried buffer_size values of 4 and 8).
I tried a few other things. In particular, I batched 4 or 8 integers at a time and performed each step on all of them at once: I wrote code to look ahead and check whether reallocation would be necessary, and if not, arranged the loops so there were no data dependencies between steps. Of course, all this complexity predictably made the algorithm much slower. :)
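The batched version was roughly along these lines (heavily simplified; v0..v3 and bucket_for() are illustrative names, and assume the look-ahead has already confirmed that none of the four target buckets needs to reallocate):

// Four independent stores with no data dependencies between them.
int* d0 = bucket_for(v0).get_end();
int* d1 = bucket_for(v1).get_end();
int* d2 = bucket_for(v2).get_end();
int* d3 = bucket_for(v3).get_end();
*d0 = v0;
*d1 = v1;
*d2 = v2;
*d3 = v3;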
It's possible it simply can't be made faster, but I intuitively feel there should be some way to exploit the fact that nothing depends on line 1's write until after the loop is over, so there's no need to wait for the likely cache miss there to be resolved.
My understanding is that a cache miss has very high latency, but I wonder why the CPU can't just keep going and leave the writes in a buffer to be handled asynchronously.
It'd be really cool if there were, say, a way to promise that I'm not going to read that memory until I call some synchronization function that commits all of the writes so far.
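The closest thing I've found to that is SSE2's non-temporal stores, though I'm not sure they fit my access pattern; something like:

#include <emmintrin.h> // SSE2 intrinsics

// In the hot loop: stream the store past the cache (MOVNTI), hinting
// that this location won't be read again soon.
_mm_stream_si32(_end, _new_value); // instead of *_end = _new_value;
++_end;

// Once, after the loop: fence so the streamed stores become visible
// before anything reads the buckets.
_mm_sfence();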
Do you think I am in fact filling up the CPU's write (store) buffer? (In which case there may be no solution.)
If not, does anyone know of any ways to exploit the fact that the write will not be read until after the hot loop?