When running a sum loop over an array in Rust, I noticed a huge performance drop when `CAPACITY` >= 240. `CAPACITY` = 239 is about 80 times faster.

Is there a special compilation optimization Rust is doing for "short" arrays?

Compiled with `rustc -C opt-level=3`.
```rust
use std::time::Instant;

const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

fn main() {
    let mut arr = [0; CAPACITY];
    for i in 0..CAPACITY {
        arr[i] = i;
    }

    let mut sum = 0;
    let now = Instant::now();
    for _ in 0..IN_LOOPS {
        let mut s = 0;
        for i in 0..arr.len() {
            s += arr[i];
        }
        sum += s;
    }

    println!("sum:{} time:{:?}", sum, now.elapsed());
}
```
Summary: below 240, LLVM fully unrolls the inner loop and that lets it notice it can optimize away the repeat loop, breaking your benchmark.
You found a magic threshold above which LLVM stops performing certain optimizations. The threshold is 8 bytes * 240 = 1920 bytes (your array is an array of `usize`s, so its length is multiplied by 8 bytes, assuming an x86-64 CPU). In this benchmark, one specific optimization, only performed for length 239, is responsible for the huge speed difference. But let's start slowly:

(All code in this answer is compiled with `-C opt-level=3`.)
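The exact snippet behind the original Godbolt link isn't preserved here; a minimal standalone version of the sum loop might look like this (the function name and signature are my reconstruction, with `usize` elements to match the 8-byte element size mentioned above):

```rust
// A sketch of the standalone sum loop discussed below. Taking the array
// by reference keeps LLVM from constant-folding the whole computation.
pub fn sum_array(arr: &[usize; 240]) -> usize {
    let mut s = 0;
    for i in 0..arr.len() {
        s += arr[i];
    }
    s
}
```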
This simple code will produce roughly the assembly one would expect: a loop adding up elements. However, if you change `240` to `239`, the emitted assembly differs quite a lot. See it on Godbolt Compiler Explorer. Here is a small part of the assembly:
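(The original listing isn't preserved here; the sketch below shows the shape of the fully unrolled, vectorized portion. Registers and stack offsets are illustrative rather than copied from the original output.)

```asm
movdqa  xmm1, xmmword ptr [rsp + 32]   ; load two usizes into xmm1
movdqa  xmm0, xmmword ptr [rsp + 48]   ; load two more into xmm0
paddq   xmm1, xmmword ptr [rsp]        ; add two elements to xmm1
paddq   xmm0, xmmword ptr [rsp + 16]   ; add two elements to xmm0
paddq   xmm1, xmmword ptr [rsp + 64]   ; ...and so on: no loop in sight
; (many more paddq instructions follow; the loop body was pasted repeatedly)
```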
This is what's called loop unrolling: LLVM pastes the loop body a bunch of times to avoid having to execute all those "loop management" instructions, i.e. incrementing the loop variable, checking whether the loop has ended, and jumping to the start of the loop.

In case you're wondering: the `paddq` and similar instructions are SIMD instructions which allow summing up multiple values in parallel. Moreover, two 16-byte SIMD registers (`xmm0` and `xmm1`) are used in parallel so that instruction-level parallelism of the CPU can basically execute two of these instructions at the same time. After all, they are independent of one another. In the end, both registers are added together and then horizontally summed down to the scalar result.

Modern mainstream x86 CPUs (not low-power Atom) really can do 2 vector loads per clock when they hit in L1d cache, and `paddq` throughput is also at least 2 per clock, with 1 cycle latency on most CPUs. See https://agner.org/optimize/ and also this Q&A about multiple accumulators to hide latency (of FP FMA for a dot product) and bottleneck on throughput instead.

LLVM does unroll small loops somewhat when it's not fully unrolling, and it still uses multiple accumulators. So usually, front-end bandwidth and back-end latency bottlenecks aren't a huge problem for LLVM-generated loops even without full unrolling.
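As a Rust-level sketch of the multiple-accumulators idea (not from the original answer; the function name is illustrative):

```rust
// Two independent accumulators: the additions into s0 and s1 don't depend
// on each other, so the CPU can execute them in parallel and hide the
// latency of each add. This mirrors how LLVM uses xmm0 and xmm1 above.
pub fn sum_two_accumulators(arr: &[usize]) -> usize {
    let mut s0 = 0;
    let mut s1 = 0;
    let mut chunks = arr.chunks_exact(2);
    for pair in &mut chunks {
        s0 += pair[0];
        s1 += pair[1];
    }
    // Handle the leftover element if the length is odd.
    s0 + s1 + chunks.remainder().iter().sum::<usize>()
}
```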
But loop unrolling is not responsible for a performance difference of a factor of 80! At least not loop unrolling alone. Let's take a look at the actual benchmarking code, which puts the one loop inside another one:
(On Godbolt Compiler Explorer)
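(This is the same code as shown in the question, repeated here for reference:)

```rust
use std::time::Instant;

const CAPACITY: usize = 240;
const IN_LOOPS: usize = 500000;

fn main() {
    let mut arr = [0; CAPACITY];
    for i in 0..CAPACITY {
        arr[i] = i;
    }

    let mut sum = 0;
    let now = Instant::now();
    for _ in 0..IN_LOOPS {
        let mut s = 0;
        for i in 0..arr.len() {
            s += arr[i];
        }
        sum += s;
    }

    println!("sum:{} time:{:?}", sum, now.elapsed());
}
```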
The assembly for `CAPACITY = 240` looks normal: two nested loops. (At the start of the function there is quite some code just for initializing, which we will ignore.) For 239, however, it looks very different! We see that the initializing loop and the inner loop got unrolled: so far, so expected.

The important difference is that for 239, LLVM was able to figure out that the result of the inner loop does not depend on the outer loop! As a consequence, LLVM emits code that basically first executes only the inner loop (calculating the sum) and then simulates the outer loop by adding up `sum` a bunch of times!

First we see almost the same assembly as above (the assembly representing the inner loop). Afterwards we see this (I added comments to explain the assembly; the comments with `*` are especially important):
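(The original listing isn't preserved here; this is a sketch of its structure, with illustrative registers and labels:)

```asm
; at the start of the function, rbx was set to 0

movq    rdx, xmm1       ; store the result of the SIMD summing in rdx
add     rdx, rbx        ; add the sum of the remaining scalar elements
mov     rax, 500000     ; * initialize the loop counter of the outer loop
.LBB0_1:
    add rbx, rdx        ; * rbx += rdx (add the inner sum once per iteration)
    add rax, -1         ; * decrement the loop counter
    jne .LBB0_1         ; * if it isn't zero yet, jump back to .LBB0_1
mov rax, rbx            ; move rbx (the total) into rax, the return value
```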
As you can see here, the result of the inner loop is taken, added up as often as the outer loop would have run, and then returned. LLVM can only perform this optimization because it understood that the inner loop is independent of the outer one.
This means the runtime changes from `CAPACITY * IN_LOOPS` to `CAPACITY + IN_LOOPS`. And this is responsible for the huge performance difference.

An additional note: can you do anything about this? Not really. LLVM has to have such magic thresholds; without them, LLVM optimizations could take forever to complete on certain code. But we can also agree that this code was highly artificial. In practice, I doubt that such a huge difference would occur. The difference due to full loop unrolling is usually not even a factor of 2 in these cases. So no need to worry about real use cases.
As a last note about idiomatic Rust code: `arr.iter().sum()` is a better way to sum up all elements of an array. And changing this in the second example does not lead to any notable differences in emitted assembly. You should use short and idiomatic versions unless you have measured that it hurts performance.
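For example, the inner loop of the benchmark can be written like this (a sketch; the function name is illustrative):

```rust
// Idiomatic version of the inner loop: iterate over the elements directly
// instead of indexing, and let `sum` do the folding.
fn sum_elements(arr: &[usize]) -> usize {
    arr.iter().sum()
}
```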