I have a constant (64-bit) address that I want to load into a register. This address is located in the code segment, so it could be addressed relative to RIP. What are the differences between
movabs rax, 0x123456789abc
and
lea rax, [rip+0xFF] // relative offset for 0x123456789abc
in terms of execution speed, and which one is preferable (in a situation where both alternatives could theoretically be used, like in a JIT or when the address can be fixed up at link time)?
By looking at the disassembly, LEA results in less code, but would it be faster because of that, or potentially slower due to the relative offset encoding?
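For reference, a minimal NASM sketch like the following (the label target is purely illustrative) can be assembled and disassembled to see the size difference:

bits 64
default rel
section .text
target:
    mov  rax, 0x123456789abc    ; encoded as 48 B8 <imm64>    -> 10 bytes
    lea  rax, [target]          ; encoded as 48 8D 05 <disp32> -> 7 bytes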
TL;DR: In a hot loop, the former (movabs) is generally faster because it has a better (i.e. lower) reciprocal throughput on most modern processors.

Indeed, on Intel Haswell/Broadwell/Skylake/CoffeeLake/CannonLake/IceLake/TigerLake/RocketLake (too many of those lakes), the movabs has a reciprocal throughput of 0.25 while it is 1 for the lea (due to the rip-relative addressing).

On the quite-recent Intel AlderLake hybrid architecture, things are significantly more complex. AlderLake's P-cores (GoldenCove) have a reciprocal throughput of 0.2 for the movabs and 1 for the lea (mainly due to the rip-relative addressing again). AlderLake's E-cores (Gracemont) are pretty different: the reciprocal throughput for the movabs is 0.33, while it is 0.25 for the lea. This means that the best instruction to use depends on which kind of core the thread is scheduled on! This is crazy. Even funnier: it looks like Goldmont/Tremont already had a fast lea with rip-relative addressing while SunnyCove/WillowCove did not. This is because the P-core and E-core architectures are designed for different purposes (AFAIK the *mont-like architectures were designed for low-power processors while the *Cove-like ones were designed for desktop processors). Not to mention Intel certainly had not initially planned to mix the two kinds of architecture in the same chip.

On AMD Zen1/Zen2, it is 0.25 for the movabs and 0.5 for the lea, so the former is also better. On AMD Zen3/Zen4, both have a reciprocal throughput of 0.25, so they are equally fast on these architectures.

That being said, the movabs takes more space (10 bytes versus 7) and is likely slower to decode than the lea, so the lea might be better outside a hot loop. Indeed, instructions are decoded to µops once and then kept in a cache for relatively short loops, but decoding is typically the bottleneck for large code executed once (no hot loop, or a very large one, or code that needs to be fetched from RAM or the L3 cache).
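These numbers can be checked empirically with a throughput microbenchmark. Here is a minimal sketch, assuming Linux x86-64 and NASM (the iteration count, the unroll factor of 8, and the file name bench.asm are arbitrary choices of mine, not from the measurements above):

bits 64
global _start
section .text
_start:
    mov  ecx, 100000000          ; iteration count (arbitrary)
.loop:
    ; 8 independent copies of the instruction under test; rax is only
    ; written, never read, so register renaming removes any dependency
    ; and the loop runs at the throughput limit rather than the latency.
    mov  rax, 0x123456789abc
    mov  rax, 0x123456789abc
    mov  rax, 0x123456789abc
    mov  rax, 0x123456789abc
    mov  rax, 0x123456789abc
    mov  rax, 0x123456789abc
    mov  rax, 0x123456789abc
    mov  rax, 0x123456789abc
    dec  ecx
    jnz  .loop
    mov  eax, 60                 ; exit(0) via the Linux syscall interface
    xor  edi, edi
    syscall

Build and measure with something like nasm -f elf64 bench.asm && ld -o bench bench.o && perf stat ./bench, then divide the measured core cycles by 8 * 100000000 to estimate the reciprocal throughput; swap the mov lines for lea rax, [rel _start] to test the rip-relative case. The dec/jnz pair adds a little loop overhead, so a larger unroll factor gives a cleaner estimate.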