Store forwarding characteristics for Golden Cove and Zen 3. Both can forward with 0-cycle latency if addresses match. If the load is contained in the store, Zen 3 takes ~2 extra cycles vs GLC, and pays a higher penalty if forwarding fails. See blog.stuffedcow.net/2014/01/x86-me for test methodology
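To make the cases above concrete, here is a rough sketch of how a load can relate to an earlier store by byte range. The function name and the four labels are made up for illustration, and treating "addresses match" as "same start address and same width" is an assumption on my part, not something the measurements define.

```python
def classify(store_addr, store_size, load_addr, load_size):
    """Relate a load to an earlier store by their byte ranges (hypothetical helper)."""
    store_end = store_addr + store_size
    load_end = load_addr + load_size
    if load_addr >= store_end or store_addr >= load_end:
        return "disjoint"   # no shared bytes, nothing to forward
    if load_addr == store_addr and load_size == store_size:
        return "exact"      # assumed meaning of "addresses match"
    if load_addr >= store_addr and load_end <= store_end:
        return "contained"  # load lies fully inside the store
    return "partial"        # load needs bytes the store did not write


print(classify(0x100, 8, 0x100, 8))  # same address, same width
print(classify(0x100, 8, 0x104, 4))  # upper half of the store
print(classify(0x100, 8, 0x104, 8))  # straddles the end of the store
```

Going by the thread, "contained" is where Zen 3 costs ~2 extra cycles vs GLC, and "partial" is the kind of case where forwarding can fail outright.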
Also, Zen 3 suffers misaligned access penalties at 32 byte boundaries. Intel takes these penalties if 64 byte cache line boundaries are crossed. Zen 4 (from the Gigabyte leak) should change natural L1D alignment to 64 bytes, likely as part of AVX-512 support
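A quick way to see which accesses would hit the misalignment penalty: check whether the first and last byte fall in different aligned blocks. This is only a sketch; `crosses_boundary` and the `BOUNDARY` table are hypothetical names, and the 32/64 values are simply the figures quoted above.

```python
# Boundary sizes as quoted in the thread (bytes); illustrative only.
BOUNDARY = {"zen3": 32, "intel": 64}


def crosses_boundary(addr, size, boundary):
    """True if the access [addr, addr + size) spans two aligned blocks."""
    first_block = addr // boundary
    last_block = (addr + size - 1) // boundary
    return first_block != last_block


# An 8-byte access at offset 0x1C covers bytes 28..35.
print(crosses_boundary(0x1C, 8, BOUNDARY["zen3"]))   # crosses a 32-byte boundary
print(crosses_boundary(0x1C, 8, BOUNDARY["intel"]))  # stays inside one 64-byte line
```

So the same access can be penalized on Zen 3 and not on Intel, which is the difference being described.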
Apple M1/Firestorm, and Ampere Altra/N1. M1: no zero-latency forwarding, but no expensive failure case if the load and store partially overlap (!) N1: very limited store forwarding, only works if ld is width-aligned with the store, similar to Tremont/Gracemont
I don't understand what you are claiming here.
Apple have true 0-cycle latency IF certain conditions are met, most importantly that the store and subsequent load use the same register as base address
There's also a slightly different version that gives ZCL off SP-based load/store
I've validated this really is ZCL (happens in Rename). But it's difficult to validate because if you use a naive loop you end up gated by the subsequent checking of the speculation; you need to include a delay like a DIV to counteract that.
As for the Golden Cove version, are you claiming a TRUE ZCL that happens at Rename? I assume that can't be based on address match, so it's based on register pattern match like Apple?
I just discovered this
patents.google.com/patent/US11175 (late 2020 so probably post M1, maybe in A15).
Basic idea is work done to reduce the load/store dependency latency, in at least some cases, from 7 cycles to 3 :-)
I don't know if it would help your test case (or how much); the exact case it's trying to improve is when a store address arrives substantially earlier than store data.
But it shows this is still an issue Apple is working on.
Worth retrying as soon as you have access to an M2!

