Conversation

Also, Zen 3 suffers misaligned access penalties at 32-byte boundaries. Intel takes these penalties if 64-byte cache line boundaries are crossed. Zen 4 (from the Gigabyte leak) should change natural L1D alignment to 64 bytes, likely as part of AVX-512 support.
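To make the boundary condition concrete, here is a minimal sketch of the address arithmetic only. The function name `crosses_boundary` is hypothetical, and the 32 and 64 values are the boundary sizes quoted in the post above. It says nothing about how large the penalty is on any given core.

```python
def crosses_boundary(addr: int, size: int, boundary: int) -> bool:
    """Return True if the access [addr, addr + size) spans a multiple of `boundary`.

    Illustrative helper (not from the thread): it only checks whether the first
    and last byte of the access fall into different `boundary`-sized blocks.
    """
    first_block = addr // boundary
    last_block = (addr + size - 1) // boundary
    return first_block != last_block


# An 8-byte load at offset 28 straddles a 32-byte boundary
# but stays inside a single 64-byte cache line.
print(crosses_boundary(28, 8, 32))  # True
print(crosses_boundary(28, 8, 64))  # False
# The same load at offset 60 straddles a 64-byte line as well.
print(crosses_boundary(60, 8, 64))  # True
```

Under the claim in the post, the first case would be the one that is penalized on Zen 3 but not on Intel, while the last case would be penalized on both.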
Replying to
I don't understand what you are claiming here. Apple have true 0-cycle latency IF certain conditions are met, most importantly that the store and the subsequent load use the same register as the base address. There's also a slightly different version that gives ZCL off SP-based load/store.
Replying to
I've validated that this really is ZCL (it happens in Rename). But it's difficult to validate, because with a naive loop you end up gated by the subsequent checking of the speculation; you need to include a delay like a DIV to counteract that.
Replying to
I don't know if it would help your test case (or how much); the exact case it's trying to improve is when a store address arrives substantially earlier than the store data. But it shows this is still an issue Apple is working on. Worth retrying as soon as you have access to an M2!