so I learned that x64 specifically doubles the number of XMM registers from SSE (8 to 16), and AVX-512 doubles that again (to 32), but only in x64 mode, I think? so I assume that, if you want, you can write cursed AVX-512 code in 32-bit x86 with a mere 8 registers to play with
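A rough compile-time sketch of the register counts described in this tweet. It assumes the GCC/Clang predefined macros `__x86_64__`, `__i386__`, and `__AVX512F__` (MSVC spells these differently), and the helper name `simd_register_count` is made up for illustration.

```c
/* Hypothetical helper: how many SIMD register names (xmm/ymm/zmm) the
   target this file is compiled for can address.
   Assumes GCC/Clang predefined macros; returns 0 on non-x86 targets. */
static int simd_register_count(void) {
#if defined(__x86_64__)
  #if defined(__AVX512F__)
    return 32;  /* zmm0-zmm31: only reachable in 64-bit mode */
  #else
    return 16;  /* xmm0-xmm15: x86-64 doubled the original 8 */
  #endif
#elif defined(__i386__)
    return 8;   /* xmm0-xmm7, even if AVX-512 is enabled */
#else
    return 0;   /* not x86, so these counts do not apply */
#endif
}
```

Building the same file with `-m32` versus `-m64 -mavx512f` (GCC/Clang flags) should flip the result between 8 and 32.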
-
Replying to @hikari_no_yume
You can also use MMX registers as 64-bit temp vars with 0 latency instead of storing them on the stack.
-
Replying to @fast_code_r_us
in theory the latency is better but I wouldn't be shocked if there's CPUs where it isn't (see https://twitter.com/stephentyrone/status/1214906610063151104 which is a similar thing)
-
Replying to @hikari_no_yume @fast_code_r_us
Right; e.g. on AMD Bobcat GPR <-> MMX is 7 cycles latency each way, much slower than going to the stack.
-
Does touching MMX still obligate you to `emms` at some point?
-
Yuuuup. But fortunately there's no reason to use MMX, ever, on hw that supports SSE2.
-
Well I was thinking the need to `emms` kills a lot of MMX's theoretical value as a spilling ground
-
Oh, definitely. You should never use MMX. But the perf issues apply equally to using SSE as spilling ground.
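For illustration only, since the advice here is not to do this: a minimal sketch of what parking a 64-bit value in an MMX register and then clearing MMX state looks like with compiler intrinsics. It assumes x86-64 GCC or Clang with `<mmintrin.h>`; the name `park_in_mmx` is hypothetical, other targets fall back to a plain copy, and the compiler is free to pick different instructions than the ones named in the comments.

```c
#include <stdint.h>
#if defined(__x86_64__) && defined(__MMX__)
#include <mmintrin.h>
#endif

/* Hypothetical helper: round-trip a 64-bit integer through an MMX
   register, the "spill to MMX instead of the stack" idea. */
static int64_t park_in_mmx(int64_t value) {
#if defined(__x86_64__) && defined(__MMX__)
    __m64 parked = _mm_cvtsi64_m64(value);   /* GPR -> MMX */
    int64_t back = _mm_cvtm64_si64(parked);  /* MMX -> GPR */
    /* MMX registers alias the x87 register stack, so the state must be
       cleared (emms) before any later x87 floating-point code runs. */
    _mm_empty();
    return back;
#else
    return value;  /* assumption: no MMX here, so just copy */
#endif
}
```

The value comes back unchanged, e.g. `park_in_mmx(42)` returns 42; the cost is in the two cross-domain moves plus the `emms`, which is the point made above.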
-
Replying to @stephentyrone @jckarter and
It’s weird how scalar operations are fast, HW F32x4 is fast (SSE2), HW F32x8 is fast (AVX2), HW F32x16 is fast (AVX-512), but HW F32x2 is slow (MMX).
-
MMX doesn't do F32. 8, 16, and 32-bit integers.
-
Oh right, s/F32/I32/g.
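With the F32 to I32 correction applied, the I32x4 case on SSE2 might look like the sketch below. It assumes a compiler that defines `__SSE2__` and ships `<emmintrin.h>`; `add_i32x4` is a hypothetical name, and a scalar loop stands in on targets without SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for four 32-bit integers,
   wrapping on overflow. */
static void add_i32x4(const int32_t a[4], const int32_t b[4], int32_t out[4]) {
#if defined(__SSE2__)
    /* Unaligned loads, so the arrays need no special alignment. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
#else
    /* Scalar fallback; unsigned arithmetic gives the same wraparound. */
    for (int i = 0; i < 4; ++i)
        out[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
#endif
}
```

Adding `{1, 2, 3, 4}` and `{10, 20, 30, 40}` this way gives `{11, 22, 33, 44}` in a single 128-bit add on the SSE2 path.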