I published my research on the Apple M1 CPU microarchitectures (Firestorm and Icestorm), with instruction tables describing throughput, latency, and uops for most instructions, and (an insane number of) detailed experiments and measurements.
dougallj.github.io/applecpu/fires
Dougall’s Tweets
Correction: I've discovered one more inter-instruction optimisation: prologue and epilogue combining, equivalent to the "stack engine" in hardware implementations of x86. This pairs loads and stores, and delays stack-pointer updates.
Example on Mastodon:
New blog post: "Why is Rosetta 2 fast?"
dougallj.wordpress.com/2022/11/09/why
So this Mastodon security issue has been open for over five years: github.com/mastodon/masto
In totally unrelated news, I found a way to verify my Twitter account on Mastodon:
Some good news from earlier that I kept around for a rainy day: the crc32c_soft has been augmented with crc32c_arm64_hw and crc32c_x86_hw, both shipping in the public release of macOS Ventura. Their implementations are what you’d expect them to be.
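For context, a hardware CRC32C path would use the dedicated instructions (the CRC32C* family on arm64, the SSE4.2 crc32 instruction on x86), while a software path computes the same checksum bit by bit or from lookup tables. Below is a minimal Python sketch of the bitwise software computation. It is not Apple's code, and the function name is only illustrative.

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C (Castagnoli), using the reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF  # standard initial inversion
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial if that bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF  # standard final inversion
```

The standard check value is `crc32c(b"123456789") == 0xE3069283`, and passing a previous result as `crc` continues the checksum over further data.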
Quote Tweet
The expectation: M1 is so fast because Apple can optimize their software to take full advantage of the hardware!
The reality:
After a long hiatus I finally wrote a new blog post:
BC1 Compression Revisited ludicon.com/castano/blog/2
New blog post: "What’s that magic computation in stb__RefineBlock?"
Great thread of further reading on tristate numbers (values with unknown bits), and integer range analysis, including a proof I was wondering about in my "Addition with Unknown Bits" blog post.
Quote Tweet
Just found this great intro post to tristate numbers by @dougallj: dougallj.wordpress.com/2020/01/13/bit
They can be used in compiler optimizations to track the bits of an integer variable that a compiler knows the value of (the remaining bits are unknown).
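As a rough illustration of the idea, here is a Python sketch of a tristate number as a (value, mask) pair, where set mask bits are unknown. The addition follows the formulation I know from the Linux kernel's `tnum_add`; the class and helper names, and the 8-bit width, are assumptions made for this example.

```python
from typing import NamedTuple

WIDTH = 8                      # illustrative bit width
FULL = (1 << WIDTH) - 1

class Tristate(NamedTuple):
    value: int  # bits known to be 1 (always 0 where mask is 1)
    mask: int   # 1 = this bit is unknown

def tristate_add(a: Tristate, b: Tristate) -> Tristate:
    """Sound over-approximation of a + b (mod 2**WIDTH)."""
    sm = (a.mask + b.mask) & FULL      # sum with every unknown bit set to 1, minus sv
    sv = (a.value + b.value) & FULL    # sum with every unknown bit set to 0
    sigma = (sm + sv) & FULL           # largest possible sum
    chi = sigma ^ sv                   # bits where the carries can differ
    mu = chi | a.mask | b.mask         # all bits that may be unknown in the result
    return Tristate(sv & ~mu & FULL, mu)

def contains(t: Tristate, x: int) -> bool:
    """True if the concrete value x agrees with every known bit of t."""
    return (x & ~t.mask & FULL) == t.value
```

For example, adding a known 1 to a value whose low bit is unknown gives `tristate_add(Tristate(1, 0), Tristate(0, 1)) == Tristate(0, 3)`: the two low bits become unknown, which covers both possible sums (1 and 2).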
Wasmer just started sponsoring Mold ✨
Keep up the good work! Mold is a great piece of software, let's aim to make it self-sustained
Quote Tweet
I was optimistic when I started the mold project that I'd be able to earn a comfortable income in some way if it becomes popular. But I may have to admit that that's a bit too optimistic. I'm still losing my money after two years.
replace the 8d with a 03 and your problem is solved
IEEE 754 is a double standard
NEON shuffle instruction iceberg meme:
Rosetta support is now part of open-source XNU, yay!
Hello you fine Internet folks!
Today's article is microbenchmarking Nvidia's new RTX 4090.
Hope y'all enjoy!
Discontinuities in civil time
Slides for 's presentation last weekend at are now available at grsecurity.net/papers Learn how your CPU really works and what recent speculative execution vulnerabilities make possible.
Any developers interested in talking about a subtle stability bug that I suspect is present in your drive encryption product? Send me a DM to discuss and together we can improve PC reliability
dEQP-GLES3.functional.fragment_out.basic.*
Passed: 420/420 (100.0%)
🥳
Today's Apple GPU open source compiler change:
30% fewer registers
10% fewer instructions
10% smaller shaders
🍎🐧😋
Hello you fine internet folks!
Today's article is on microbenchmarking Intel's Xe-HPG architecture in the form of the A770.
Hope y'all enjoy!
Hello Xcode team I am back from retirement to report bugs that followed me across jobs. You call this function 53 million times on 30k unique files. Each time it iterates over the UTF8View of a bridged string. Fanning out over 8 cores is cute but please use caching (FB11698739)
The biggest gap in the graphics APIs for GPGPU workloads
(tl;dr: the GPGPU ecosystem assumes that shared virtual memory is a baseline capability)
threedots.ovh/blog/2022/10/t
Hello you fine Internet Folks!
Today's article talks about Intel's longest-serving architecture and its different iterations, Skylake!
Hope y'all enjoy!
Methodology based on Henry Wong's "Microbenchmarking Return Address Branch Prediction": blog.stuffedcow.net/2018/04/ras-mi
The Apple M1 return-address prediction stacks are large: 50 entries on Firestorm and 32 on Icestorm. But unlike Intel and AMD, this gets cleared on overflow, leading to a surprising performance cliff.
Probably easy to avoid, but something to watch out for.
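To make the cliff concrete, here is a toy Python model of a return-address stack. It only counts how many returns would be predicted for a simple chain of nested calls. The exact overflow behaviour (clear everything, then push the new entry) and the "drop the oldest entry" alternative are simplifying assumptions for illustration, not measured hardware behaviour.

```python
def predicted_returns(depth: int, capacity: int, clear_on_overflow: bool) -> int:
    """Count correctly predicted returns after `depth` nested calls (toy model)."""
    stack = []
    for addr in range(depth):              # one call per nesting level
        if len(stack) == capacity:
            if clear_on_overflow:
                stack.clear()              # assumed: overflow discards every entry
            else:
                stack.pop(0)               # assumed: overflow discards only the oldest
        stack.append(addr)
    hits = 0
    for addr in reversed(range(depth)):    # returns unwind in reverse call order
        if stack and stack.pop() == addr:
            hits += 1
    return hits
```

In this model, with a capacity of 50, a call depth of 50 predicts all 50 returns either way, but a depth of 51 predicts only 1 return when the stack is cleared on overflow, versus 50 when only the oldest entry is dropped.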
I made a web version of the store-to-load forwarding test, based on Henry Wong's methodology. It's not a good idea, but it is oddly satisfying to watch:
dougallj.github.io/webrobsize/for
Sharing another of my free time projects - a cache/mem bandwidth and latency test. Supports bw testing from both instruction and data side, and plenty of other options too. Hoping this will be useful to hardware reviewers and tech enthusiasts github.com/clamchowder/Mi
Though this code scatters to overlapping elements, which may or may not be correct?
I couldn't find clarification in the reference manual or exploration tools. I found this in the "programming examples" (developer.arm.com/documentation/), but I'm not entirely sure how to interpret it:
Can we get prefix-sum instructions in SVE?
They needn't be fast, but I'm doing a vector-length-agnostic 64-bit prefix-exclusive-sum in 11 to 26 instructions (TBL vs REV+EXT).
VL-specific 64-bit prefix-exclusive-sum on current hardware (all 128-bit, as I use bitperm) is 1 op.
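For reference, a scalar Python sketch of what a 64-bit exclusive prefix sum computes, next to a log-step shift-and-add version of the kind vector code commonly uses. This is not the TBL or REV+EXT sequence from the tweet, and the function names are illustrative. With only two 64-bit lanes (a 128-bit vector), the result is simply `[0, a0]`.

```python
MASK64 = (1 << 64) - 1  # wrap sums like 64-bit lanes would

def exclusive_prefix_sum(xs):
    """out[i] = sum of xs[0..i-1], so out[0] is always 0."""
    out, total = [], 0
    for x in xs:
        out.append(total)
        total = (total + x) & MASK64
    return out

def exclusive_prefix_sum_log_steps(xs):
    """Same result in O(log n) whole-vector steps (shift lanes, then add)."""
    n = len(xs)
    if n == 0:
        return []
    acc = [0] + list(xs[:-1])          # shift up one lane to make the sum exclusive
    shift = 1
    while shift < n:
        # Each lane adds the lane `shift` positions below it (zero if none).
        acc = [(acc[i] + (acc[i - shift] if i >= shift else 0)) & MASK64
               for i in range(n)]
        shift *= 2
    return acc
```

For example, `exclusive_prefix_sum([3, 1, 4, 1, 5])` gives `[0, 3, 4, 8, 9]`, and the log-step version returns the same list.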
New blog post: "On AlphaTensor’s new matrix multiplication algorithms" fgiesen.wordpress.com/2022/10/06/on- (TL;DR these are cool from a computational complexity PoV and for quite large matrices, not at all interesting for small ones, and not meant to be.)