Dougall
@dougallj
du.gl · Joined February 2007

Dougall’s Tweets

Another correction: I was wrong about every instruction being an entry point. I updated the post to reflect my current understanding. See the "Usually one-to-one translation" section. I also added 2 new "Update" sections, describing the two known inter-instruction optimisations.
Correction: I've discovered one more inter-instruction optimisation: prologue and epilogue combining, equivalent to the "stack engine" in hardware implementations of x86. This pairs loads and stores, and delays stack-pointer updates. Example on mastodon:
Some good news from earlier that I kept around for a rainy day: crc32c_soft has been augmented with crc32c_arm64_hw and crc32c_x86_hw, both shipping in the public release of macOS Ventura. Their implementations are what you’d expect them to be.
Quote Tweet
The expectation: M1 is so fast because Apple can optimize their software to take full advantage of the hardware! The reality:
Lightly retouched Ghidra decompilation of crc32c_soft from APFS.framework, clearly showing that the function runs over the input byte-by-byte rather than using a vectorized or hardware-accelerated implementation.


uint32_t _crc32c_soft(uint32_t result, uint8_t *buffer, size_t count)

{
    if (count != 0) {
        do {
            /* one table lookup per input byte */
            result = _crc32c_table[(result & 0xff) ^ (uint)*buffer] ^ (result >> 8);
            count -= 1;
            buffer = buffer + 1;
        } while (count != 0);
    }
    return result;
}
Great thread of further reading on tristate numbers (values with unknown bits) and integer range analysis, including a proof I was wondering about in my "Addition with Unknown Bits" blog post.
Quote Tweet
Just found this great intro post to tristate numbers by @dougallj: dougallj.wordpress.com/2020/01/13/bit They can be used in compiler optimizations to track the bits of an integer variable that a compiler knows the value of (the remaining bits are unknown).
Wasmer just started sponsoring Mold. Keep up the good work! Mold is a great piece of software; let's aim to make it self-sustaining.
Quote Tweet
I was optimistic when I started the mold project that I'd be able to earn a comfortable income in some way if it became popular. But I may have to admit that that's a bit too optimistic. I'm still losing money after two years.
Hello Xcode team, I am back from retirement to report bugs that followed me across jobs. You call this function 53 million times on 30k unique files. Each time it iterates over the UTF8View of a bridged string. Fanning out over 8 cores is cute, but please use caching (FB11698739)
Instruments time profile trace of XCBBuildService.

9.13 min  100.0%    0 s         XCBBuildService (36179)
[thread junk]
7.21 min   78.9%    0 s          closure #8 in BuildPlan.init(planRequest:taskPlanningDelegate:)
7.06 min   77.3%    0 s           protocol witness for TaskProducer.generateTasks() in conformance HeadermapTaskProducer
7.06 min   77.3%    0 s            HeadermapTaskProducer.generateTasks()
6.96 min   76.2%    0 s             specialized HeadermapTaskProducer.constructAllProjectHeadersMap(_:_:)
6.15 min   67.2%    31.00 ms         FilePathResolver.resolveAbsolutePath(_:)
6.14 min   67.1%    23.00 ms          FilePathResolver.computeAbsolutePath(_:)
5.68 min   62.1%    28.00 ms           Path.join(_:preserveRoot:normalize:)
4.82 min   52.7%    60.00 ms            Path._isNormalized.getter
2.31 min   25.3%    2.37 s               Substring.index(after:)
2.20 min   24.0%    1.53 s                _StringGuts._foreignOpaqueCharacterStride(startingAt:)
[string junk]
And approximate ROB NOP capacities (measured to the nearest ten by NOPs with LDR latency):
Firestorm: 2300, Icestorm: 420
Avalanche: 2040, Blizzard: 520
Everest: 1880, Sawtooth: 750
(NOP capacity ≈ 7 × coalescing capacity, so these numbers seem to make sense)
Rough Apple CPU coalescing ROB sizes (measured by LDRs with FSQRT latency):
M1: Firestorm: 333, Icestorm: 59
A15: Avalanche: 293, Blizzard: 74-79 (weird graph)
A16: Everest: 270, Sawtooth: 108
(Though Sawtooth's ROB behaviour has some changes - maybe branches use two slots?)
The Apple M1 return-address prediction stacks are large: 50 entries on Firestorm and 32 on Icestorm. But unlike Intel and AMD, this gets cleared on overflow, leading to a surprising performance cliff. Probably easy to avoid, but something to watch out for.
Icestorm graph showing clock cycles per iteration as nested call depth increases from 1 to 40. There are three lines. The first, labeled "call-ret", gradually increases from 5 cycles at 1 nested call to 74 cycles at 32 nested calls, then jumps up to 553 cycles at 33 nested calls, then continues to increase at the same gradient as before. The second line, labeled "mispredicted", sits just above the first line, trailing by about 11 to 14 cycles (i.e. one mispredict penalty). The third line, labeled "misaligned", is much steeper, proceeding roughly linearly from 5 to 553 cycles at 33 nested calls. Inexplicably it goes a little higher at 31 and 32 nested calls, before dipping back to meet the first two lines, then continuing at the same gradient as before.
Firestorm graph showing clock cycles per iteration as nested call depth increases from 1 to 60. There are three lines. The first, labeled "call-ret", gradually increases from 5 cycles at 1 nested call to 102 cycles at 50 nested calls, then jumps up to 900 cycles at 51 nested calls, then continues to increase at the same gradient as before. The second line, labeled "mispredicted", sits just above the first line, trailing by 14 cycles (i.e. one mispredict penalty). The third line, labeled "misaligned", is much steeper, proceeding roughly linearly from 4 to 900 cycles at 51 nested calls. Inexplicably it goes a little higher from 48 to 50 nested calls, before dipping back to meet the first two lines, then continuing at the same gradient as before.
Fun SVE2 trick: HISTSEG can be used backwards as a sparse byte-to-small-int mapping. e.g. simdjson classifies operators ",:[]{}" and whitespace " \t\r\n". It could use svhistseg(data, dup16(",,::[[]]{{}} \t\r\n")) to map operators to 2, whitespace to 1, and everything else to 0.
Can we get prefix-sum instructions in SVE? They needn't be fast, but I'm doing a vector-length-agnostic 64-bit prefix-exclusive-sum in 11 to 26 instructions (TBL vs REV+EXT). VL-specific 64-bit prefix-exclusive-sum on current hardware (all 128-bit, as I use bitperm) is 1 op.
Screenshot of an SVE implementation of Lemire's "despace" function. Prefix sum takes 11 (out of 23) instructions in the inner loop, 5 constant registers, and 5 setup instructions.

<despace>:
cbz x2, 7ac <despace+0x98>
mov x4, x0
mov x3, #0x0 // #0
adrp x5, 0 <_Z10print_masku10__SVBool_t>
ptrue p1.b
add x5, x5, #0x0
index z2.d, #-1, #1
ld1rw {z16.s}, p1/z, [x5]
index z7.d, #-2, #1
index z6.d, #-4, #1
index z5.d, #-8, #1
index z4.d, #-16, #1
nop
whilelo p0.b, x3, x2
ld1b {z1.b}, p0/z, [x1, x3]
nmatch p0.b, p0/z, z1.b, z16.b
mov z3.b, p0/z, #-1
mov z0.b, p0/z, #1
bgrp z1.d, z1.d, z3.d
cnt z0.d, p1/m, z0.d
tbl z3.d, {z0.d}, z2.d
add z0.d, z0.d, z3.d
tbl z3.d, {z0.d}, z7.d
add z0.d, z0.d, z3.d
tbl z3.d, {z0.d}, z6.d
add z0.d, z0.d, z3.d
tbl z3.d, {z0.d}, z5.d
incb x3
add z0.d, z0.d, z3.d
tbl z3.d, {z0.d}, z4.d
add z0.d, z0.d, z3.d
tbl z0.d, {z0.d}, z2.d
st1d {z1.d}, p1, [x4, z0.d]
incp x4, p0.b
cmp x2, x3
b.hi 748 <despace+0x34> // b.pmore
sub x0, x4, x0
ret
mov x0, #0x0 // #0
ret