Tweets by @Danlark1
It was a good side project which lasted 6-7 weeks, with occasional "aha!" moments. I thank @Cyan4973 and the ZSTD team for providing an amazing compression algorithm. For the full analysis, read the PR; I tried to make it easy to follow: https://github.com/facebook/zstd/pull/2689 23/23
One conclusion I want to emphasize: even super-low-level libraries can still be optimized, with a lot of impact on the world. I am pretty sure another 5-10% is hidden just in the codegen of the decompression loop. For example, long-mode compression has a perf gap: https://github.com/facebook/zstd/issues/2695 22/n
After all that, we also started to see ZSTD decompression performance align between clang and gcc. 21/n pic.twitter.com/TWhE77YNRK
Thanks to Nick Terrell, who helped me mitigate this: we just removed one branch completely. After that, no compiler regressed on performance. 20/n pic.twitter.com/6ObhANtqvQ
However, while validating that no compiler suffered, I forgot that GCC 11 existed. For some reason it was much better than all other versions of GCC, and my patch made it slower, down to GCC 10 performance. 19/n pic.twitter.com/tmtJACyv9f
The results impressed us: we saw 5-9% improvements, 3% on silesia.tar. Laptops benefited the most. 18/n pic.twitter.com/IFxYWWJAEL
The last thing that helped was block fusion: I got clang to recognize similar blocks and merge them into one assembly block. 17/n pic.twitter.com/9En7KUc8rQ
Clang did even worse: it used the stack twice! After shuffling code around, I got it down to a single use, same as GCC; this gave another 2% for clang. 16/n pic.twitter.com/htXf7SZJRG
Given we have 14 registers besides rsp and rbp, it was weird to see that rax, r9 and r14 were skipped in the preamble. rax and r9 were probably busy with other values; r14, however, was free. It is hard to validate, but I am pretty sure this is a bug. 15/n
The last thing was to look at register allocation: it turned out the function uses 13 variables plus 4 pointers that compute the values, and those can be reused. %r14 was not used; the stack was used instead (and it did not even store anything). 15/n pic.twitter.com/v8APai568x
This gave around 0.1-0.5% for both GCC and Clang, though it was harder to estimate; it definitely helped reduce .rodata pressure, which is ... at least nice. 14/n
After exploring some assembly: using `& BIT_mask[i]` is better when the table is in L1 cache, but not if we have the BMI2 instruction set. BMI2 has the "bzhi" instruction, which does `& ((1 << i) - 1)` (https://www.felixcloutier.com/x86/bzhi), and with it the operation becomes 1 cycle: https://gcc.godbolt.org/z/cac5nr5hs 13/n
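As a sketch of the trade-off (the names `mask_lookup`/`mask_compute` and the demo table are illustrative, not ZSTD's actual identifiers), both forms below compute the same n-bit mask; built with `-mbmi2`, a compiler can turn the arithmetic form into a single `bzhi`:

```c
#include <stdint.h>

/* Table form: one load from .rodata; fast only while the table stays
 * hot in L1. Entry i equals (1u << i) - 1. */
static const uint32_t BIT_mask_demo[32] = {
    0x0u,       0x1u,       0x3u,       0x7u,
    0xFu,       0x1Fu,      0x3Fu,      0x7Fu,
    0xFFu,      0x1FFu,     0x3FFu,     0x7FFu,
    0xFFFu,     0x1FFFu,    0x3FFFu,    0x7FFFu,
    0xFFFFu,    0x1FFFFu,   0x3FFFFu,   0x7FFFFu,
    0xFFFFFu,   0x1FFFFFu,  0x3FFFFFu,  0x7FFFFFu,
    0xFFFFFFu,  0x1FFFFFFu, 0x3FFFFFFu, 0x7FFFFFFu,
    0xFFFFFFFu, 0x1FFFFFFFu, 0x3FFFFFFFu, 0x7FFFFFFFu
};

static inline uint32_t mask_lookup(uint32_t v, uint32_t i) {
    return v & BIT_mask_demo[i];
}

/* Arithmetic form: no memory access; with BMI2 enabled this pattern
 * can compile to one bzhi instruction. Valid for i in [0, 31]. */
static inline uint32_t mask_compute(uint32_t v, uint32_t i) {
    return v & ((1u << i) - 1);
}
```

Both keep the low `i` bits of `v`; the arithmetic form also frees the 128 bytes of .rodata the table occupies.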
Independently, we dumped all .rodata accesses of some binaries and found that BIT_mask is accessed quite frequently: `& BIT_mask[i]` accounted for several percent of accesses per request for one target. Coincidentally, BIT_mask[i] is (1 << i) - 1 for all i from 0 to 31. 12/n pic.twitter.com/6qmU99m8Ba
The second thing was more accidental than systematic. I found that ZSTD dynamically dispatches its decompression calls, and the advanced version uses the BMI2 instruction set. That proved reasonable: it has some nice instructions like shrx, etc. 11/n
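A minimal sketch of that kind of dispatch (function names hypothetical; ZSTD's real entry points and detection code differ), using GCC/Clang's `__builtin_cpu_supports` to check for BMI2 once and route calls to the specialized build of the hot loop:

```c
#include <stddef.h>

/* Two builds of the same hot loop; the BMI2 variant is compiled with
 * target("bmi2") so the compiler may emit shrx/bzhi. Bodies are stubs. */
static size_t decompress_default(const void* src, size_t n) {
    (void)src;
    return n;
}

__attribute__((target("bmi2")))
static size_t decompress_bmi2(const void* src, size_t n) {
    (void)src;
    return n;
}

/* Probe CPUID lazily on first call, then dispatch every call to the
 * best implementation available on this machine. */
static size_t decompress(const void* src, size_t n) {
    static int use_bmi2 = -1;
    if (use_bmi2 < 0)
        use_bmi2 = __builtin_cpu_supports("bmi2");
    return use_bmi2 ? decompress_bmi2(src, n)
                    : decompress_default(src, n);
}
```

The payoff is that one binary runs everywhere while the hot path still gets BMI2 codegen on CPUs that have it.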
However, I was not satisfied with the analysis and wanted to double-check whether these branches are really rare. It turned out they are: on silesia.tar and other datasets the branches were taken in under 1% of cases on average, and GCC had made absolutely the right choices. 10/n
Such mispredictions may hurt performance, and dense code placement is generally good: better cache behavior, etc. You can see in the picture that clang was not sampling. I tried {UN}LIKELY hints for clang and it worked, around 2-3% of performance. But GCC began to overthink, hehe :) 9/n pic.twitter.com/YfQcK4WtnH
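The {UN}LIKELY hints are the usual wrappers over `__builtin_expect` (GCC/Clang); the refill example below is hypothetical, just to show the cold path being marked so the compiler keeps the hot path in the fall-through:

```c
/* Standard LIKELY/UNLIKELY macros over __builtin_expect (GCC/Clang).
 * They only hint block placement: the cold path is moved out of the
 * fall-through, keeping the hot path dense in the instruction cache. */
#define LIKELY(x)   (__builtin_expect(!!(x), 1))
#define UNLIKELY(x) (__builtin_expect(!!(x), 0))

/* Hypothetical use: a rare refill branch in a bit-reader-style loop. */
static int read_bits(int* bitsLeft, int n) {
    if (UNLIKELY(*bitsLeft < n)) {
        *bitsLeft = 64;   /* cold path: refill the bit container (stub) */
    }
    *bitsLeft -= n;       /* hot path: just consume n bits */
    return n;
}
```

Without the hint, a compiler that guesses wrong may place the refill code inline on the hot path, which is exactly the layout problem described above.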
The first idea that worked was to find all significant branches; for some reason GCC made better decisions about how to lay out the jumps depending on whether they were taken rarely or often. The statistics showed the branches really were biased so that the jumps do not happen, which surprised me. 8/n pic.twitter.com/9z4ig3nlOG
Only after that did I start experimenting. Overall I checked around 10-20 hypotheses, validated combinations of them, and found the most relevant ones. It took several days to come up with them and several more to validate; 2-3 weeks overall. 7/n
If you want to optimize, it's good to set up the environment. For me that meant perf record/stat, objdump, Intel VTune, a graph creator, Agner Fog's instruction tables, precompressed files, hardware, and 7 compilers (I forgot that gcc-11 existed; explained later). 6/n
The main decompression loop consists of roughly ~1000 instructions, and it was quite a job to look through it. I am not an advanced assembly expert, as it is hard for me to keep in mind the latencies of all instructions; yet I managed to find several opportunities. 5/n