Danila Kutenin

@Danlark1

Software Engineer at Google C++, databases, distributed systems, low level design, Linux kernel. Ex. Yandex, Ex. EPFL, HSE alumni

Geregistreerd in oktober 2019

Tweets

Je hebt @Danlark1 geblokkeerd

Weet je zeker dat je deze Tweets wilt bekijken? @Danlark1 wordt niet gedeblokkeerd door Tweets te bekijken.

  1. 30 jun.

    It was a good side project to have which lasted for 6-7 weeks with occasional "aha!" moments. I thank and the ZSTD team for providing an amazing compression algorithm. For a full analysis, read the PR, I tried to make it easy to follow 23/23

    Deze collectie tonen
    Ongedaan maken
  2. 30 jun.

    One conclusion I want to emphasize: even super low level libraries can be optimized giving lots of impact to the world. I am pretty sure 5-10% is more hidden just in codegen of the decompression loop. For example, long mode compression has perf gap 22/n

    Deze collectie tonen
    Ongedaan maken
  3. 30 jun.

    After all that we also started to see that ZSTD decompression performance aligned for clang and gcc 21/n

    Deze collectie tonen
    Ongedaan maken
  4. 30 jun.

    Thanks to Nick Terrell who helped me to mitigate this and we just removed one branch completely. After that, all compilers agreed on not doing it worse for perf 20/n

    Deze collectie tonen
    Ongedaan maken
  5. 30 jun.

    However, after validating all compilers do not suffer, I forgot that GCC 11 existed and for some reason it was much better than all other versions of GCC and my patch made it slower, on par with GCC 10 performance 19/n

    Deze collectie tonen
    Ongedaan maken
  6. 30 jun.

    The results were impressive to us, we saw 5-9% improvements, 3% on silesia.tar. Especially laptops were benefiting the most 18/n

    Deze collectie tonen
    Ongedaan maken
  7. 30 jun.

    The last thing that helped is the block fusion, I made clang recognize similar blocks and move them into one assembly block. 17/n

    Deze collectie tonen
    Ongedaan maken
  8. 30 jun.

    Clang did even worse, it used stack twice! After shuffling code around I made it use only once as well as GCC, this gave another 2% for clang 16/n

    Deze collectie tonen
    Ongedaan maken
  9. 30 jun.

    Given we have 14 registers without rsp and rbp, it was weird to see that rax, r9 and r14 were missed during the preambula. rax and r9 were probably busy with other values. However, r14 was free. It is hard to validate but I am pretty sure this is a bug. 15/n

    Deze collectie tonen
    Ongedaan maken
  10. 30 jun.

    The last thing was to see at register allocation and it turned out that there are 13 variables that are used in the function and 4 pointers that compute the values, they can be reused. %r14 was not used and instead the stack was used (and it did not store anything) 15/n

    Deze collectie tonen
    Ongedaan maken
  11. 30 jun.

    This gave around 0.1-0.5% both for GCC and Clang though it was harder to estimate, it definitely helped to reduce the .rodata pressure which is ... at least nice 14/n

    Deze collectie tonen
    Ongedaan maken
  12. 30 jun.

    After exploring some assembly, using `& BIT_mask[i]` is better when it is in L1 cache But not if we have BMI2 instruction set. BMI2 has "bzhi" instruction which does `& ((1 << i) - 1)` and with that it becomes 1 cycle 13/n

    Deze collectie tonen
    Ongedaan maken
  13. 30 jun.

    Independently we dumped all accesses of .rodata of some binaries and found that BIT_mask is used quite frequently with (`& BIT_mask[i]`) having several percent of accessed per one request of one target. Accidentally, BIT_mask is (1 << i) - 1 for all i from 0 to 31 12/n

    Deze collectie tonen
    Ongedaan maken
  14. 30 jun.

    The second thing is more accidental rather than systematic. I found that ZSTD dynamically dispatches the calls for its decompression. And for advanced version it uses bmi2 instruction set. That proved to be reasonable, it has some nice instructions like shrx, etc. 11/n

    Deze collectie tonen
    Ongedaan maken
  15. 30 jun.

    However, I was not satisfied that the analysis and I wanted to double check if these branches are really rare. Turned out that they are. On silesia.tar and other datasets the branches were taken on average only in <1% of cases and GCC made absolutely the right choices. 10/n

    Deze collectie tonen
    Ongedaan maken
  16. 30 jun.

    Such mispredictions may hurt the performance and overall densed code placement is good, better cache, etc. You can see in picture that clang was not sampling I tried {UN}LIKELY for clang and it worked, like 2-3% of performance. But GCC began to overthink, hehe :) 9/n

    Deze collectie tonen
    Ongedaan maken
  17. 30 jun.

    First idea that worked is to find all significant branches and for some reason GCC made better decisions on how to jump from them given they were taken rarely or often. The statistics showed they really were taken the way that jumps do not happen, that was a surprise to me 8/n

    Deze collectie tonen
    Ongedaan maken
  18. 30 jun.

    Only after that I started experimenting. Overall I checked around 10-20 hypothesizes, validated the combined versions of them and found the most relevant ones. It took several days first to come up with them and then several days to validate. 2-3 weeks overall. 7/n

    Deze collectie tonen
    Ongedaan maken
  19. 30 jun.

    If you want to optimize, it's good to setup the env. For me it was to use "perf record/stat, objdump", "Intel VTune", "graph creator", "instruction tables from Agner Fog", "precompressed files", "hardware" and 7 compilers (I forgot that gcc-11 existed, explained later). 6/n

    Deze collectie tonen
    Ongedaan maken
  20. 30 jun.

    Main decompression loop roughly consists of ~1000 instructions and it was a kind of a job to look through it. I am not an advanced assembly expert as it is hard for me to have in mind all latencies for all instructions, yet, I managed to find several opportunities. 5/n

    Deze collectie tonen
    Ongedaan maken

Het laden lijkt wat langer te duren.

Twitter is mogelijk overbelast of ondervindt een tijdelijke onderbreking. Probeer het opnieuw of bekijk de Twitter-status voor meer informatie.

    Je bent misschien ook geïnteresseerd in

    ·