Michaël Benesty
@pommedeterre33
Apply mathemagic to law understanding · Head of R&D · Former tax lawyer, CPA, financial audit · Core dev ex
127.0.0.1 · linkedin.com/in/mbenesty/ · Joined January 2010

Michaël Benesty’s Tweets

So in the end, common tile sizes that work well on today's GPUs are 64x64, 64x128, or 128x128, not bigger. These tiles are hardcoded into libraries such as CUTLASS or cuBLAS and themselves depend on low-level GPU instructions.
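To see why tiles top out around 128x128, here is a rough back-of-envelope sketch. The fp16 operands, fp32 accumulators, K-slice of 32, and 256 threads per block are all illustrative assumptions, not numbers from the thread:

```python
# Estimate the on-chip cost of one matmul tile (all parameters assumed).
def tile_cost(tile_m, tile_n, k_slice=32, elem_bytes=2, acc_bytes=4, threads=256):
    smem = (tile_m + tile_n) * k_slice * elem_bytes  # A/B slices staged in shared memory
    acc = tile_m * tile_n * acc_bytes                # C accumulators, kept in registers
    regs_per_thread = acc // threads // 4            # 4-byte registers per thread
    return smem, acc, regs_per_thread

for m, n in [(64, 64), (64, 128), (128, 128), (256, 256)]:
    smem, acc, regs = tile_cost(m, n)
    print(f"{m}x{n}: {smem // 1024} KiB smem/iter, "
          f"{acc // 1024} KiB accumulators (~{regs} regs/thread)")
```

At 256x256 the accumulators alone would need roughly 256 registers per thread, past the 255-register-per-thread limit of current NVIDIA GPUs, which is the spilling the thread mentions.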
A larger tile means better data reuse and faster execution, since the usual bottleneck on a GPU is memory bandwidth, rarely compute speed. But a bigger tile also means more registers in use, and at some point register spilling, which badly hurts performance.
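To make the data-reuse claim concrete: producing an MxN output tile loads roughly (M+N)*K elements from global memory but performs 2*M*N*K FLOPs, so the FLOPs-per-load ratio is 2*M*N/(M+N) and grows with tile size. A minimal sketch:

```python
# FLOPs per element loaded from global memory for an MxN output tile:
# loads ~ (M + N) * K elements, work ~ 2 * M * N * K FLOPs.
for m, n in [(16, 16), (32, 32), (64, 64), (128, 128)]:
    intensity = 2 * m * n / (m + n)
    print(f"{m}x{n} tile: {intensity:.0f} FLOPs per loaded element")
```

Doubling the tile side doubles the reuse, which is why larger tiles help right up until registers run out.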
If the shapes of A and B are not multiples of the tile, you must perform boundary checks, masking, etc. Some of these constraints are heavier when tensor cores are used (vectorization requirements) -> more ops to execute, if/else and divergent branches, etc., aka slower execution.
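A toy NumPy sketch of the masking idea (real kernels do this with predicated loads; the array length and tile width here are made up for illustration):

```python
import numpy as np

TILE = 8
A = np.arange(10, dtype=np.float32)   # length 10: not a multiple of TILE

for t0 in range(0, len(A), TILE):
    idx = t0 + np.arange(TILE)        # full-width index vector
    mask = idx < len(A)               # predicate: which lanes are in bounds
    vals = np.where(mask, A[np.minimum(idx, len(A) - 1)], 0.0)
    print(t0, vals)                   # last tile carries zeros in dead lanes
```

Every tile pays for the extra compare/select, which is the overhead the tweet describes.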
Basically, the math for matrix multiplication is different on GPUs from what is usually taught at school. When you do it by hand, you manipulate entire rows/columns, but on GPUs you usually multiply blocks (tiles) together to improve data reuse.
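A minimal NumPy sketch of the block (tile) multiplication described above, assuming the shapes are exact multiples of the tile so no boundary checks are needed:

```python
import numpy as np

TILE = 64
M = N = K = 256

rng = np.random.default_rng(0)
A = rng.standard_normal((M, K)).astype(np.float32)
B = rng.standard_normal((K, N)).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)

# Each C tile accumulates products of a row of A tiles with a column of
# B tiles; on a GPU those tiles sit in shared memory/registers and are
# reused many times instead of being re-read from global memory.
for i in range(0, M, TILE):
    for j in range(0, N, TILE):
        for k in range(0, K, TILE):
            C[i:i+TILE, j:j+TILE] += A[i:i+TILE, k:k+TILE] @ B[k:k+TILE, j:j+TILE]

assert np.allclose(C, A @ B, atol=1e-3)
```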
It is very interesting that so many people are surprised by this and seem to see it as black magic. This perf boost is most likely due to "tiling", which is the way matmuls are done on GPUs. 🧵
Quote Tweet
The most dramatic optimization to nanoGPT so far (~25% speedup) is to simply increase vocab size from 50257 to 50304 (nearest multiple of 64). This calculates added useless dimensions but goes down a different kernel path with much higher occupancy. Careful with your Powers of 2.
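A hedged sketch of the quoted trick: round the vocab up to the next multiple of 64 and ignore the padding logits. The d_model value and the slicing strategy here are illustrative, not necessarily nanoGPT's exact code:

```python
import torch

real_vocab = 50257
padded_vocab = ((real_vocab + 63) // 64) * 64   # rounds up to 50304
print(padded_vocab)

d_model = 768  # assumed model width for illustration
lm_head = torch.nn.Linear(d_model, padded_vocab, bias=False)
logits = lm_head(torch.randn(2, 128, d_model))  # (batch, seq, padded_vocab)
logits = logits[..., :real_vocab]               # drop the padding logits
```

The extra 47 slots are wasted math, but the matmul dimensions now match the tile sizes the kernels prefer.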
It's crazy that, at 60% model FLOPs utilization (FP8) on H100, the original GPT-3 configuration can be trained in 3 days on 1024 H100s, and PaLM in 12 days on 2048 H100s. That's roughly 50x fewer GPU hours than the GPT-3 paper 3 years back, and 9x fewer than PaLM, released 9 months back.
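The 3-day figure is consistent with the standard 6 * params * tokens FLOPs estimate. A quick check, assuming GPT-3's 175B parameters and 300B training tokens and a dense FP8 peak of ~1979 TFLOPS per H100 SXM (my assumed numbers, not from the tweet):

```python
params = 175e9                       # GPT-3 parameter count
tokens = 300e9                       # GPT-3 training tokens
flops = 6 * params * tokens          # ~3.15e23 FLOPs total

gpus = 1024
peak_fp8 = 1979e12                   # FLOP/s per H100 SXM, dense FP8 (assumed)
mfu = 0.60                           # model FLOPs utilization from the tweet
sustained = gpus * peak_fp8 * mfu    # ~1.2e18 FLOP/s for the cluster

print(flops / sustained / 86400)     # ~3.0 days
```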
Intel has slashed prices by as much as 20% on older-generation Alder Lake microprocessors in order to clear inventories, DigiTimes reports, citing supply chain sources. It says the price of an i9 CPU has been cut by US$70-80. $INTC
Let's talk about a detail that occurs during PyTorch 2.0's codegen - tiling. In many cases, tiling is needed to generate efficient kernels. Even for something as basic as torch.add(A, B), you might need tiling to be efficient! But what is tiling? And when is it needed? (1/13)
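A sketch of the kind of layout mismatch that makes tiling matter even for a pointwise add (my illustration, not necessarily the thread's example): one input is row-major, the other a transposed view with reversed strides, so any single traversal order is contiguous for one input and strided for the other, while a 2D tile that fits in cache suits both:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randn(4096, 4096, device=device)      # row-major, contiguous
B = torch.randn(4096, 4096, device=device).t()  # transposed view, strides reversed

C = torch.add(A, B)  # under torch.compile, the generated kernel would tile this
```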
We are hiring at the #Innovation team of Lefebvre Sarrut within R&D. We are looking for an international Product Manager. You will be working directly with Manu Mateo @elsitlab, Thomas Roy ... and myself! And many other colleagues of R&D and…
You probably heard about Yandex; it's the 4th biggest search engine by market share worldwide. Yesterday, the proprietary source code of Yandex was leaked. The most interesting part for the SEO community is the list of all 1922 ranking factors used in the search algorithm. [🧵THREAD]