matmul is just lots of dot products.
matmul(x, y) is "every row of x, dot-producted against every column of y": output[i][j] = x[i] · y[:, j].
you can express:
matmul(x, y)
as lots of dot products:
((x.unsqueeze(-2).expand(-1, y.size(-1), -1)) * y.T).sum(-1)
briefly materializes an intermediate that's x.size(-1) times bigger than the output, tho
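quick sanity check of that identity (toy shapes, just for illustration):

import torch

x = torch.randn(5, 4)
y = torch.randn(4, 3)

# (5, 1, 4) -> (5, 3, 4): one copy of each row of x per column of y (expand is a view, no copy yet)
rows = x.unsqueeze(-2).expand(-1, y.size(-1), -1)
# y.T is (3, 4); the elementwise product materializes the (5, 3, 4) intermediate
out = (rows * y.T).sum(-1)

assert torch.allclose(out, x @ y)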
ERA-Solver seems to absolutely slap. Its FID at 10 steps is lower than what the other samplers achieved at 100 steps. Note: their FID comparison covers only the *older* DPM-Solvers; DPM-Solver++ is compared only on computation time. DEIS isn't mentioned at all.
I should clarify: I don't mean that attention got 50% faster. I mean that *image generation* got 50% faster; attention is one part of that.
On Mac I measured attention accounting for just under half of the Unet's time, for 512x512 images.
I also wanna check out pytorch's built-in flash attention and see how it compares to xformers.
compiled pytorch 2.0 alpha with USE_FLASH_ATTENTION=ON; will give it a spin soon
one of the first things I want to do now that I have CUDA is try out flash attention.
xformers gave me a 50% speed increase (non-rigorous measurement).
doesn't support attention masks though (unless you have an A100)
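for reference, the two APIs I'm comparing, side by side (a rough sketch, not the actual diffusers integration; assumes a CUDA box, fp16, and toy shapes):

import torch
import torch.nn.functional as F
import xformers.ops as xops

# toy shapes: (batch, tokens, heads, head_dim)
q, k, v = (torch.randn(2, 4096, 8, 64, device='cuda', dtype=torch.float16) for _ in range(3))

# xformers memory-efficient attention expects (batch, tokens, heads, head_dim)
out_xf = xops.memory_efficient_attention(q, k, v)

# PyTorch 2.0's built-in fused attention expects (batch, heads, tokens, head_dim);
# it dispatches to flash / memory-efficient kernels when it can
out_pt = F.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
).transpose(1, 2)

print((out_xf - out_pt).abs().max())  # should be ~fp16 rounding error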
here's how I encrypted my home directory on Ubuntu with ZFS.
at login: password authenticates user *and* decrypts the home directory.
once all sessions log out: home directory is unmounted, encrypted again.
works over SSH too
it actually worked!
you can split floats into integer mantissae and exponents, and compute matmul via elementwise products.
the exponent parts are cheap to multiply (multiplying powers of two is just adding their exponents!)
new datatype for ML training?
https://gist.github.com/Birch-san/4f4945f219aa0118712a3f2fc619eba2…
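roughly the idea (a sketch using torch.frexp's fractional mantissae rather than integer ones, combined with the matmul-as-dot-products trick from earlier; not the gist's exact code):

import torch

x = torch.randn(5, 4)
y = torch.randn(4, 3)

# split each operand: value == mantissa * 2**exponent (mantissa in [0.5, 1), exponent an integer)
mx, ex = torch.frexp(x)
my, ey = torch.frexp(y)

# broadcast to (rows of x, columns of y, shared dim)
mx, ex = mx.unsqueeze(-2), ex.unsqueeze(-2)    # (5, 1, 4)
my, ey = my.T.unsqueeze(0), ey.T.unsqueeze(0)  # (1, 3, 4)

# multiply mantissae; "multiply" the power-of-two parts by adding exponents; recombine and reduce
out = torch.ldexp(mx * my, ex + ey).sum(-1)    # (5, 3)

assert torch.allclose(out, x @ y)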
added wacky undocumented batching support to diffusers-play, to get ready to make animations on the new GPU.
took me a while to work out that it was PyTorch (MPS) that was broken, not my CFG implementation.
will try out CUDA soon though!
https://github.com/Birch-san/diffusers-play/pull/3…
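for context, the batched-CFG step I mean looks roughly like this (a sketch against diffusers' UNet2DConditionModel interface; the names and guidance scale are illustrative, not the actual diffusers-play code):

import torch

def cfg_noise_pred(unet, x, t, cond_emb, uncond_emb, cfg_scale=7.5):
    # run the conditional and unconditional prompts through the Unet as one batch
    noise_cond, noise_uncond = unet(
        torch.cat([x, x]), t, encoder_hidden_states=torch.cat([cond_emb, uncond_emb]),
    ).sample.chunk(2)
    # classifier-free guidance: extrapolate away from the unconditional prediction
    return noise_uncond + cfg_scale * (noise_cond - noise_uncond)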
I actually prefer the undersampled image to the converged image.
this one is the same seed with a 15-step Karras schedule running the full gamut of sigmas, rho=7.
k-diffusion DPM-Solver++ (2M) with (2S) LMS warmup
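roughly that setup in k-diffusion (a sketch: the sigma range is a typical SD 1.x assumption, the denoiser stand-in is a placeholder, and the (2S)/LMS warmup is omitted):

import torch
from k_diffusion.sampling import get_sigmas_karras, sample_dpmpp_2m

def denoiser(x, sigma):
    # stand-in for a k-diffusion Denoiser wrapper around the real Unet:
    # takes noisy latents + sigma, returns predicted clean latents
    return torch.zeros_like(x)

# 15-step Karras schedule over the full sigma range, rho=7 (returns n+1 values, ending at 0)
sigmas = get_sigmas_karras(n=15, sigma_min=0.0292, sigma_max=14.61, rho=7.)

x = torch.randn(1, 4, 64, 64) * sigmas[0]
sample = sample_dpmpp_2m(denoiser, x, sigmas)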
I can now build my first CUDA machine learning rig (after 7 months of learning ML on Mac)
looking forward to trying flash attention and Triton!
happy to be contributing to open-source ML R&D 💖 https://pcpartpicker.com/user/Birchlabs
the comparison above measured commit 7f4cf84 (kulinseth/master) as 10% faster than commit acab0ed (origin/master).
15 steps DPM-Solver++ (2M) (k-diffusion), float16 Unet, float32 sampling; times averaged over 10 single-image generations, CFG enabled.
https://github.com/kulinseth/pytorch…
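the harness for those timings is basically this (a sketch, not my exact script; assumes torch.mps.synchronize is available in your build):

import time
import torch

def avg_seconds(generate_one_image, runs=10):
    # wall-clock average over `runs` single-image generations
    times = []
    for _ in range(runs):
        torch.mps.synchronize()  # MPS dispatch is async; wait for any prior GPU work
        start = time.perf_counter()
        generate_one_image()     # e.g. one 15-step, CFG-enabled, 1-image generation
        torch.mps.synchronize()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)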
okay so the latest PyTorch Mac speedup is good and all (12% faster since Dec 23, as discussed) but kulinseth's branch is even faster than that.
15 Unet steps 10.76->9.76s; avg of 10
10% faster!
#stablediffusion
#stablediffusion on Mac got 12% faster! (12 Unet steps 12.03->10.76s; avg of 10)
upgraded pytorch 789b143...acab0ed (Dec 23–Jan 6)
wonder if it's due to razarmehr's torch.nn.functional.linear() optimizations (which made BERT 3x faster)
https://github.com/pytorch/pytorch/issues/91737…
more to come when macOS updates!
torch.nn.functional.linear() optimization for Mac just merged; makes BERT 3x faster (GPU faster than CPU for the first time).
upcoming changes to Ventura will enable more matmuls to be optimized.
looking forward to benchmarking #stablediffusion.
https://github.com/pytorch/pytorch/pull/91114…