My colleagues developed tiny-cuda-nn, a self-contained framework written in CUDA for training and deploying "fully fused" MLPs. It can speed up NeRF-style research and apps dramatically. Here's an example of training a 2D rendering function: (x, y) -> (R, G, B) 2/
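To make the 2D image-fitting task concrete, here is a minimal NumPy sketch of it: a small MLP trained to map pixel coordinates (x, y) to colors (R, G, B). The target "image" is a synthetic color gradient, and the network shape and hyperparameters are illustrative choices, not tiny-cuda-nn's defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

def target_image(xy):
    # Hypothetical smooth test function standing in for a real photo.
    x, y = xy[:, 0], xy[:, 1]
    return np.stack([np.sin(3 * x) * 0.5 + 0.5,
                     np.cos(3 * y) * 0.5 + 0.5,
                     x * y], axis=1)

# Narrow MLP, 2 -> 64 -> 64 -> 3 ("only 64 hidden units", as in the thread).
W1 = rng.normal(0, 0.5, (2, 64)); b1 = np.zeros(64)
W2 = rng.normal(0, 0.2, (64, 64)); b2 = np.zeros(64)
W3 = rng.normal(0, 0.2, (64, 3));  b3 = np.zeros(3)

def forward(xy):
    h1 = np.maximum(xy @ W1 + b1, 0.0)   # ReLU
    h2 = np.maximum(h1 @ W2 + b2, 0.0)
    return h1, h2, h2 @ W3 + b3

lr = 1e-2
losses = []
for step in range(500):
    xy = rng.uniform(0, 1, (256, 2))     # random batch of pixel coords
    rgb = target_image(xy)
    h1, h2, pred = forward(xy)
    err = pred - rgb
    losses.append(float(np.mean(err ** 2)))
    # Manual backprop through the two ReLU layers (plain SGD on MSE).
    gW3 = h2.T @ err;  gb3 = err.sum(0)
    d2 = (err @ W3.T) * (h2 > 0)
    gW2 = h1.T @ d2;   gb2 = d2.sum(0)
    d1 = (d2 @ W2.T) * (h1 > 0)
    gW1 = xy.T @ d1;   gb1 = d1.sum(0)
    n = len(xy)
    W3 -= lr * gW3 / n; b3 -= lr * gb3 / n
    W2 -= lr * gW2 / n; b2 -= lr * gb2 / n
    W1 -= lr * gW1 / n; b1 -= lr * gb1 / n

print(losses[0], losses[-1])  # loss should drop substantially
```

The point of the fully fused version is that exactly this loop — forward, backward, update — runs as a handful of CUDA kernels instead of dozens of framework ops.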
In neural rendering, MLPs are typically narrow (e.g. only 64 hidden units). This means their weights can fit into GPU registers, and the intermediate activations can fit in shared memory! With some CUDA magic, MLPs can be fully fused and run on GPUs with staggering speed. 3/
Image
1
24
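A quick back-of-envelope check of that claim: a 64-wide layer's weights and a thread block's activations really are tiny by GPU standards. The capacity figures below are typical of recent NVIDIA GPUs but are assumptions — check your own device's limits.

```python
HIDDEN = 64
FP16_BYTES = 2

# One hidden layer's weight matrix: 64 x 64 half-precision values.
weights_per_layer = HIDDEN * HIDDEN * FP16_BYTES          # bytes

# Register file per SM is typically 256 KB (65,536 32-bit registers).
register_file = 256 * 1024

# Activations for a 128-thread block, one 64-wide vector per thread.
threads_per_block = 128
activations = threads_per_block * HIDDEN * FP16_BYTES     # bytes

# Shared memory per block is commonly 48-100 KB; use 48 KB as a floor.
shared_mem = 48 * 1024

print(weights_per_layer)  # 8192 bytes  = 8 KB per layer
print(activations)        # 16384 bytes = 16 KB per block
assert weights_per_layer < register_file
assert activations < shared_mem
```

So the weights stay resident in registers and the activations never leave shared memory, which is what lets the whole MLP run as one fused kernel without round trips to global memory.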
Tiny-CUDA-NN has a nice C++ API and Python bindings for PyTorch. It also natively supports a ton of different input encodings, losses, and optimizers, all in CUDA land! I know firsthand how painful it is to debug CUDA, so mad respect for the authors! 4/
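Here is a sketch of what driving it from PyTorch looks like. The config mirrors the JSON examples in the tiny-cuda-nn repository — field names like "FullyFusedMLP" and "HashGrid" come from its docs, but treat the exact schema and call signatures as assumptions and verify against the README.

```python
config = {
    "encoding": {            # input encoding, evaluated in CUDA
        "otype": "HashGrid",
        "n_levels": 16,
        "n_features_per_level": 2,
        "log2_hashmap_size": 19,
        "base_resolution": 16,
    },
    "network": {             # the fully fused MLP itself
        "otype": "FullyFusedMLP",
        "activation": "ReLU",
        "output_activation": "None",
        "n_neurons": 64,     # narrow layers, as discussed above
        "n_hidden_layers": 2,
    },
}

# With the PyTorch bindings installed, usage looks roughly like this
# (hypothetical call shapes -- check the bindings' docs):
#
#   import tinycudann as tcnn
#   model = tcnn.NetworkWithInputEncoding(
#       n_input_dims=2, n_output_dims=3,
#       encoding_config=config["encoding"],
#       network_config=config["network"])
#   rgb = model(xy)   # xy: (N, 2) tensor on the GPU

print(config["network"]["n_neurons"])  # 64
```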
It would be interesting to replace the ALU with a simpler, approximate compute unit (analogue, perhaps?) and gain extra chip area and faster execution at the cost of some accuracy. But that would rule out a lot of other use cases.