Trying to optimize my OpenCL code for GPU. My "GPU-optimized" version runs a bit faster on CPU and much slower on GPU. Fail.
@FioraAeterna Global... I've been trying to use local memory to reuse data but everything I try makes it *slower*. *sob*
@FioraAeterna This is the horrible thing it's trying to compute (well, the worst step). 128 ins, 128 outs, full mesh. pic.twitter.com/aOpHDYoL0z
@FioraAeterna The "kernels" are distinct 3x3 convolutions. So there are 128x128 independent 3x3 kernels, one for each (in, out) pair.
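For reference, the step described above can be sketched as a plain loop nest. This is only a minimal, unoptimized illustration of the arithmetic, not the OpenCL or CUDA code discussed in the thread. The function name `conv_full_mesh`, the nested-list layout, the lack of padding, and the absence of a bias term are all assumptions made for the example.

```python
def conv_full_mesh(inputs, weights):
    """Full-mesh 3x3 convolution (hypothetical reference sketch).

    inputs:  [c_in][h][w] nested lists of floats
    weights: [c_out][c_in][3][3], one distinct 3x3 kernel per (in, out) pair
    returns: [c_out][h-2][w-2] (no padding, so each side shrinks by 2)

    The kernel is applied without flipping (cross-correlation), which is
    an assumption; the thread does not say which convention is used.
    """
    c_in = len(inputs)
    h = len(inputs[0])
    w = len(inputs[0][0])
    outputs = []
    for o in range(len(weights)):
        plane = [[0.0] * (w - 2) for _ in range(h - 2)]
        for i in range(c_in):
            kernel = weights[o][i]
            src = inputs[i]
            for y in range(h - 2):
                for x in range(w - 2):
                    acc = 0.0
                    for ky in range(3):
                        for kx in range(3):
                            acc += src[y + ky][x + kx] * kernel[ky][kx]
                    # Every input plane contributes to every output plane.
                    plane[y][x] += acc
        outputs.append(plane)
    return outputs
```

With 128 inputs and 128 outputs, the two outer loops visit all 128x128 kernels, which is the "full mesh" part.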
@marcan42 I can't be totally sure, but I worry this sort of kernel (low-compute, high-memory) is inherently slow on GPUs ;_;
@FioraAeterna Thing is, the original CUDA version is faster. I want to at least match that (to within reason). But it uses a blob lib.
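A rough back-of-the-envelope count may help put the low-compute, high-memory worry in numbers. The sketch below assumes 32-bit floats and a deliberately naive access pattern in which every multiply-add fetches its input value and its weight from memory with no reuse. Neither assumption comes from the thread, and `conv_layer_cost` is a hypothetical helper name.

```python
def conv_layer_cost(c_in=128, c_out=128, k=3, bytes_per_value=4):
    """Rough per-output-pixel cost of a full-mesh k x k convolution.

    Assumes bytes_per_value-sized values (4 = 32-bit float) and, for the
    'naive' figures, that nothing is reused from registers, local memory,
    or cache.
    """
    # One multiply-add per weight for each output pixel position.
    macs_per_pixel = c_in * c_out * k * k
    flops_per_pixel = 2 * macs_per_pixel  # count multiply and add separately
    # Total size of all the kernels.
    weight_bytes = c_in * c_out * k * k * bytes_per_value
    # Naive case: each multiply-add loads one input value and one weight.
    naive_bytes_per_pixel = 2 * macs_per_pixel * bytes_per_value
    return {
        "macs_per_pixel": macs_per_pixel,
        "flops_per_pixel": flops_per_pixel,
        "weight_bytes": weight_bytes,
        "naive_bytes_per_pixel": naive_bytes_per_pixel,
        "naive_flops_per_byte": flops_per_pixel / naive_bytes_per_pixel,
    }


cost = conv_layer_cost()
print(cost)
```

Under these assumptions the naive version performs 0.25 floating-point operations per byte loaded, and the weights alone take 576 KiB. How much of that traffic local memory can actually remove depends on the device and the tiling, which this count does not model.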