@FioraAeterna https://github.com/marcan/cl-waifu2x … OpenCL is hard and Nvidia makes it harder by not having a profiler...
Replying to @FioraAeterna
@marcan42 and the variable trip count loops make me, as a compiler person, REALLY wish those could be constants...
@FioraAeterna Thankfully none of that is performance-sensitive code, the OpenCL kernel is all that matters :-)
@marcan42 oh gods that is a horrifying ratio of memory reads to arithmetic
@marcan42 I look at https://github.com/marcan/cl-waifu2x/commit/89176cc52b99ea7bdb0b9b71589379f3d751fe95#diff-d2a065162d2048b6908eee284e028eafR24 … and recoil in horror
@marcan42 are those at least in local memory, or are these all global reads... *sob*
@FioraAeterna Global... I've been trying to use local memory to reuse data but everything I try makes it *slower*. *sob*
@FioraAeterna Thing is, the original CUDA version is faster. I want to at least match that (within reason). But it uses a blob lib.