@marcan42 oh gods that is a horrifying ratio of memory reads to arithmetic
Replying to @FioraAeterna
@marcan42 I look at https://github.com/marcan/cl-waifu2x/commit/89176cc52b99ea7bdb0b9b71589379f3d751fe95#diff-d2a065162d2048b6908eee284e028eafR24 … and recoil in horror
Replying to @FioraAeterna
@marcan42 are those at least in local memory, or are these all global reads... *sob*
Replying to @FioraAeterna
@FioraAeterna Global... I've been trying to use local memory to reuse data but everything I try makes it *slower*. *sob*
Replying to @marcan42
@FioraAeterna This is the horrible thing it's trying to compute (well, the worst step). 128 ins, 128 outs, full mesh. pic.twitter.com/aOpHDYoL0z
Replying to @marcan42
@FioraAeterna The "kernels" are distinct 3x3 convolutions. So there are 128x128 independent 3x3 kernels, for each in,out pair.
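For readers following along, the layer described above can be sketched as a plain-Python reference. This is only an illustration, not the code from the linked commit: the function name `full_mesh_conv3x3` is hypothetical, and it assumes no padding, no kernel flipping, and no bias or activation step.

```python
def full_mesh_conv3x3(inputs, kernels):
    """Naive full-mesh 3x3 convolution layer.

    inputs:  list of n_in planes, each an H x W list of floats
    kernels: kernels[o][i] is the 3x3 kernel linking input i to output o
    Returns n_out planes of size (H-2) x (W-2) (assumption: no padding).
    """
    n_in = len(inputs)
    h, w = len(inputs[0]), len(inputs[0][0])
    outputs = []
    for per_input_kernels in kernels:          # one pass per output plane
        plane = [[0.0] * (w - 2) for _ in range(h - 2)]
        for i in range(n_in):                  # every input feeds every output
            k = per_input_kernels[i]
            src = inputs[i]
            for y in range(h - 2):
                for x in range(w - 2):
                    acc = 0.0
                    for ky in range(3):
                        for kx in range(3):
                            acc += src[y + ky][x + kx] * k[ky][kx]
                    plane[y][x] += acc
        outputs.append(plane)
    return outputs
```

With 128 inputs and 128 outputs, the two outer loops alone visit 16384 (in, out) pairs for every output pixel position.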
Replying to @FioraAeterna
@marcan42 ... oh nevermind, that's 589824 bytes of coefficients. gods help you
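The 589824-byte figure can be reproduced with a few lines of arithmetic. The sketch below assumes 4-byte (float32) coefficients, which is the size that matches the number quoted in the thread. `LOCAL_MEM_BYTES` is a hypothetical 32 KiB budget chosen only for scale; it is not a figure from the thread or from any particular device.

```python
N_IN = 128
N_OUT = 128
KERNEL_TAPS = 3 * 3
BYTES_PER_COEF = 4  # assumption: float32 coefficients

coef_count = N_IN * N_OUT * KERNEL_TAPS
coef_bytes = coef_count * BYTES_PER_COEF
print(coef_count)  # 147456
print(coef_bytes)  # 589824

# Hypothetical local-memory budget, for scale only.
LOCAL_MEM_BYTES = 32 * 1024
pairs_that_fit = LOCAL_MEM_BYTES // (KERNEL_TAPS * BYTES_PER_COEF)
print(pairs_that_fit)  # 910 of the 16384 (in, out) pairs
```

Under those assumptions, only a small slice of the coefficient set could sit in local memory at any one time.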
@FioraAeterna Yeah, I tried a thing where I loaded some pixels and some coefs to shared mem to optimize for reuse, but it was MUCH slower.
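A rough way to see what reuse is supposed to buy is to count global loads per multiply-accumulate. The model below is a back-of-envelope sketch, not the scheme tried in the thread: `global_loads_per_mac` is a hypothetical helper, the "no reuse" case assumes every multiply-accumulate fetches its pixel and its coefficient from global memory, and the "reuse" case assumes a workgroup stages one square output tile (plus its 1-pixel halo) and the 9 coefficients per input plane. It counts loads only and ignores caches, barriers, and occupancy, so it says nothing about why the real attempt ran slower.

```python
def global_loads_per_mac(tile, n_in=128, reuse=True):
    """Global loads per multiply-accumulate for one output plane's tile.

    tile:  side length of the square output tile handled by one workgroup
    n_in:  number of input planes feeding the output plane
    reuse: if True, pixels and coefficients are loaded once per tile
    """
    taps = 9                                   # 3x3 kernel
    macs = n_in * taps * tile * tile
    if reuse:
        halo = tile + 2                        # tile plus 1-pixel border
        loads = n_in * (halo * halo + taps)    # pixels + coefficients, once
    else:
        loads = 2 * macs                       # one pixel + one coef per MAC
    return loads / macs


print(global_loads_per_mac(8, reuse=False))   # 2.0
print(round(global_loads_per_mac(8), 3))      # 0.189
```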
Replying to @FioraAeterna
@FioraAeterna Yeah, I think I had 120 workitems per workgroup sharing local mem, which I guess might not have been enough?