@marcan42 ... oh nevermind, that's 589824 bytes of coefficients. gods help you
@FioraAeterna Yeah, I tried a thing where I loaded some pixels and some coefs to shared mem to optimize for reuse, but it was MUCH slower.
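Whether a shared-mem tiling pass like that wins depends on the reuse factor: each workgroup pays a one-time halo load, then every work-item reads from local memory instead of global. A minimal traffic model sketching that tradeoff, assuming an illustrative 2D filter (the frame size, kernel size, and tile size below are made-up placeholders, not the actual kernel's):

```python
# Rough global-memory traffic model for tiling a 2D filter into shared/local
# memory. All sizes are illustrative assumptions, not the real kernel's.

def global_loads_naive(width, height, ksize):
    # Every output pixel re-reads its full ksize x ksize input neighborhood.
    return width * height * ksize * ksize

def global_loads_tiled(width, height, ksize, tile):
    # Each workgroup loads one (tile + ksize - 1)^2 halo region once,
    # then all work-items read pixels from shared memory instead.
    tiles_x = -(-width // tile)   # ceiling division
    tiles_y = -(-height // tile)
    halo = tile + ksize - 1
    return tiles_x * tiles_y * halo * halo

naive = global_loads_naive(1920, 1080, 5)
tiled = global_loads_tiled(1920, 1080, 5, tile=16)
print(naive / tiled)  # reuse factor bought by the tiling
```

If the kernel still came out much slower with tiling, the extra barriers and bank conflicts in local memory, or simply being bandwidth-bound to begin with, can eat the reuse win.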
@FioraAeterna Yeah, I think I had 120 workitems per workgroup sharing local mem, which I guess might not have been enough?
@marcan42 ... but you might already be maxing out bandwidth and be doomed anyways ;-;
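The "maxing out bandwidth" hypothesis is easy to test: compare the measured kernel time against the memory floor, i.e. bytes that must cross the bus divided by peak bandwidth. A sketch with placeholder numbers; only the 589824-byte coefficient table comes from the thread, while the frame size and the 100 GB/s figure are assumptions:

```python
def min_time_ms(bytes_read, bytes_written, peak_gbps):
    # Lower bound on kernel time if every byte crosses the memory bus once.
    return (bytes_read + bytes_written) / (peak_gbps * 1e9) * 1e3

coef_bytes = 589824            # coefficient table size from the thread
frame_bytes = 1920 * 1080 * 4  # assumed float32 frame, read once, written once
floor = min_time_ms(frame_bytes + coef_bytes, frame_bytes, peak_gbps=100.0)
print(f"memory floor: {floor:.3f} ms")
# If the measured kernel time sits near this floor, the kernel is
# bandwidth-bound: loop restructuring won't help, only moving fewer bytes will.
```

If the measured time is several times the floor, there is still headroom and the CUDA version's speed is explainable by better scheduling rather than magic.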
@FioraAeterna Well unless the CUDA version does magic there's clearly *some* way of speeding this up...
@FioraAeterna It computes the same thing, but I don't know how it divides up the big mesh (input-major, output-major, etc.)
@FioraAeterna The problem is it's a mesh, so you get to pick what to inner loop on and reread everything else, or use a bit of everything.
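That pick-your-inner-loop tradeoff can be made concrete with a traffic count for a dense mesh: whatever operand you don't block on gets re-streamed from memory. Incidentally, 589824 bytes is exactly a 384×384 float32 table, so that size is used below purely as an assumption; the tile sizes are illustrative too:

```python
# Traffic model for a dense mesh: n_out outputs, each depending on all
# n_in inputs through an n_in x n_out coefficient table. Sizes assumed.

def total_traffic(n_in, n_out, tile_in, tile_out):
    # Coefficients are touched exactly once each regardless of loop order.
    coef = n_in * n_out
    # Each input is re-read once per tile of outputs...
    inputs = n_in * (n_out // tile_out)
    # ...and each output partial sum is revisited once per tile of inputs.
    outputs = n_out * (n_in // tile_in)
    return coef + inputs + outputs

# Output-major, no blocking: keep one accumulator, re-read every input
# for every single output.
print(total_traffic(384, 384, tile_in=384, tile_out=1))
# Blocking both dimensions ("use a bit of everything") cuts the re-reads.
print(total_traffic(384, 384, tile_in=16, tile_out=16))
```

The coefficient term dominates either way, which matches the thread's suspicion that the kernel may simply be bandwidth-bound on the table itself.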