@FioraAeterna 3² × 128² coefficients, yes.
@marcan42 ... oh nevermind, that's 589824 bytes of coefficients. gods help you
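A quick sanity check of the size quoted above, assuming 32-bit float coefficients (the thread doesn't state the dtype):

```python
# Size of the convolution weights being discussed: a 3x3 kernel
# between 128 input and 128 output channels, stored as float32
# (the 4-bytes-per-coefficient assumption is mine).
kernel_h = kernel_w = 3
channels_in = channels_out = 128
bytes_per_float = 4

n_coeffs = kernel_h * kernel_w * channels_in * channels_out  # 3^2 * 128^2
size_bytes = n_coeffs * bytes_per_float

print(n_coeffs)    # 147456
print(size_bytes)  # 589824, matching the figure in the thread
```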
@FioraAeterna Yeah, I tried a thing where I loaded some pixels and some coefs to shared mem to optimize for reuse, but it was MUCH slower.
@FioraAeterna Yeah, I think I had 120 workitems per workgroup sharing local mem, which I guess might not have been enough?
@marcan42 ... but you might already be maxing out bandwidth and be doomed anyways ;-;
@FioraAeterna Well unless the CUDA version does magic there's clearly *some* way of speeding this up...
@marcan42 @FioraAeterna cuDNN doesn't do direct convolution http://devblogs.nvidia.com/parallelforall/cudnn-v2-higher-performance-deep-learning-gpus/ - maybe API/docs mention some more implementation detail
@DrDaxxy @FioraAeterna http://arxiv.org/pdf/1410.0759.pdf Found the paper. Reading time!
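The core trick in that paper is lowering convolution to a matrix multiply by unrolling input patches (im2col), rather than computing it directly. A minimal pure-Python sketch with tiny made-up sizes (not the 3x3, 128-channel layer from the thread), checking that the GEMM form matches direct convolution:

```python
import random

# Tiny stand-in sizes for illustration.
H = W = 5
C_in, C_out, K = 2, 3, 3
out_h, out_w = H - K + 1, W - K + 1  # valid padding

random.seed(0)
x = [[[random.random() for _ in range(W)] for _ in range(H)]
     for _ in range(C_in)]
w = [[[[random.random() for _ in range(K)] for _ in range(K)]
      for _ in range(C_in)] for _ in range(C_out)]

# Direct convolution, for reference.
direct = [[[sum(w[o][c][ki][kj] * x[c][i + ki][j + kj]
                for c in range(C_in)
                for ki in range(K)
                for kj in range(K))
            for j in range(out_w)]
           for i in range(out_h)]
          for o in range(C_out)]

# im2col: one column per output position, one row per (c, ki, kj) weight.
cols = [[x[c][i + ki][j + kj]
         for i in range(out_h) for j in range(out_w)]
        for c in range(C_in) for ki in range(K) for kj in range(K)]
w_mat = [[w[o][c][ki][kj]
          for c in range(C_in) for ki in range(K) for kj in range(K)]
         for o in range(C_out)]

# The convolution is now one (C_out x C_in*K*K) @ (C_in*K*K x out_h*out_w) GEMM.
gemm = [[sum(w_mat[o][r] * cols[r][p] for r in range(C_in * K * K))
         for p in range(out_h * out_w)]
        for o in range(C_out)]

flat_direct = [direct[o][i][j] for o in range(C_out)
               for i in range(out_h) for j in range(out_w)]
flat_gemm = [v for row in gemm for v in row]
assert all(abs(a - b) < 1e-9 for a, b in zip(flat_direct, flat_gemm))
```

The payoff is that a single large GEMM maps well onto existing highly tuned GPU matrix-multiply kernels, at the cost of the memory used to materialize the unrolled patches.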