@cmuratori That was why. Splitting these calcs into scalars was what allowed the code motion to happen and made it much faster.
Replying to @rygorous
@cmuratori It had nothing to do with inlining. As I already said then. :)
Replying to @rygorous
@cmuratori That's also why your measured throughput is higher than the value you get from the clock count - several of these ops do not...
Replying to @rygorous
@cmuratori actually happen per pixel in the measured loop, since they are hoisted.
Replying to @rygorous
@cmuratori Eh, why your estimated total is higher than the observed clock count.
Replying to @rygorous
@cmuratori Nothing to do with issuing in parallel. The throughput already takes that into account.
Replying to @cmuratori
@rygorous So if you have 10 adds and 10 muls, you could do them simultaneously, right, because they issue to different units.
Replying to @cmuratori
@rygorous So the throughput doesn't "take that into account", in that sense... you're just saying that _for the mul_ itself?
Replying to @cmuratori
@rygorous Anyway I will go look for hoistables so we can write it that way - I didn't realize we were still missing some :/
@rygorous And thanks for the bilinear tip, etc.!!
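
For anyone following along, here is a rough sketch of the kind of hoisting the thread is talking about. It is not the code under discussion: the function names, parameters, and arithmetic are all made up for illustration. The point is only that a scalar that does not depend on the pixel index can be computed once before the loop, so it no longer costs anything per pixel.

```c
#include <stddef.h>

/* Hypothetical per-pixel loop. scale_x * scale_y does not depend on i,
   but as written it is recomputed for every pixel. */
void shade_row_naive(float *dst, const float *src, size_t count,
                     float scale_x, float scale_y, float bias)
{
    for (size_t i = 0; i < count; ++i) {
        float k = scale_x * scale_y;  /* loop-invariant work inside the loop */
        dst[i] = src[i] * k + bias;
    }
}

/* Same computation with the invariant scalar hoisted out by hand. */
void shade_row_hoisted(float *dst, const float *src, size_t count,
                       float scale_x, float scale_y, float bias)
{
    float k = scale_x * scale_y;      /* computed once, before the loop */
    for (size_t i = 0; i < count; ++i) {
        dst[i] = src[i] * k + bias;   /* only the per-pixel work remains */
    }
}
```

An optimizing compiler will often do this move on its own for a case this simple. A per-pixel op count taken from the naive form would then overestimate the work the measured loop actually does, which is the mismatch described above.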