@cmuratori For example, the "0.33" rcp throughput for PANDs in the Intel reference is not "takes 1/3 of a cycle", it's "can issue 3/cycle".
Replying to @rygorous
@cmuratori Likewise, Agner Fog gives 0.5 cycle rcp throughput for some instrs. That's "can issue 2 of these per cycle". But not 2 per port!
Replying to @rygorous
@cmuratori There are 3 ports that can run integer ADD, and it has a rcp throughput of 1/3, meaning you can *actually* issue 3 adds/cycle.
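A minimal sketch of the unit conversion described above, assuming reciprocal throughput is quoted in cycles per instruction. The function names are hypothetical and only illustrate the arithmetic; they model nothing beyond the numbers quoted in the thread.

```python
from fractions import Fraction


def issue_rate_per_cycle(rcp_throughput):
    """Instructions that can start per cycle, given a reciprocal
    throughput in cycles per instruction (e.g. 1/3 or "0.5")."""
    return 1 / Fraction(rcp_throughput)


def cycles_for(count, rcp_throughput):
    """Cycles needed to issue `count` independent instructions,
    assuming they spread perfectly over every eligible port."""
    return count * Fraction(rcp_throughput)


# rcp throughput 1/3 means 3 issues per cycle, not "1/3 of a cycle each".
add_rate = issue_rate_per_cycle(Fraction(1, 3))
# rcp throughput 0.5 means 2 issues per cycle in total, not 2 per port.
half_rate = issue_rate_per_cycle("0.5")
```

Read this way, the reciprocal throughput is a property of the whole core for that instruction, not of any single port.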
Replying to @rygorous
@cmuratori But as soon as you're tracking work at the port level, the actual throughput is 1. (Each port can do *one* of these per clock.)
Replying to @rygorous
@cmuratori So in your throughput calc, everything that multiplies by 1/3 throughput implicitly assumes perfect parallelism already. :)
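To make the port-level view concrete, here is a rough sketch comparing two estimates for a small instruction mix. Everything in it is an assumption for illustration: the port names (`p0`, `p1`, `p5`), the mix itself, one uop per instruction, one uop per port per cycle, and no latency, dependency, or front-end limits. It does not describe any specific CPU.

```python
from fractions import Fraction
from itertools import combinations


def naive_cycles(mix):
    """Sum count * rcp_throughput per instruction kind, taking
    rcp_throughput = 1 / (number of eligible ports).
    This adds up costs as if each kind had the ports to itself."""
    return sum(Fraction(count, len(ports)) for count, ports in mix)


def port_bound_cycles(mix):
    """Lower bound on cycles when each port executes one uop per cycle.
    For every subset of ports, the uops that can run *only* on that
    subset must share those ports; the tightest subset sets the bound."""
    all_ports = sorted(set().union(*(ports for _, ports in mix)))
    best = Fraction(0)
    for size in range(1, len(all_ports) + 1):
        for subset in combinations(all_ports, size):
            allowed = set(subset)
            confined = sum(count for count, ports in mix
                           if set(ports) <= allowed)
            best = max(best, Fraction(confined, size))
    return best


# Hypothetical mix: 3 ops that can go to any of three ports,
# plus 3 ops that can only go to one of them.
mix = [
    (3, {"p0", "p1", "p5"}),  # hypothetical "add"-like op, rcp 1/3
    (3, {"p5"}),              # hypothetical single-port op, rcp 1
]

naive = naive_cycles(mix)          # 3 * 1/3 + 3 * 1 = 4 cycles
bound = port_bound_cycles(mix)     # p5 alone needs 3 cycles
```

In this made-up mix the summed estimate comes out at 4 cycles while the port-level bound is 3, because the flexible ops can run on `p0` and `p1` while `p5` is busy. Summing per-instruction reciprocal throughputs ignores that overlap between instruction kinds.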
Replying to @rygorous
@rygorous @cmuratori so i guess some of the sse instructions were optimized away since the sum was higher than the actual timings?
Replying to @andsz
@rygorous @cmuratori well... also, @cmuratori didn't rerun the calculation after fixing the counting of ANDs in the Q&A. that might be it
Replying to @andsz
@andsz @cmuratori No, the calculation is just wrong on several axes.
Replying to @rygorous
@andsz @cmuratori 1. it's counting instructions that are actually loop-invariant and hoisted out in the real optimized build;