:blink: I probably have it ... somewhere ... lemme go spelunking... pic.twitter.com/lz8vZGsJ5B
Replying to @chandlerc1024
Sorry for the deep pull, but your L1 cache miss rate for the 8mb case in this talk looked way lower than what we expected. So some of us were trying to figure out what actually happened. One (wild) guess was that RNG(count) produces highly coherent indices.
Replying to @cmuratori
Possibly. I found the code. I can try to get it into a gist or something if useful. But to understand what went wrong w/ this, it may be easier: it's just a silly slide-code wrapper around `std::mt19937`, seeded the usual way and pushed through `std::uniform_int_distribution`.
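A minimal sketch of what such a slide-code wrapper might look like (the function name `make_indices` and its signature are placeholders, not the actual talk code):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical reconstruction: generate `count` random indices in
// [0, count), using std::mt19937 seeded the usual way and pushed
// through std::uniform_int_distribution.
std::vector<std::size_t> make_indices(std::size_t count) {
  std::random_device rd;
  std::mt19937 rng(rd());
  std::uniform_int_distribution<std::size_t> dist(0, count - 1);
  std::vector<std::size_t> indices(count);
  for (auto &i : indices) i = dist(rng);
  return indices;
}
```

If a standard library's `uniform_int_distribution` were buggy (or MT's output weak in the low bits), these indices could end up more coherent than truly uniform ones, which is the "bad RNG" theory below.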
Replying to @chandlerc1024 @cmuratori
Not sure exactly which standard library I was using, but my memory is that there have been a few bugs in the uniform distribution implementations in some. And MT isn't the best these days. But I didn't apply deep rigor to this, and so entirely possible there are other factors.
Replying to @chandlerc1024 @cmuratori
Super interested in what you end up with?
Replying to @chandlerc1024
Will report back if we figure it out! Our only theories so far are a) bad RNG and b) the EPYC perf reading was new and flaky then. We haven't come up with anything else, except the absurd c) EPYC's prefetch predictor learned the whole random sequence (since it is repeated)
Replying to @cmuratori
(b) is at least true. Not sure that's the cause, but the counters were ... not in great shape. (c) yeah, not betting on that one....
Replying to @chandlerc1024 @cmuratori
The period of MT19937 is 2^19937 − 1, so we can safely say the predictor absolutely didn't store the entire period.
Replying to @Lokathor @chandlerc1024
Well, although option c is absurd (as I said), it's not quite _that_ absurd. Chandler's 8mb benchmark runs the same series of 1,048,576 indices continuously in a loop, so the period is only 1,048,576.
Replying to @cmuratori @chandlerc1024
Mmmm. How big is the branch cache? My intuition from the talk was that the branch predictor worked in terms of "tens of branches" at most, maybe the last 16 or 32 branches taken. But also it's been 2-3 years since I saw the video.
I'm not sure what they are on the chip from the lecture (which _is_ 3+ years old). But my guess would be a 256-entry BTB for L1 and 4K for L2.