OoO works pretty well with array random access. It can calculate the addresses of, and fetch, multiple array items concurrently.
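A minimal C sketch of the pattern being described (names illustrative, not from the thread): in a gather loop, each iteration's load is independent of the others, which is exactly what lets an out-of-order core overlap them.

```c
/* Gather loop: each iteration computes an address from idx[i] and loads
 * data[idx[i]]. There is no cross-iteration dependency between the loads,
 * so an out-of-order core can keep several address calculations and cache
 * misses in flight at once instead of stalling on each in turn. */
double gather_sum(const double *data, const int *idx, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += data[idx[i]];  /* independent address calc + load per iteration */
    return sum;
}
```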
-
-
Short version is that even if you have a workload that scales to 100 cores, that's not necessarily a very power-efficient thing to do either. There are structural reasons why communication within a core is more efficient than between cores. Having lots of small cores work on disjoint data gives good power/perf, if you have that kind of workload. But if there's any potential for sharing or need to communicate, things change.
-
Memory access (and more mem BW) is *crazy* expensive in terms of power. Hence, caches. But caches are only good with sufficient locality.
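To illustrate "only good with sufficient locality", here is a standard example (sizes and names are mine, not from the thread): the two functions do identical arithmetic, but one walks memory in order and the other strides across it.

```c
#define N 1024  /* illustrative size; large enough that a column doesn't fit in cache */

/* Good locality: consecutive addresses, roughly one cache miss per line. */
double sum_rows(const double (*m)[N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Poor locality: each access strides by a whole row (N * 8 bytes), so for
 * large N nearly every access misses and goes out to memory. */
double sum_cols(const double (*m)[N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```

Same result, same instruction count; the difference shows up in cache misses and hence in power and time.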
-
Over the past 10 years, two trends: 1. a small number of relatively fancy cores good at extracting parallelism out of code that has good locality of reference (and then use a good cache hierarchy). 2. A large number of super-dumb cores.
New conversation -
-
-
you can, if you are willing to sacrifice cache coherency: https://www.extremetech.com/extreme/230458-meet-the-new-worlds-fastest-supercomputer-chinas-taihulight
-
You only need coherency at atomics, & that can be done via global flush on any atomic
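A rough sketch of what "global flush on any atomic" could look like on a machine without hardware coherency. The two `cache_*` functions are hypothetical platform hooks, stubbed out here so the sketch compiles; on real hardware they would be privileged cache-maintenance operations.

```c
#include <stdatomic.h>

/* Hypothetical cache-maintenance hooks (stubs, not a real API). */
static void cache_writeback_all(void)  { /* write all dirty lines back to memory */ }
static void cache_invalidate_all(void) { /* discard all locally cached lines */ }

/* Coherency only at atomics: bracket the one access that must be globally
 * visible with a whole-cache writeback and invalidate. */
void noncoherent_atomic_store(atomic_int *p, int v) {
    cache_writeback_all();   /* publish everything this core has written */
    atomic_store(p, v);      /* the globally visible access */
    cache_invalidate_all();  /* drop stale lines so we see other cores' writes */
}

int noncoherent_atomic_load(atomic_int *p) {
    cache_invalidate_all();  /* refetch rather than trust local copies */
    return atomic_load(p);
}
```

Flushing the entire cache on every atomic is also why, as the next reply says, performance would be pretty terrible.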
-
Performance would be pretty terrible though, I think. :-P
-
yeah. one can make synchronization and inter-core communication be special cases.
-
you could also have non-coherent pages (default) and coherent pages for communication.
-
That needs a new programming model. HW imposing that = destined to fail miserably.
-
my idle thinking here was probably a "make this page sync" opcode or similar.
-
That's a radically different programming model. Locks don't work to synchronize.
New conversation -
-
-
make -j800 ftw
-
-
-
In addition to what ryg has said: interconnect explodes, memory hierarchy becomes more complex
-
your 100 super dumb cores need caches, which need snooping. interconnect needs snoop filters...
-
nah, we'll solve this at the PL level, right??
-
maybe give each one local memory and message passing; no big unified addr space?
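A toy C sketch of that model (all names and sizes are mine, not any real architecture's): each core owns a private scratchpad plus an inbox, and the only cross-core traffic is an explicit send into another core's inbox.

```c
#include <stdint.h>

#define INBOX_SLOTS 16  /* illustrative inbox depth */

struct core {
    uint32_t scratch[256];        /* private local memory, never shared */
    uint32_t inbox[INBOX_SLOTS];  /* the only state other cores may write */
    unsigned head, tail;          /* ring-buffer indices for the inbox */
};

/* Runs on the sender. The hardware analogue would be a network-on-chip or
 * DMA transfer, not a coherent shared-memory write. */
int send_msg(struct core *dst, uint32_t msg) {
    if (dst->head - dst->tail == INBOX_SLOTS)
        return -1;  /* inbox full */
    dst->inbox[dst->head++ % INBOX_SLOTS] = msg;
    return 0;
}

/* Runs on the receiver, polling its own inbox. */
int recv_msg(struct core *self, uint32_t *out) {
    if (self->head == self->tail)
        return -1;  /* nothing pending */
    *out = self->inbox[self->tail++ % INBOX_SLOTS];
    return 0;
}
```

No caches to snoop and no coherency traffic; the cost, as the next reply points out, is that you've reinvented distributed computing.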
-
AKA distributed computing. Now you have 10 problems, and that's only the ones you already know about!
End of conversation