Which part? The 10 Gbps or the ECC RAM?
Replying to @BuschnicK
Since we cannot do large RAM surveys (we're a small company), we rely on multiple papers from ~2010 (including by your employer :) that detail the failure patterns and frequency of bit errors in RAM modules.
Replying to @cmuratori @BuschnicK
Based on the rates observed, it was very clear that the extra cost of ECC RAM more than pays for itself, since fault-tolerant software in the face of random bit errors is very CPU-intensive, and honestly I'm not sure it can even be done in all cases.
Replying to @cmuratori
Sure. It just seemed like a curious thing to ask for explicitly. I thought there might have been some rationale behind it in terms of application needs.
Replying to @BuschnicK
One non-obvious thing that happens is that the more efficient your program is, the worse non-ECC is. Fatty systems have a lot more bits that don't actually do anything. The more optimized your system is, the more likely it is that a random bit error is dangerous.
Replying to @cmuratori @BuschnicK
A typical production system today has so much garbage code in it that there is a massive bit surface where any particular bit flip does not actually do anything but potentially crash the interpreter / exe.
Replying to @cmuratori @BuschnicK
A very efficient system is a very tiny bit of code and a very large amount of data, and if that data is all tight and well-organized, a very large portion of it is "important".
Replying to @cmuratori @BuschnicK
You then have a problem whereby you have to start modeling what will happen should a bit flip affect any number of difficult-to-correct things, and you have to implement a lot of, for example, hash trees across basically everything.
Replying to @cmuratori @BuschnicK
Because literally any control structure in the entire system that is not hash-verified is now a critical failure point.
With ECC, it seems to me that you can be much more selective about what you verify and what you don't, because the undetected error rates are vastly more favorable.
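
To make the "hash trees across basically everything" idea concrete, here is a minimal Python sketch of a hash tree over fixed-size data blocks, with a simulated single-bit flip. It is an illustration only, not the scheme used in any real system discussed in this thread: the names `merkle_root`, `leaf_hashes`, `flip_bit`, and `first_corrupt_block` are hypothetical, and the choice of SHA-256 and of duplicating the last node on odd-sized levels are assumptions.

```python
import hashlib


def _h(data: bytes) -> bytes:
    """SHA-256 digest; the hash function choice is an assumption."""
    return hashlib.sha256(data).digest()


def leaf_hashes(blocks):
    """Hash each block with a leaf prefix so leaves and inner nodes differ."""
    return [_h(b"\x00" + b) for b in blocks]


def merkle_root(blocks):
    """Return the root of a binary hash tree built over `blocks`."""
    if not blocks:
        return _h(b"")
    level = leaf_hashes(blocks)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]


def flip_bit(block: bytes, bit_index: int) -> bytes:
    """Simulate a memory error by flipping one bit of `block`."""
    out = bytearray(block)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


def first_corrupt_block(blocks, trusted_leaves):
    """Return the index of the first block whose hash no longer matches."""
    for i, (block, expected) in enumerate(zip(blocks, trusted_leaves)):
        if _h(b"\x00" + block) != expected:
            return i
    return None


if __name__ == "__main__":
    blocks = [bytes([i]) * 64 for i in range(5)]
    trusted_root = merkle_root(blocks)
    trusted_leaves = leaf_hashes(blocks)

    corrupted = list(blocks)
    corrupted[3] = flip_bit(blocks[3], 17)  # one flipped bit in block 3

    print("root still matches:", merkle_root(corrupted) == trusted_root)
    print("corrupt block index:", first_corrupt_block(corrupted, trusted_leaves))
```

Note that the stored hashes themselves live in the same unprotected RAM, and every read of a protected structure now costs hashing work. That is the CPU overhead the thread is pointing at, and why ECC lets you be more selective about what gets verified this way.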