Yes, it’s easy to get bitten because it’s a behavior of the entire system. Not something you can catch with unit tests or even integration tests. Some common contributors to failure modes here:
* queues of unbounded length
* timeouts that are set too high
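To make the two contributors above concrete, here is a minimal sketch in Python of the opposite choices: a queue with a hard size limit and a short enqueue timeout, so that a slow consumer causes visible rejections instead of silent backlog growth. The helper names (`make_bounded_queue`, `try_enqueue`, `demo`) and the specific limits are assumptions for illustration only, not anything from the thread, and sensible values depend entirely on the system.

```python
import queue


def make_bounded_queue(max_items: int) -> queue.Queue:
    # A maxsize > 0 makes put() block (or fail) once the queue is full,
    # instead of letting the backlog grow without limit.
    return queue.Queue(maxsize=max_items)


def try_enqueue(q: queue.Queue, item, timeout_s: float = 0.05) -> bool:
    # Wait only briefly for space. If the consumer is not keeping up,
    # report failure so the caller can shed load or push back upstream.
    try:
        q.put(item, timeout=timeout_s)
        return True
    except queue.Full:
        return False


def demo():
    # Hypothetical scenario: capacity of 2 and no consumer running,
    # so the third and fourth items are rejected after the short timeout.
    q = make_bounded_queue(2)
    results = [try_enqueue(q, i, timeout_s=0.01) for i in range(4)]
    return results, q.qsize()


if __name__ == "__main__":
    print(demo())
```

The point of returning a boolean rather than blocking forever is that the overload becomes something the caller has to handle and something that can be counted and watched.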
-
Thanks for the tips, I know this is your area! TBH I think I may even need to start a little closer to fundamentals with instrumentation. What I have built previously and what I'm trying to build now have such *vastly* different costs in terms of failing to log and watch.
-
I assume you’re not talking about vacuum systems.
-
Probably similar concept, just for data flow instead.
-
oh man. first time you encounter someone whose answer to every performance problem is "just throw it on a queue and process it asynchronously!" you realise how bad it can get...
-
For me that encounter took place with myself several weeks ago haha. "Just sprinkle Kafka on it" does not a robust system make, apparently.
-
Total solidarity. We're lucky on LHCb to have an awesome online team to deal with this aspect of things, especially the system monitoring.
-
I can't imagine that level of data!
-
I think this is currently something you can't get a framework for - even if you use e.g. Finagle, your team's lack of operational experience means you won't get the full benefit of it. (Please read this as "yes, it's difficult" rather than "haha you're screwed")