How to lose data:
1. A problem process eats disk space
2. Your email alert threshold is at 10% free
3. Your paging (wake me up) threshold is at 5% free
4. The ext4 reserved blocks are the default 5%

Woke up to FS at 5.01% free, a pile of <10% alert emails, and lost data.
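To spell out why those numbers collide, here is a rough sketch. It assumes the monitored "free" figure counts the ext4 reserved blocks as free (which is what the 5.01% reading suggests); the function names and defaults are made up for illustration.

```python
import os


def free_fractions(path="/"):
    """Return (free incl. reserved blocks, free for unprivileged users) as fractions."""
    st = os.statvfs(path)
    # f_bfree counts root-reserved blocks as free, f_bavail does not.
    return st.f_bfree / st.f_blocks, st.f_bavail / st.f_blocks


def usable_fraction(free_fraction, reserved_fraction=0.05):
    """Space a non-root process can still write, given a 'free' figure that includes the reserve."""
    return max(0.0, free_fraction - reserved_fraction)


def pages(free_fraction, page_threshold=0.05):
    """Hypothetical paging rule: fire only when free space drops below the threshold."""
    return free_fraction < page_threshold


# At 5.01% free the pager stays quiet, yet non-root writers have
# roughly 0.01% of the filesystem left before they hit ENOSPC.
reading = 0.0501
print(pages(reading), usable_fraction(reading))
```

Alerting on the `f_bavail`-style number, or setting the paging threshold above the reserve, would avoid that blind spot.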
The cronjob didn't actually fail, it was stuck doing backlog cleanup of months of *metadata* for hours, because the service it speaks to is crap and takes forever. I found out about the broken metadata cleanup yesterday and fixed it... But that took longer than expected.
Also, the metadata cleanup seems to have triggered some races causing data not to be properly staged for moving to another FS by another cronjob. Basically this whole thing's a mess and I hate it, but the final running out of disk space unnoticed issue was my fault.
Basically this thing has bitten me in the ass several times, and I already have several wake-me-up-level alerts for stuff going wrong. It just found *another* way to fail undetected :(
End of conversation
New conversation -
Silent cronjob failures have broken my stuff too many times. I built an HTTP service which updates a key in redis with a timestamp, then use Prometheus to scrape matching keys, and alert if `now() - timestamp > threshold`. Run a curl in cron scripts on success, problem solved!
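The alert condition itself is small. A minimal sketch of the `now() - timestamp > threshold` check in plain Python, standing in for the Prometheus rule; the job names and the dict layout are hypothetical, not the actual redis schema:

```python
import time


def stale_jobs(last_success, threshold_seconds, now=None):
    """Return the jobs whose last success timestamp is older than the threshold.

    last_success maps a job name to the Unix time of its last successful run.
    """
    now = time.time() if now is None else now
    return sorted(
        job for job, ts in last_success.items() if now - ts > threshold_seconds
    )


# Hypothetical data: "backup" last succeeded 4000 s ago, "cleanup" 1000 s ago.
print(stale_jobs({"backup": 1000, "cleanup": 4000}, 3600, now=5000))
```

The useful property is that a job that hangs or never runs looks the same as one that fails: the timestamp just stops moving.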
There are much easier ways of doing that. Either use the Prometheus Pushgateway, or use the textfile collector in node_exporter and just dump timestamps to files in your cronjobs. Been there, done that :)
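For the textfile collector route, something like the following could run at the end of a cron script on success. This is a sketch under assumptions: the metric name, label name, and directory are invented here, and the directory must be the one node_exporter is pointed at via `--collector.textfile.directory`.

```python
import os
import tempfile
import time


def write_success_timestamp(job, directory, now=None):
    """Write <job>.prom with the time of the last successful run, atomically."""
    now = time.time() if now is None else now
    metric = "cronjob_last_success_timestamp_seconds"  # hypothetical metric name
    body = (
        f"# HELP {metric} Unix time of the last successful run.\n"
        f"# TYPE {metric} gauge\n"
        f'{metric}{{cronjob="{job}"}} {now:.0f}\n'
    )
    final_path = os.path.join(directory, f"{job}.prom")
    # Write to a temp file in the same directory, then rename, so the
    # collector never reads a half-written file (it only picks up *.prom).
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=f".{job}.", suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(body)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; node_exporter must be able to read it
    os.replace(tmp_path, final_path)
    return final_path
```

An alert on `time() - cronjob_last_success_timestamp_seconds > threshold` then covers the job that silently never finishes.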
New conversation -
Have you considered triggering it with inotify or something? So something runs when a file is dropped, or the FS has X free space left. Though really, this is just moving the problem.
Eh, it's fine if it runs hourly. The problem here was the alerting; this was a case of "something was broken by consequence of human action and a human should fix it".
End of conversation