How to lose data:
1. A problem process eats disk space
2. Your email alert threshold is at 10% free
3. Your paging (wake me up) threshold is at 5% free
4. The ext4 reserved blocks are the default 5%

Woke up to FS at 5.01% free, a pile of <10% alert emails, and lost data.
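The trap is that ext4's reserved blocks are invisible in the usual "percent free" number: unprivileged processes hit ENOSPC while the filesystem still reports ~5% free, so a pager keyed to 5% never fires. A minimal sketch of the distinction using Python's `os.statvfs` (the mount point is a placeholder):

```python
import os

def disk_free(path="/"):  # placeholder; point at the affected mount
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    # f_bfree counts ALL free blocks, including the root-reserved ones;
    # f_bavail counts only blocks available to unprivileged processes.
    free_incl_reserved = st.f_bfree * st.f_frsize
    free_unprivileged = st.f_bavail * st.f_frsize
    print(f"free (incl. reserved): {100 * free_incl_reserved / total:.2f}%")
    print(f"free (unprivileged):   {100 * free_unprivileged / total:.2f}%")

disk_free()
```

With the default 5% reservation, the first number can read 5.01% while the second is effectively 0% — exactly the gap this thread fell into.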
Replying to @marcan42
Is there an "ohshitohfuck" filesystem that starts moving the least-surprising data over to another drive, or compressing data, when that happens? Ideally, moving it all back later when space frees up.
Replying to @mhlkong
So what caused this problem was actually that the cronjob in charge of copying data to another filesystem and deleting old data wasn't working...
Replying to @marcan42
oof! Only suggestion I've gotten about this on linux machines is to use separate mounts for each application, so at least it can't kill the box. :/ Failures from that kind of cronjob should probably be higher priority. Maybe after a few in a row. Like server 911, or something.
Replying to @mhlkong
This *was* a separate mount, so it indeed only killed the app. But the app was storing data that only comes in once, so several hours of data are now lost.
The sad thing is there was 1.5TB of wasted space on that filesystem, because it used to be the only storage for the app, but then I switched to automatically moving data elsewhere... But I left the old files around when I did so months ago, intended to delete them, never did :/
The cronjob didn't actually fail, it was stuck doing backlog cleanup of months of *metadata* for hours, because the service it speaks to is crap and takes forever. I found out about the broken metadata cleanup yesterday and fixed it... But that took longer than expected.
Silent cronjob failures have broken my stuff too many times. I built an HTTP service that updates a key in redis with a timestamp, then have Prometheus scrape the matching keys and alert if `now() - timestamp > threshold`. Run a curl in cron scripts on success, problem solved!
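A minimal sketch of that dead-man's-switch heartbeat service, assuming the redis-py client; the port, URL scheme, and `heartbeat:` key prefix are illustrative, not from the thread:

```python
# Heartbeat endpoint: cronjobs curl it on success; Prometheus alerts
# when the recorded timestamp goes stale.
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

class Heartbeat(BaseHTTPRequestHandler):
    def do_POST(self):
        # e.g. POST /beat/nightly-backup -> key "heartbeat:nightly-backup"
        job = self.path.rsplit("/", 1)[-1]
        r.set(f"heartbeat:{job}", int(time.time()))
        self.send_response(204)  # success, no body needed
        self.end_headers()

HTTPServer(("", 8080), Heartbeat).serve_forever()
```

The cron side is then just `... && curl -XPOST http://host:8080/beat/nightly-backup`, and the alert is the `now() - timestamp > threshold` comparison over whatever exporter turns those keys into metrics.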
There are much easier ways of doing that. Either use the prometheus pushgateway, or use the textfile collector in node_exporter and just dump timestamps to files in your cronjobs. Been there done that :)
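The textfile-collector variant is indeed simpler: the cronjob drops a timestamp metric into a file that node_exporter picks up. A sketch, assuming node_exporter runs with `--collector.textfile.directory` pointed at the directory below; the directory path and metric name are illustrative:

```python
import os
import time

TEXTFILE_DIR = "/var/lib/node_exporter/textfile"  # assumed collector directory

def record_success(job: str) -> None:
    # Prometheus text exposition format: metric{labels} value
    metric = f'cronjob_last_success_timestamp_seconds{{job="{job}"}} {int(time.time())}\n'
    tmp = os.path.join(TEXTFILE_DIR, f"{job}.prom.tmp")
    final = os.path.join(TEXTFILE_DIR, f"{job}.prom")
    with open(tmp, "w") as f:
        f.write(metric)
    os.rename(tmp, final)  # atomic swap so node_exporter never reads a partial file

record_success("nightly-backup")
```

The matching alert expression is then along the lines of `time() - cronjob_last_success_timestamp_seconds > 86400`.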
True! My biggest annoyance is the lack of real authentication in Pushgateway. If I have to run an authenticating reverse proxy, I may as well just write my own service and have client TLS or token-based auth. node_exporter is great, but less feasible for Kubernetes CronJobs. :)
Replying to @Frank_Petrilli @marcan42
I already run nginx in front of so many things, it's almost always easier for me to do that. Super setup-dependent, I guess.