How to lose data:
1. A problem process eats disk space
2. Your email alert threshold is at 10% free
3. Your paging (wake me up) threshold is at 5% free
4. The ext4 reserved blocks are the default 5%.
Woke up to FS at 5.01% free, a pile of <10% alert emails, and lost data.
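A rough sketch of why those numbers collide, assuming the monitoring computed "free" from the raw free-block count (which includes the root-reserved blocks) rather than from the blocks an unprivileged process can actually use. The function names and the 10%/5% defaults below are illustrative, taken from the thresholds in the tweet, and are not from any particular monitoring tool. `os.statvfs` reports both counts: `f_bfree` (all free blocks) and `f_bavail` (free blocks available to non-root).

```python
import os


def free_percentages(total_blocks, free_blocks, avail_blocks):
    """Return (raw_free_pct, usable_free_pct) for a filesystem.

    raw_free_pct counts root-reserved blocks as free;
    usable_free_pct counts only what a non-root process can write to.
    """
    if total_blocks <= 0:
        return 0.0, 0.0
    raw = 100.0 * free_blocks / total_blocks
    usable = 100.0 * avail_blocks / total_blocks
    return raw, usable


def classify(free_pct, warn_pct=10.0, page_pct=5.0):
    """Map a free-space percentage to a hypothetical alert level."""
    if free_pct <= page_pct:
        return "page"
    if free_pct <= warn_pct:
        return "warn"
    return "ok"


def check_mount(path, warn_pct=10.0, page_pct=5.0):
    """Classify a mounted filesystem by its *usable* free space."""
    st = os.statvfs(path)
    raw, usable = free_percentages(st.f_blocks, st.f_bfree, st.f_bavail)
    return classify(usable, warn_pct, page_pct), raw, usable


# The situation from the tweet: 5.01% raw free, 5% of blocks reserved.
raw, usable = free_percentages(10000, 501, 500 and 1)
print(classify(raw), classify(usable))
```

With 5.01% raw free and 5% reserved, the raw figure only reaches the email threshold while the usable figure is already at the paging threshold, so alerting on `f_bavail` (or keeping the paging threshold well above the reserved percentage) would have fired before the app ran out of space.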
Replying to @marcan42
Is there an "ohshitohfuck" filesystem, that starts moving the least-surprising data over to another drive, or compressing data when that happens? Ideally, moving it all back later when space frees.
Replying to @mhlkong
So what caused this problem was actually the cronjob in charge of copying data to another filesystem and deleting old data not working...
Replying to @marcan42
oof! Only suggestion I've gotten about this on linux machines is to use separate mounts for each application, so at least it can't kill the box. :/ Failures from that kind of cronjob should probably be higher priority. Maybe after a few in a row. Like server 911, or something.
Replying to @mhlkong
This *was* a separate mount, so it indeed only killed the app. But the app was storing data that only comes in once, so several hours of data are now lost.
The sad thing is there was 1.5TB of wasted space on that filesystem, because it used to be the only storage for the app, but then I switched to automatically moving data elsewhere... But I left the old files around when I did so months ago, intended to delete them, never did :/
The cronjob didn't actually fail, it was stuck doing backlog cleanup of months of *metadata* for hours, because the service it speaks to is crap and takes forever. I found out about the broken metadata cleanup yesterday and fixed it... But that took longer than expected.
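One way to catch a job that is stuck rather than failed is to alert on how long it has been since the last successful run, instead of on exit status. This is only a sketch under assumptions: the job touches a stamp file when it finishes, a separate check looks at the file's age, and the two-hour limit is a guess based on an hourly schedule. The function names and the stamp-file approach are hypothetical, not what was actually running here.

```python
import os
import time


def mark_success(stamp_path):
    """Called by the job as its last step: create/touch the stamp file."""
    with open(stamp_path, "a"):
        pass
    os.utime(stamp_path, None)  # set mtime to the current time


def seconds_since_success(stamp_path, now=None):
    """Age of the last successful run; infinite if it never succeeded."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(stamp_path)
    except FileNotFoundError:
        return float("inf")


def job_is_stale(stamp_path, max_age_s=2 * 3600, now=None):
    """True if the job has not completed within max_age_s seconds.

    A hung job never reaches mark_success(), so it goes stale
    even though it never exits with an error.
    """
    return seconds_since_success(stamp_path, now) > max_age_s
```

A monitoring check would call `job_is_stale(...)` on its own schedule and page when it returns `True`, which covers both "crashed" and "still running hours later".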
Replying to @marcan42
Have you considered triggering it with inotify or something? So something runs when a file is dropped, or the FS has X free space left. Though really, this is just moving the problem.
Eh, it's fine if it runs hourly. The problem here was the alerting; this was a case of "something was broken as a consequence of human action and a human should fix it".