How to lose data:
1. A problem process eats disk space
2. Your email alert threshold is at 10% free
3. Your paging (wake me up) threshold is at 5% free
4. The ext4 reserved blocks are the default 5%.

Woke up to FS at 5.01% free, a pile of <10% alert emails, and lost data.
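A rough sketch of why the page never fired, assuming the monitor computed "free" from the raw free block count. A non-root process can only use the unreserved blocks, so with a 5% reserve raw free space bottoms out just above 5% and never crosses a 5% paging threshold. The function names and default thresholds here are illustrative, not taken from any real monitoring setup; `os.statvfs` exposes both counts as `f_bfree` (includes reserved blocks) and `f_bavail` (what unprivileged processes can actually use).

```python
import os


def raw_free_percent(total_blocks, free_blocks):
    """Free space as a share of the whole filesystem, reserved blocks included."""
    return 100.0 * free_blocks / total_blocks if total_blocks else 0.0


def avail_percent(total_blocks, free_blocks, avail_blocks):
    """Space unprivileged processes can still use, as a share of non-reserved space."""
    used = total_blocks - free_blocks
    usable = used + avail_blocks
    return 100.0 * avail_blocks / usable if usable else 0.0


def alert_level(percent, warn=10.0, page=5.0):
    """Map a free-space percentage to a hypothetical alert tier."""
    if percent <= page:
        return "page"
    if percent <= warn:
        return "email"
    return "ok"


def check_mount(path):
    """Return (raw free %, available-to-non-root %) for a mounted filesystem."""
    st = os.statvfs(path)
    return (
        raw_free_percent(st.f_blocks, st.f_bfree),
        avail_percent(st.f_blocks, st.f_bfree, st.f_bavail),
    )


if __name__ == "__main__":
    raw, avail = check_mount("/")
    print(f"raw free: {raw:.2f}% -> {alert_level(raw)}")
    print(f"available: {avail:.2f}% -> {alert_level(avail)}")
```

With made-up numbers matching the story (10000 blocks, 501 free, 500 of those reserved), the raw figure is 5.01% and only sends email, while the available figure is about 0.01% and would have paged.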
Replying to @marcan42
Is there an "ohshitohfuck" filesystem, that starts moving the least-surprising data over to another drive, or compressing data when that happens? Ideally, moving it all back later when space frees.
Replying to @mhlkong
So what caused this problem was actually the cronjob in charge of copying data to another filesystem and deleting old data not working...
Replying to @marcan42
oof! Only suggestion I've gotten about this on linux machines is to use separate mounts for each application, so at least it can't kill the box. :/ Failures from that kind of cronjob should probably be higher priority. Maybe after a few in a row. Like server 911, or something.
Replying to @mhlkong
This *was* a separate mount, so it indeed only killed the app. But the app was storing data that only comes in once, so several hours of data are now lost.
The sad thing is there was 1.5TB of wasted space on that filesystem, because it used to be the only storage for the app, but then I switched to automatically moving data elsewhere... But I left the old files around when I did so months ago, intended to delete them, never did :/
The cronjob didn't actually fail, it was stuck doing backlog cleanup of months of *metadata* for hours, because the service it speaks to is crap and takes forever. I found out about the broken metadata cleanup yesterday and fixed it... But that took longer than expected.
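One way to catch a job that is stuck rather than failed is a heartbeat file that the job touches only after it actually moves data, plus a separate check that alerts when the file goes stale. This is a minimal sketch under that assumption, not what the setup in this thread does; the file path and the age limit are hypothetical.

```python
import os
import time


def touch(path):
    """Create the heartbeat file if needed and set its mtime to now."""
    with open(path, "a"):
        pass
    os.utime(path, None)


def heartbeat_age(path, now=None):
    """Seconds since the heartbeat was last touched, or None if it is missing."""
    now = time.time() if now is None else now
    try:
        return now - os.stat(path).st_mtime
    except FileNotFoundError:
        return None


def is_stale(path, max_age_seconds, now=None):
    """True if the job has not reported progress recently (or ever)."""
    age = heartbeat_age(path, now)
    return age is None or age > max_age_seconds


if __name__ == "__main__":
    # Hypothetical path and limit; the mover job would call touch() after each
    # successful batch, and a separate cron entry would run this check.
    hb = "/var/run/data-mover.heartbeat"
    if is_stale(hb, max_age_seconds=2 * 3600):
        print("ALERT: data mover has made no progress in over 2 hours")
```

The point of touching the file after real progress, not at job start, is that a job spending hours on backlog cleanup still looks alive to a plain exit-code check but goes stale here.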
Also, the metadata cleanup seems to have triggered some races causing data not to be properly staged for moving to another FS by another cronjob. Basically this whole thing's a mess and I hate it, but the final running out of disk space unnoticed issue was my fault.
Basically this thing has bit me in the ass several times, and I already have several wake me up level alerts for stuff going wrong. It just found *another* way to fail undetected :(