How to lose data:
1. A problem process eats disk space
2. Your email alert threshold is at 10% free
3. Your paging (wake me up) threshold is at 5% free
4. The ext4 reserved blocks are the default 5%.

Woke up to FS at 5.01% free, a pile of <10% alert emails, and lost data.
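A rough sketch of why those numbers collide, assuming the monitoring compared its thresholds against raw free space (the thread does not say which figure it used). On ext4, the reserved blocks count as "free" but are not writable by unprivileged processes, so a 5% paging threshold on raw free space only trips once non-root writes are already failing. `os.statvfs` exposes both figures: `f_bfree` includes the reserve, `f_bavail` does not. The function names and the idea of alerting on `f_bavail` are illustrative, not taken from the original setup.

```python
import os


def free_percentages(path="/"):
    """Return (raw_free_pct, usable_free_pct) for the filesystem holding path.

    raw_free_pct counts root-reserved blocks as free (f_bfree);
    usable_free_pct counts only blocks an unprivileged process can use (f_bavail).
    """
    st = os.statvfs(path)
    if st.f_blocks == 0:
        raise ValueError(f"{path!r} reports zero total blocks")
    raw = 100.0 * st.f_bfree / st.f_blocks
    usable = 100.0 * st.f_bavail / st.f_blocks
    return raw, usable


def alert_level(usable_free_pct, warn_pct=10.0, page_pct=5.0):
    """Map usable free space to an alert level.

    The 10% / 5% defaults mirror the thresholds in the thread; applying them
    to usable space rather than raw space is the assumption being illustrated.
    """
    if usable_free_pct <= page_pct:
        return "page"
    if usable_free_pct <= warn_pct:
        return "email"
    return "ok"


if __name__ == "__main__":
    raw, usable = free_percentages("/")
    print(f"raw free: {raw:.2f}%  usable free: {usable:.2f}%  -> {alert_level(usable)}")
```

With a 5% reserve, a filesystem at 5.01% raw free has about 0.01% usable, which `alert_level` would treat as a page rather than just an email.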
-
The cronjob didn't actually fail; it was stuck for hours doing backlog cleanup of months of *metadata*, because the service it speaks to is crap and takes forever. I found out about the broken metadata cleanup yesterday and fixed it... But that took longer than expected.
-
Also, the metadata cleanup seems to have triggered some races, so data wasn't properly staged for another cronjob to move it to another FS. Basically this whole thing's a mess and I hate it, but the final issue, running out of disk space unnoticed, was my fault.