Mailchimp's Mandrill App Suffers Service Outage, Company Says. "Transaction ID Wraparound issue", and the database "in read-only mode until offline maintenance (known as vacuuming) can occur" mediapost.com/publications/a
I've seen things like this happen before. Usually something like a buggy hand-rolled upsert procedure (with improper handling of duplicate violation) is involved. That can burn through transaction IDs very quickly, without a true spike in writes.
That speculation could be wrong, of course -- it's just something I've observed during similar production outages. Even if I'm right, that's not a great failure mode.
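
To put rough numbers on how quickly a retry-happy upsert could burn through transaction IDs, here is a back-of-the-envelope sketch in Python. It assumes that every failed insert attempt inside the procedure consumes its own XID, and that nothing gets frozen in the meantime (that is, vacuum is not keeping up). The function names and the example rates are hypothetical and are not taken from the Mandrill incident.

```python
# Rough model, not a measurement: time until a PostgreSQL cluster runs out
# of usable transaction IDs at a steady consumption rate.

XID_SPACE = 2**32              # XIDs are 32-bit counters
USABLE_XIDS = XID_SPACE // 2   # about 2.1 billion XIDs can be "in the past"


def xid_burn_rate(calls_per_second: float, xids_per_call: int) -> float:
    """XIDs consumed per second by a procedure (hypothetical helper)."""
    return calls_per_second * xids_per_call


def seconds_until_wraparound(xids_per_second: float, oldest_xid_age: int = 0) -> float:
    """Seconds until the oldest unfrozen XID would be USABLE_XIDS old.

    Assumes no freezing happens along the way, which is the worst case.
    """
    if xids_per_second <= 0:
        raise ValueError("xids_per_second must be positive")
    remaining = max(USABLE_XIDS - oldest_xid_age, 0)
    return remaining / xids_per_second


if __name__ == "__main__":
    # Made-up workload: 5,000 upsert calls per second.
    for xids_per_call in (1, 3):  # 3 = one call plus two failed retries (assumption)
        rate = xid_burn_rate(5_000, xids_per_call)
        days = seconds_until_wraparound(rate) / 86_400
        print(f"{xids_per_call} XID(s) per call: about {days:.1f} days of headroom")
```

With these made-up rates, one XID per call leaves roughly five days of headroom from a freshly frozen cluster, and three per call cuts that to under two days, even though the number of rows actually written has not changed.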
I would guess that they either disabled autovacuum, or had some other operation (like manual locking) going on that was preventing autovacuum from completing.
Sounds very plausible. In cases like this, it's often a combination of something like a TRUNCATE cron job and an anti-wraparound autovacuum (not a regular autovacuum). Often it's an unlucky confluence of issues that are individually not very problematic. Good example: joyent.com/blog/manta-pos
Why did somebody write a cron job to periodically TRUNCATE the events table? To stop getting those annoying anti-wraparound vacuums every few weeks!
I've seen a cron job that explicitly queried pg_stat_activity, and then did a pg_terminate_backend() on "(to prevent xid wraparound)" autovacuums because they were "causing so much I/O."
That's something that I haven't seen, and cannot one-up :-)
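
Rather than killing "(to prevent xid wraparound)" autovacuums, a cron job could watch how close each database is to needing them. The sketch below is one hedged way to do that in Python. The catalog query and the 200 million default for autovacuum_freeze_max_age are standard PostgreSQL; the classify function, the status labels, and the 50% "critical" threshold are assumptions made up for illustration.

```python
from typing import Iterable, List, Tuple

# Standard catalog query; run it with whatever client you already use.
XID_AGE_QUERY = "SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC"

AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000  # PostgreSQL default for this setting
WRAPAROUND_LIMIT = 2**31                 # approximate ceiling on XID age


def classify(rows: Iterable[Tuple[str, int]],
             critical_fraction: float = 0.5) -> List[Tuple[str, str]]:
    """Label each (datname, xid_age) row; labels are illustrative only."""
    critical_age = critical_fraction * WRAPAROUND_LIMIT
    result = []
    for datname, xid_age in rows:
        if xid_age >= critical_age:
            status = "critical"      # far past the point where freezing was due
        elif xid_age >= AUTOVACUUM_FREEZE_MAX_AGE:
            status = "freeze-due"    # anti-wraparound autovacuum should be running
        else:
            status = "ok"
        result.append((datname, status))
    return result


if __name__ == "__main__":
    # Hypothetical rows, shaped like the output of XID_AGE_QUERY.
    sample = [("events", 1_500_000_000), ("billing", 250_000_000), ("postgres", 1_000)]
    for datname, status in classify(sample):
        print(f"{datname}: {status}")
```

A database that stays in "freeze-due" for long is a hint that something (a terminated autovacuum, a conflicting lock, a long-running transaction) is stopping the anti-wraparound vacuum from finishing.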