Mailchimp's Mandrill App Suffers Service Outage, Company Says. "Transaction ID Wraparound issue", and the database "in read-only mode until offline maintenance (known as vacuuming) can occur" mediapost.com/publications/a
I've seen things like this happen before. Usually something like a buggy hand-rolled upsert procedure (with improper handling of duplicate violation) is involved. That can burn through transaction IDs very quickly, without a true spike in writes.
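To put "very quickly" in rough numbers, here is a back-of-the-envelope sketch. It assumes the usual 32-bit XID space, where about 2^31 transactions fit between "now" and wraparound. The function name and the sample rates are made up for illustration; nothing here comes from the Mandrill incident itself.

```python
# Rough arithmetic only: how long a cluster lasts at a steady XID burn rate
# if nothing ever gets frozen. All names and rates here are illustrative.

XID_HORIZON = 2**31  # about 2.1 billion XIDs between "now" and wraparound


def days_until_wraparound(xids_per_second: float, xids_already_used: int = 0) -> float:
    """Days until the XID horizon is reached at a constant consumption rate."""
    if xids_per_second <= 0:
        raise ValueError("xids_per_second must be positive")
    remaining = max(XID_HORIZON - xids_already_used, 0)
    return remaining / xids_per_second / 86_400  # 86,400 seconds per day


if __name__ == "__main__":
    # Hypothetical burn rates, e.g. from a retry loop that consumes an XID per attempt.
    for rate in (100, 1_000, 10_000):
        print(f"{rate:>6} XIDs/s -> {days_until_wraparound(rate):.1f} days")
```

The point is that the count that matters is XIDs consumed, not rows written, so a loop that fails and retries can shrink the window from months to days without any visible growth in data.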
That speculation could be wrong, of course -- it's just something I've observed during similar production outages. Even if I'm right, that's not a great failure mode.
I would guess that they either disabled autovacuum, or had some other operation (like manual locking) going on that was preventing autovacuum from completing.
Sounds very plausible. In cases like this, it's often a combination of something like a TRUNCATE cron job and anti-wraparound autovacuum (not regular autovacuum). Often it's an unlucky confluence of issues that are individually not very problematic. Good example: joyent.com/blog/manta-pos
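For anyone unsure where anti-wraparound autovacuum differs from the regular kind, a minimal sketch of the thresholds involved. It assumes the default autovacuum_freeze_max_age of 200 million. The stop margin is an assumption that varies by PostgreSQL version, so it is a parameter here. The function and the status labels are hypothetical, not anything PostgreSQL exposes.

```python
# Sketch of how a table's XID age maps to wraparound-related states.
# Status strings and the function name are illustrative only.

XID_HORIZON = 2**31                   # about 2.1 billion XIDs
DEFAULT_FREEZE_MAX_AGE = 200_000_000  # default autovacuum_freeze_max_age


def wraparound_status(
    xid_age: int,
    freeze_max_age: int = DEFAULT_FREEZE_MAX_AGE,
    stop_margin: int = 3_000_000,  # assumption: version-dependent safety margin
) -> str:
    """Classify an age(relfrozenxid)-style value into a coarse status."""
    if xid_age < 0:
        raise ValueError("xid_age must be non-negative")
    if xid_age >= XID_HORIZON - stop_margin:
        # Past this point the server stops assigning new XIDs until a vacuum succeeds.
        return "refusing-writes"
    if xid_age >= freeze_max_age:
        # An anti-wraparound autovacuum is forced, even if autovacuum is disabled.
        return "anti-wraparound-vacuum-due"
    return "ok"


if __name__ == "__main__":
    for age in (150_000_000, 250_000_000, 2**31 - 1_000_000):
        print(f"{age:>13,} -> {wraparound_status(age)}")
```

The gap between the second and first thresholds is large by default, which is why it usually takes something else blocking or cancelling the forced vacuum for a long time before writes actually stop.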
Why did somebody write a cron job to periodically TRUNCATE the events table? To stop getting those annoying anti-wraparound vacuums every few weeks!
I've seen a cron job that explicitly queried pg_stat_activity, and then did a pg_terminate_backend() on "(to prevent wraparound)" autovacuums because they were "causing so much I/O."
That's something that I haven't seen, and cannot one-up :-)
Yeah, same here. It's one of the new versions of the good old "replication stops working when my scheduled job to drop and recreate tables (to make sure they are efficient) runs" from back in the trigger-based replication days.