I can’t tell if this is a crazy amount of storage space for 30 days of Twitter firehose or not.
49 tebibytes / (30 × 750 million tweets) ≈ 2.4 kB per tweet.
That seems a bit high, but within the realm of reason.
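For anyone who wants to check the arithmetic, here is a quick sketch. It assumes the two figures quoted in the thread (49 TiB total and roughly 750 million tweets per day); the variable names are just for illustration.

```python
# Back-of-the-envelope check of the per-tweet size estimate.
TEBIBYTE = 2**40  # bytes in one tebibyte

total_bytes = 49 * TEBIBYTE    # storage figure quoted in the thread
tweets_per_day = 750_000_000   # assumed daily volume, as quoted in the thread
days = 30

bytes_per_tweet = total_bytes / (tweets_per_day * days)
print(f"{bytes_per_tweet:.0f} bytes per tweet")
print(f"{bytes_per_tweet / 1000:.2f} kB, {bytes_per_tweet / 1024:.2f} KiB")
```

That works out to roughly 2,394 bytes per tweet, so "2.4 kB" holds if you mean decimal kilobytes; in KiB it is closer to 2.34.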
They have user-uploaded images and videos, so it's not surprising that there's a lot more than a tiny bit of text with a bunch of metadata.
They might even include the scraped OpenGraph content and so on, if it actually matches what's available from the public site.
There’s probably a lot of metadata encoded in the Thrift payload, if everything must be replayable from there.
The list of user IDs of people mentioned, maybe cached translations?
It’s a bit unclear whether they dropped a Thrift object in S3, or whether they enabled API rights.