Currently ingesting 5,000 tweets per minute (7,200,000 per day). I'll let this run for four weeks to collect ~200,000,000 politics-related tweets. Then we'll see just how many bots are messing up Twitter.
#datascience #bigdata #datasets
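A quick sketch of the arithmetic behind those figures, taking only the 5,000 tweets/minute rate as input:

```python
# Back-of-the-envelope check of the ingestion figures quoted above.
TWEETS_PER_MINUTE = 5_000

tweets_per_day = TWEETS_PER_MINUTE * 60 * 24   # 7,200,000
tweets_per_run = tweets_per_day * 7 * 4        # 201,600,000 over four weeks (~200M)

print(f"per day:        {tweets_per_day:,}")
print(f"per four weeks: {tweets_per_run:,}")
```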
Can you estimate the size of the dataset? Is it something you could make publicly available? Host it on for example? :)
I can make it available for "academic use," so just say it's for academic study. Once it's complete, I'll make an announcement with a link to the data. Also, if you know of a source where I can get a list of all Twitter accounts for all senators, reps, and governors, that ...
Oh, I didn't answer your original question. My guess is that the dataset would be between 20 and 40 gigs compressed. Somewhere around 250 gigs uncompressed?
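For what it's worth, those two estimates hang together; here's a sketch where the per-tweet size and compression ratio are my own assumptions, picked to match the quoted numbers:

```python
# Rough consistency check on the size estimates in the reply above.
# The per-tweet size and compression ratio are assumptions on my part,
# not the author's figures.
TWEETS = 200_000_000
BYTES_PER_TWEET = 1_250        # assumed average size of one raw tweet JSON record

uncompressed_gb = TWEETS * BYTES_PER_TWEET / 1e9       # ~250 GB
low, high = uncompressed_gb / 12, uncompressed_gb / 6  # assuming a 6-12x gzip ratio

print(f"uncompressed: ~{uncompressed_gb:.0f} GB")
print(f"compressed:   ~{low:.0f}-{high:.0f} GB")
```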
That's slightly bigger than our current limit (10 Gigs), but you could either split it into two datasets or hit me up and I can help you set up an organization that can upload bigger datasets.
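If the split route is taken, here's a minimal sketch of chunking one big compressed dump so each piece fits under the upload cap; the filename and helper are hypothetical, and the 10-gig cap is the figure from the reply above:

```python
# Hypothetical chunker for the "split it" option: writes consecutive
# .partNN files, each no larger than the upload cap.
# Reassemble afterwards with `cat FILE.part* > FILE`.
CAP = 10 * 1024**3        # 10 GiB upload cap, per the reply above
BUF = 64 * 1024**2        # 64 MiB read buffer

def split_file(src_path: str, cap: int = CAP) -> int:
    part = 0
    with open(src_path, "rb") as src:
        buf = src.read(BUF)
        while buf:
            part += 1
            with open(f"{src_path}.part{part:02d}", "wb") as out:
                written = 0
                # keep appending buffers while the next one still fits under the cap
                while buf and written + len(buf) <= cap:
                    out.write(buf)
                    written += len(buf)
                    buf = src.read(BUF)
    return part

if __name__ == "__main__":
    print(split_file("political_tweets.jsonl.gz"))  # hypothetical archive name
```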
You and I and as a whole should get in touch -- I have some really awesome datasets coming up in about two weeks along with the monthly Reddit dumps.


