Currently ingesting 5,000 tweets per minute (7,200,000 per day). I'll let this run for four weeks to collect ~200,000,000 politically related tweets. Then we'll see just how many bots are messing up Twitter.
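The throughput figures in the tweet can be checked with back-of-the-envelope arithmetic; a minimal sketch using only the numbers quoted above (5,000 tweets/minute, a four-week window):

```python
# Sanity check of the ingestion numbers quoted in the tweet.
TWEETS_PER_MINUTE = 5_000

tweets_per_day = TWEETS_PER_MINUTE * 60 * 24   # minutes/hour * hours/day
tweets_in_four_weeks = tweets_per_day * 28     # 28-day collection window

print(f"{tweets_per_day:,} tweets/day")
print(f"{tweets_in_four_weeks:,} tweets in four weeks")
```

This gives 7,200,000 tweets/day and 201,600,000 over four weeks, consistent with the ~200M figure claimed.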
#datascience #bigdata #datasets
Can you estimate the size of the dataset? Is it something you could make publicly available? Host it on for example? :)
I can make it available for "academic use," so just say it's for academic study. Once it's complete, I'll make an announcement with a link to the data. Also, if you know of a source where I can get a list of the Twitter accounts of all senators, reps, and governors, that ...
Oh, I didn't answer your original question. My guess is that the dataset would be between 20 and 40 gigs compressed. Somewhere around 250 gigs uncompressed?
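The size guess above can be restated as a compression ratio; a quick check using only the figures quoted in the thread (20-40 GB compressed, ~250 GB uncompressed):

```python
# Implied compression ratios for the size estimate above.
# 250 GB raw vs. 20-40 GB compressed; figures quoted from the thread.
uncompressed_gb = 250
ratios = {c: uncompressed_gb / c for c in (20, 40)}

for compressed_gb, ratio in ratios.items():
    print(f"{compressed_gb} GB compressed -> {ratio:.2f}x ratio")
```

A 6-12x ratio is plausible for highly repetitive JSON text under a general-purpose compressor such as gzip.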
That's slightly bigger than our current limit (10 GB), but you could either split it into two datasets or hit me up and I can help you set up an organization that can upload bigger datasets.
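Splitting a large archive into parts under an upload cap is straightforward; a minimal sketch, where the chunk size and file paths are illustrative rather than anything prescribed in the thread:

```python
# Sketch: split a large file into sequential parts, each at most
# chunk_bytes, so every part fits under an upload size limit
# (e.g. the 10 GB cap mentioned above). Paths are hypothetical.
def split_file(path, chunk_bytes, out_prefix):
    """Write the file at `path` into parts named out_prefix0000, 0001, ..."""
    parts = []
    with open(path, "rb") as src:
        idx = 0
        while True:
            chunk = src.read(chunk_bytes)
            if not chunk:
                break
            part_name = f"{out_prefix}{idx:04d}"
            with open(part_name, "wb") as dst:
                dst.write(chunk)
            parts.append(part_name)
            idx += 1
    return parts
```

Recombining is just concatenating the parts in order (e.g. `cat part_* > dataset.gz` on Unix). On a real 250 GB file you would more likely reach for the standard `split -b` utility than a Python loop.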
As for licensing, I am not a lawyer, but you might be able to release the data under a license similar to the one used for OMNeT++, which stipulates academic use only. (Although of course you won't want to reserve the right to monetize the data!)
IANAL either, but the licensing would probably be a pass-through from the originating source, since I essentially act as an aggregator. Whatever licensing applies at the source would apply to the datasets, and if the data has to be locked to academic use only in some situations to appease the source, then I'm all for it. The mission is to keep data open for research opportunities without profiting off of it directly. I say directly because I still accept donations for the work involved in aggregation.