I should have the #kavanaugh Twitter dataset ready for download on Friday. I am doing some sanity checks on the data and prepping it for the dump. I'll have a more accurate tweet count in a day or two.
#datasets #bigdata #datascience
Uncompressed, the dataset is currently around 375 GB. That should be over 100 million tweets.
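As a rough sanity check on those two figures, here is a minimal sketch of the implied average record size. It assumes 1 GB = 10^9 bytes and treats 100 million as a lower bound on the tweet count; the names are made up for illustration.

```python
def average_bytes_per_tweet(total_bytes, tweet_count):
    """Average on-disk size of one tweet record, in bytes."""
    return total_bytes / tweet_count

# Figures from the thread; the decimal GB definition is an assumption.
DATASET_BYTES = 375 * 10**9
MIN_TWEETS = 100_000_000

# With more than 100 million tweets, the true average can only be lower.
upper_bound = average_bytes_per_tweet(DATASET_BYTES, MIN_TWEETS)
print(f"at most about {upper_bound:.0f} bytes per tweet on average")
```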
Will there be handles or bios attached to each tweet? I was thinking I would use keywords in those, like maga or resist, as a kind of negative/positive labeling method.
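A minimal sketch of that keyword idea, assuming each record exposes the user's bio as a plain string. The keyword sets and the `weak_label` helper are hypothetical; only "maga" and "resist" come from the thread, and which group counts as positive or negative is left open.

```python
import re

# Hypothetical keyword groups; extend with whatever terms fit the analysis.
KEYWORD_GROUPS = {
    "maga": {"maga"},
    "resist": {"resist"},
}

def weak_label(bio):
    """Return the name of the single keyword group found in a bio, else None."""
    # Lowercase and split on non-alphanumerics so "#MAGA" yields the token "maga".
    tokens = set(re.findall(r"[a-z0-9]+", (bio or "").lower()))
    matches = [name for name, words in KEYWORD_GROUPS.items() if tokens & words]
    # No match, or matches in several groups, is treated as unlabeled.
    return matches[0] if len(matches) == 1 else None

print(weak_label("Proud dad. #MAGA"))
print(weak_label("dog lover"))
```

This only matches whole tokens, so a bio containing "#TheResistance" would stay unlabeled unless that term is added to a group. Labels produced this way are noisy and are best treated as a starting point.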
I have a tweet classification neural network, built as a side project, that is basically ready to go for it. Except it was built for 140-character tweets, so I'll need a few adjustments.
No, I just need to make some topology design decisions. It uses convolution and a GRU recurrent layer. Just a question of what adjustments to make. The code changes will be easy.
Does this program do sentiment analysis on data? Do you have a website showing some previous work that was done using this program? I am very interested in learning more! Thank you!
1/ It classified tweets as either hate speech, offensive speech, or neither, based on over 10k tweets labeled as such by volunteers and experts (I'm not sure what made the experts experts but whatevs). It's on my GitHub, but the repo is a complete mess and I need to fix it up.
end/ I stopped working on it when I started my new job, but now that I'm comfortable in my role I've been meaning to restart work on it. This data dump is a great place to restart. I'm on your Slack data channel, so we could talk about it there.

