I analyzed 24.7 million original tweets from all verified accounts and created a word-frequency CSV that can be used as a baseline for comparison against individual accounts. The file contains the top 250,000 words, with the total count and frequency for each.
files.pushshift.io/twitter/twitte
#datascience
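A minimal sketch of what a baseline comparison could look like, assuming the baseline has already been loaded into a dict mapping each word to its relative frequency. The CSV's column layout is not given here, so the function name, the dict shape, and the `floor` value for words missing from the baseline are all hypothetical.

```python
import math
from collections import Counter


def overused_words(account_words, baseline_freq, floor=1e-9, top_n=10):
    """Rank an account's words by log2(account frequency / baseline frequency).

    account_words: iterable of word tokens from one account (hypothetical input).
    baseline_freq: dict of word -> relative frequency in the baseline
                   (assumed shape; the real CSV columns are not specified).
    floor: assumed stand-in frequency for words absent from the baseline.
    """
    counts = Counter(w.lower() for w in account_words)
    total = sum(counts.values())
    scores = {
        word: math.log2((count / total) / baseline_freq.get(word, floor))
        for word, count in counts.items()
    }
    # Highest score first: words the account uses far more than the baseline.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

A positive score means the account uses the word more often than the baseline does, and a negative score means less often.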
It adheres closely to a power-law (Zipfian) distribution. Interestingly, there is an inflection point at a rank in the thousands; splitting the data there would give two different exponents (one for ranks below ~5,000 and one for ranks above ~5,000). It would be great to hear some hypotheses on why.
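One way the two-exponent idea could be checked is to fit a separate straight line in log-log space on each side of the split. This is only a sketch: the split rank of 5,000 is the rough figure from the thread, the least-squares fit is a simple stand-in for a proper power-law estimator, and `synthetic_counts` generates made-up data (exponents 1 and 2 chosen arbitrarily), not the real word counts.

```python
import numpy as np


def fit_two_exponents(freqs, split_rank=5000):
    """Estimate power-law exponents for ranks below and at/above split_rank.

    freqs: positive word counts in any order; they are sorted descending here.
    Returns (head_exponent, tail_exponent) as positive numbers.
    """
    freqs = np.sort(np.asarray(freqs, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    log_r, log_f = np.log10(ranks), np.log10(freqs)
    head = ranks < split_rank
    # np.polyfit with degree 1 returns [slope, intercept].
    slope_head, _ = np.polyfit(log_r[head], log_f[head], 1)
    slope_tail, _ = np.polyfit(log_r[~head], log_f[~head], 1)
    return -slope_head, -slope_tail


def synthetic_counts(n=20000, split_rank=5000, scale=1e9):
    """Made-up rank-frequency data: exponent 1 before the split, 2 after."""
    ranks = np.arange(1, n + 1, dtype=float)
    head = scale / ranks
    tail = (scale / split_rank) * (ranks / split_rank) ** -2.0
    return np.where(ranks < split_rank, head, tail)
```

Calling `fit_two_exponents(synthetic_counts())` should recover the two exponents built into the synthetic data; on real counts, the gap between the two fitted values gives a rough measure of how sharp the bend is.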
Also, I'm at a complete loss to explain the change in slope around rank 5,000. I'd have to think on that for a bit. Great observation.
My first stab would be that it might have something to do with the presence of distinct interest groups, and that these groups may also adhere to a power law distribution in terms of their size. This would mean that the first 5000 words reflect most of the common shared words.
Also, there is a chunk of verified accounts that are support accounts (like Verizon, Delta, etc.) for big companies that have standardized "We're sorry you experienced blah blah" -- I wonder if collectively they are giving additional weight to more common words.
How hard is it to produce a log-log plot of the number of unique *tweets*? It would be interesting to see the top 20 most tweeted things, even if it's just a single 😂
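A rough sketch of how the tweet-level version could be computed, assuming the tweet texts are available as a list of strings and that "the same tweet" means an exact text match after trimming whitespace. Both are assumptions, and the function name is hypothetical.

```python
import math
from collections import Counter


def tweet_rank_frequency(tweets, top_n=20):
    """Count identical tweets and return the top_n plus log-log plot points.

    tweets: iterable of tweet texts (assumed input format).
    Returns (top, loglog), where top is a list of (text, count) pairs and
    loglog is a list of (log10(rank), log10(count)) pairs for every unique tweet.
    """
    counts = Counter(t.strip() for t in tweets)
    ranked = counts.most_common()  # sorted by count, highest first
    top = ranked[:top_n]
    loglog = [
        (math.log10(rank), math.log10(count))
        for rank, (_, count) in enumerate(ranked, start=1)
    ]
    return top, loglog
```

The `loglog` pairs can be passed straight to any plotting library as x and y values, and `top` answers the "top 20 most tweeted things" question directly.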


