I analyzed 24.7 million original tweets from all verified accounts and created a word-frequency CSV that can serve as a baseline for comparison against individual accounts. The file contains the top 250,000 words, with the total count and frequency of each.
files.pushshift.io/twitter/twitte
#datascience
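A rough sketch of how a baseline comparison like this could work, assuming the baseline has been loaded into a dict mapping each word to its relative frequency. The CSV's actual column layout is not shown here, so loading is left out, and the names `relative_freqs`, `overused_words`, and `floor` are hypothetical. It scores each word an account uses by the log2 ratio of the account's frequency to the baseline frequency.

```python
import math
from collections import Counter


def relative_freqs(counts):
    """Turn raw word counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def overused_words(account_counts, baseline_freq, floor=1e-9):
    """Rank an account's words by log2(account freq / baseline freq).

    `floor` is an assumed stand-in frequency for words missing from the
    baseline, so the ratio stays finite.
    """
    account_freq = relative_freqs(account_counts)
    scores = {
        word: math.log2(freq / baseline_freq.get(word, floor))
        for word, freq in account_freq.items()
    }
    # Highest score first: words the account uses far more than the baseline.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy numbers for illustration only, not taken from the real file.
baseline = {"the": 0.5, "sorry": 0.01}
account = Counter({"the": 5, "sorry": 5})
ranked = overused_words(account, baseline)
```

A positive score means the account uses the word more often than the baseline does, and a score near zero means it matches the baseline.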
It adheres closely to a power law (Zipfian) distribution. Interestingly, there is an inflection point in the thousands rank that, if used to split the data, would allow for two different exponents (one for ranks below roughly 5,000 and one for ranks above). It would be great to hear some hypotheses on why.
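One way to check the two-exponent idea is to fit a straight line to log(frequency) against log(rank) on each side of a chosen split rank. A minimal sketch follows, using plain least squares on synthetic data. The function names, the split at 5,000, and the exponents 1.0 and 1.5 are assumptions made up for the demo, not values measured from the file.

```python
import math


def loglog_slope(ranks, freqs):
    """Least-squares slope of log(freq) against log(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def split_exponents(freqs_by_rank, split_rank):
    """Fit separate power-law exponents for ranks <= split_rank and above it.

    `freqs_by_rank[0]` is the frequency of the rank-1 word. Returns the
    exponents as positive numbers (the negated log-log slopes).
    """
    pairs = list(enumerate(freqs_by_rank, start=1))
    head = [(r, f) for r, f in pairs if r <= split_rank]
    tail = [(r, f) for r, f in pairs if r > split_rank]
    head_slope = loglog_slope(*zip(*head))
    tail_slope = loglog_slope(*zip(*tail))
    return -head_slope, -tail_slope


# Synthetic rank-frequency curve with an assumed kink at rank 5,000:
# exponent 1.0 before the kink, 1.5 after, continuous at the kink.
SPLIT = 5000
synthetic = [
    r ** -1.0 if r <= SPLIT else SPLIT ** -1.0 * (r / SPLIT) ** -1.5
    for r in range(1, 20001)
]
head_exp, tail_exp = split_exponents(synthetic, SPLIT)
```

On real counts, an ordinary log-log fit is only a quick look. It weights the sparse low ranks and the noisy high ranks unevenly, so the fitted exponents should be read as rough.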
Also, I'm at a complete loss to explain the change in direction around 5,000. I'd have to think on that for a bit. Great observation.
My first stab would be that it might have something to do with the presence of distinct interest groups, and that these groups may also adhere to a power law distribution in terms of their size. This would mean that the first 5000 words reflect most of the common shared words.
Also, there is a chunk of verified accounts that are support accounts (like Verizon, Delta, etc.) for big companies that have standardized "We're sorry you experienced blah blah" -- I wonder if collectively they are giving additional weight to more common words.

