I analyzed 24.7 million original tweets from all verified accounts and created a word-frequency CSV that can serve as a baseline for comparison against individual accounts. The file contains the top 250,000 words, with the total count and frequency of each.
files.pushshift.io/twitter/twitte
#datascience
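A minimal sketch of how a baseline like this might be compared against one account. The tokenizer regex, the function names, and the in-memory dict format are all assumptions for illustration. The thread does not say how words were split or how the CSV columns are laid out, so loading the file is left out.

```python
import re
from collections import Counter


def word_frequencies(texts):
    """Return {word: share of all tokens} for an iterable of strings.

    The regex tokenizer is an assumption; the original analysis may have
    split words differently.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def overuse_ratios(account_freq, baseline_freq, floor=1e-9):
    """Ratio of an account's word frequency to the baseline frequency.

    Values above 1 mean the account uses the word more than the baseline.
    `floor` (hypothetical) avoids division by zero for unseen words.
    """
    return {
        word: freq / max(baseline_freq.get(word, 0.0), floor)
        for word, freq in account_freq.items()
    }


# Toy data, not taken from the real file.
account = word_frequencies(["data data science", "data"])
baseline = {"data": 0.25, "science": 0.25}
ratios = overuse_ratios(account, baseline)
```

With the toy inputs above, "data" makes up three of the account's four tokens, so its ratio against a baseline share of 0.25 is 3.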
It adheres closely to a power-law (Zipfian) distribution. Interestingly, there is an inflection point at a rank in the thousands; splitting the data there would allow two different exponents (one for ranks below ~5000 and one for ranks above ~5000). It would be great to hear some hypotheses on why.
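One rough way to check the two-exponent idea is to fit a straight line to log(count) versus log(rank) on each side of the split. The sketch below assumes the counts are already sorted in descending order and uses a break at rank 5000 as suggested above. The function names are hypothetical, and the synthetic data (exponents 1.0 and 1.5) is invented purely to show that the fit recovers known slopes; it says nothing about the real file. Least squares on log-log data is only a quick estimate, not a rigorous power-law fit.

```python
import numpy as np


def fit_zipf_exponent(ranks, counts):
    """Estimate s in count ~ rank**(-s) by least squares in log-log space."""
    slope, _intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
    return -slope


def fit_two_regimes(counts, break_rank=5000):
    """Fit separate exponents for ranks <= break_rank and ranks > break_rank.

    `counts` must be sorted in descending order so that index + 1 is the rank.
    """
    counts = np.asarray(counts, dtype=float)
    ranks = np.arange(1, len(counts) + 1)
    head = ranks <= break_rank
    return (
        fit_zipf_exponent(ranks[head], counts[head]),
        fit_zipf_exponent(ranks[~head], counts[~head]),
    )


# Synthetic counts with a known break, for illustration only.
demo_ranks = np.arange(1, 20001)
synthetic = np.where(
    demo_ranks <= 5000,
    1e6 / demo_ranks,                           # exponent 1.0 below the break
    (1e6 / 5000) * (demo_ranks / 5000) ** -1.5  # exponent 1.5 above the break
)
head_exponent, tail_exponent = fit_two_regimes(synthetic, break_rank=5000)
```

On real data the break rank could be varied to see where the two fitted exponents differ most.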
awesome graph -- great work! That is super interesting.
Zipf curves are still one of the most fascinating aspects of language for me. en.wikipedia.org/wiki/Zipf%27s_
Also, I did this in the past for Reddit over a larger set. The file is here if you are interested in plotting it: files.pushshift.io/misc/Reddit_on