Conversation

I need to create a word frequency database in order to compare groups of data (for instance, one user's tweets / YouTube comments) to extract words / phrases that are used more often by that particular user / group compared to the universal word usage frequency. Do you think
it matters if I use tweets or Reddit comments or YouTube comments to generate the universal word frequency table? Would restrictions on the length of a comment / tweet influence the usage frequency of certain words? Would it make sense to combine Reddit, YouTube comments and
Replying to
A lot of great advice here -- thank you! What I'm trying to do is create a universal word frequency table so that we can then scan users to see which users are using terms / words at a much higher frequency than you would expect by using the universal set. For instance, if you
search for users that use the word "vaccine(s)", you will get X users back and many of them are just using it occasionally (perhaps because there is a discussion of vaccines). But what I want to do is pull out users that use the term much more often than what the global frequency
table would have. Then you can detect users who are purposely trying to spread disinfo, find users who are experts in the topic and like to talk about it, find users who have a strong stance on the subject matter, etc.
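The comparison described above can be sketched in code: build a relative-frequency table from the reference corpus once, then flag terms whose per-user frequency exceeds the global frequency by some ratio. This is a minimal sketch, not the thread author's implementation; `min_count` and `ratio_threshold` are made-up placeholder thresholds to tune on real data.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Per-token frequency as a fraction of total corpus size."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def overused_terms(user_tokens, global_freq, min_count=5, ratio_threshold=3.0):
    """Terms a user employs far more often than the global table predicts.

    `min_count` drops one-off words; `ratio_threshold` is the
    user-to-global frequency ratio above which a term is flagged.
    Both are arbitrary placeholders, not values from the thread.
    """
    counts = Counter(user_tokens)
    total = sum(counts.values())
    flagged = {}
    for word, count in counts.items():
        if count < min_count:
            continue
        user_freq = count / total
        # Tiny floor so words absent from the global table don't divide by zero.
        ratio = user_freq / global_freq.get(word, 1e-9)
        if ratio >= ratio_threshold:
            flagged[word] = ratio
    return flagged
```

A user whose tokens are 20% "vaccine" against a global frequency of 0.1% would come back with a ratio of 200, well above any sensible threshold, while common function words like "the" sit near a ratio of 1 and are ignored.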
Just because a user uses a term X times doesn't mean they're discussing that topic much more frequently than others -- they might just be prolific and write a lot of comments / tweets -- but if you can calculate the frequency of use compared to their entire corpus, then you
Replying to
The distributions are going to be different. There isn't a universal reference distribution. The fact that each writer (or gaggle of writers) has a different distribution is used to determine the authorship of documents (e.g. in history: who wrote this anonymous document?).
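One standard statistic for the corpus-vs-corpus comparison this reply alludes to is Dunning's log-likelihood ratio (G²), widely used in corpus linguistics to ask whether a term's frequency in one corpus differs from another by more than chance. A minimal sketch, offered as background rather than anything proposed in the thread:

```python
import math

def log_likelihood(count_a, total_a, count_b, total_b):
    """Dunning log-likelihood (G^2) for one term across corpora A and B.

    count_a / count_b: occurrences of the term in each corpus.
    total_a / total_b: total token counts of each corpus.
    Higher values mean the term's frequency differs more between the
    two corpora than chance alone would explain.
    """
    # Expected counts under the null hypothesis of one shared frequency.
    pooled = (count_a + count_b) / (total_a + total_b)
    expected_a = total_a * pooled
    expected_b = total_b * pooled
    g2 = 0.0
    if count_a > 0:
        g2 += count_a * math.log(count_a / expected_a)
    if count_b > 0:
        g2 += count_b * math.log(count_b / expected_b)
    return 2.0 * g2
```

When the term's proportion is identical in both corpora the statistic is zero; the more one corpus overuses the term relative to the other, the larger it grows, which makes it a natural ranking score for "words this user/group uses unusually often."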