I need to create a word frequency database in order to compare groups of data (for instance, one user's tweets / YouTube comments) to extract words / phrases that are used more often for that particular user / group compared to the universal word usage frequency.
Do you think it matters if I use tweets or Reddit comments or YouTube comments to generate the universal word frequency table? Would restrictions on the length of a comment / tweet influence the usage frequency of certain words?
Would it make sense to combine Reddit, Youtube comments and
Replying to
A lot of great advice here -- thank you! What I'm trying to do is create a universal word frequency table so that we can then scan users to see which users are using terms / words at a much higher frequency than you would expect by using the universal set.
For instance, if you search for users that use the word "vaccine(s)", you will get X users back and many of them are just using it occasionally (perhaps because there is a discussion of vaccines). But what I want to do is pull out users that use the term much more often than the global frequency table would predict. Then you can detect users who are purposely trying to spread disinfo, users who are experts in the topic and like to talk about it, users who have a strong stance on the topic, etc.
Just because a user uses a term X times doesn't mean they're discussing that topic much more frequently than others -- they just might be prolific and write a lot of comments / tweets -- but if you can calculate the frequency of the use compared to their entire corpus, then you can shake out users that are discussing a topic much more frequently for whatever reason.
Replying to
I guess my first question is "which comments?"
so many segregated communities and ways of speaking
Replying to
I'd say that text length is def. correlated with dictionary probabilities!
Replying to
Yes. But I think the larger issue you may face is that different moderation standards lead to different word choices.
Replying to
The distributions are going to be different. There isn't a universal reference distribution. The fact that each writer (or gaggle of writers) has a different distribution is what's used to determine the authorship of documents (e.g. in history, working out who wrote an anonymous document).




