I need to create a word frequency database in order to compare groups of data (for instance, one user's tweets or YouTube comments) and extract words / phrases that a particular user / group uses more often than the universal word usage frequency would predict.
Do you think it matters if I use tweets or Reddit comments or YouTube comments to generate the universal word frequency table? Would restrictions on the length of a comment / tweet influence the usage frequency of certain words?
Would it make sense to combine Reddit, YouTube comments and tweets to create the global word usage frequency table?
Opinions?
A lot of great advice here -- thank you! What I'm trying to do is create a universal word frequency table so that we can then scan users to see which users are using terms / words at a much higher frequency than the universal set would predict.
For instance, if you search for users that use the word "vaccine(s)", you will get X users back and many of them are just using it occasionally (perhaps because there is a discussion of vaccines). But what I want to do is pull out users that use the term much more often than what the global frequency table would have. Then you can detect users who are either purposely trying to spread disinfo, find users who are experts in the topic and like to talk about it, find users who have a strong stance on the topic matter, etc.
Just because a user uses a term X times doesn't mean they're discussing that topic much more frequently than others -- they just might be prolific and write a lot of comments / tweets -- but if you calculate the frequency of the term relative to their entire corpus, then you can shake out users that are discussing a topic much more frequently, for whatever reason.
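The idea above -- normalize a term's count by the user's whole corpus, then compare against the global table -- can be sketched roughly like this. This is just one plausible reading of the approach, not the poster's actual code; the function name, the `min_count` cutoff, and the add-one floor for unseen terms are all my own choices.

```python
from collections import Counter

def distinctive_terms(user_texts, global_freq, global_total, min_count=5):
    """Rank terms a user over-uses relative to a global frequency table.

    user_texts: list of strings (the user's comments / tweets)
    global_freq: dict mapping term -> count in the universal table
    global_total: total token count behind the universal table
    """
    tokens = [w for text in user_texts for w in text.lower().split()]
    counts = Counter(tokens)
    total = len(tokens)
    ratios = {}
    for term, c in counts.items():
        if c < min_count:
            continue  # skip rare terms; ratios on tiny counts are noisy
        user_rate = c / total
        # add-one floor so terms missing from the global table
        # don't cause a division by zero
        global_rate = (global_freq.get(term, 0) + 1) / (global_total + 1)
        ratios[term] = user_rate / global_rate
    # highest ratio = most over-used relative to the universal baseline
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)
```

A prolific user who mentions "vaccine" 50 times but writes 100,000 words will rank far below a terse user who mentions it 20 times in 2,000 words, which is exactly the distinction being drawn above.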
Yes, tf-idf is what I've used to do document similarity analysis in the past, in the absence of a "universal frequency" and having only the document library to work with.
me three. don't have any legit nlp background, but i did "tf-idf diffs" to compare corpora (users, communities) to find distinctive vocab differences, and it worked surprisingly well
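One way to read that "tf-idf diff" trick: treat each corpus as one bag of words, weight terms by tf-idf against a shared background, and subtract the vectors. A sketch under those assumptions (the function name and the smoothed idf formula are mine, not the replier's):

```python
import math
from collections import Counter

def tfidf_diff(corpus_a, corpus_b, background):
    """Score terms by how much more tf-idf weight they carry in
    corpus_a than in corpus_b.

    corpus_a / corpus_b: lists of documents (strings) for each group.
    background: list of documents supplying document frequencies for idf.
    """
    n_docs = len(background)
    df = Counter()
    for doc in background:
        df.update(set(doc.lower().split()))

    def tfidf(corpus):
        tokens = [w for doc in corpus for w in doc.lower().split()]
        tf = Counter(tokens)
        total = len(tokens)
        # smoothed idf (log((1+N)/(1+df)) + 1) keeps weights positive
        return {t: (c / total) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                for t, c in tf.items()}

    a, b = tfidf(corpus_a), tfidf(corpus_b)
    # positive scores = vocabulary more distinctive of corpus_a
    return sorted(((t, a[t] - b.get(t, 0.0)) for t in a),
                  key=lambda kv: kv[1], reverse=True)
```

The top of the returned list is the vocabulary most distinctive of one user or community versus the other, which matches the "distinctive vocab differences" use case described.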
I often do a keyword analysis in these cases. All posts from the same platform or on the same topic as the reference corpus, and all user posts as the sample corpus. Log likelihood gives very good results. Not universal per se, but a good basis for comparison.
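The log-likelihood keyness statistic this reply describes is usually computed from observed vs expected counts in the sample and reference corpora (the Rayson & Garside formulation). A minimal sketch, assuming whitespace tokenization; the function name is mine:

```python
import math
from collections import Counter

def keyword_log_likelihood(sample_texts, reference_texts):
    """Log-likelihood keyness of each sample-corpus term vs a reference corpus."""
    sample = Counter(w for t in sample_texts for w in t.lower().split())
    ref = Counter(w for t in reference_texts for w in t.lower().split())
    c, d = sum(sample.values()), sum(ref.values())  # corpus sizes
    scores = {}
    for term in sample:
        a, b = sample[term], ref.get(term, 0)  # observed counts
        e1 = c * (a + b) / (c + d)  # expected count in sample
        e2 = d * (a + b) / (c + d)  # expected count in reference
        # 2 * sum(observed * ln(observed / expected)), with 0*ln(0) = 0
        ll = 2 * (a * math.log(a / e1) +
                  (b * math.log(b / e2) if b else 0.0))
        # keep only terms over-represented in the sample corpus
        if a / c > b / d:
            scores[term] = ll
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

High scores flag terms the user employs far more than the platform baseline, which is the same "higher than expected frequency" signal the thread starter is after.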



