I am currently working on a very large word frequency distribution table built from over one billion Reddit comments. This is more challenging than I originally thought, because I have to do a lot of filtering to remove bots from the word source: bots like AutoModerator inflate the frequency of certain words. The first iteration of this project will just use all words, lowercased.
I'll be putting up a sample soon. Sorry if you have messaged me and I haven't had a chance to respond yet.
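For anyone following along, here is a rough sketch of what the first iteration described above might look like: count lowercased words while skipping comments from known bot accounts. This is not the actual pipeline. The `author` and `body` field names, the `KNOWN_BOTS` set, and the tokenizing regex are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical bot list; the thread does not say which accounts get filtered.
KNOWN_BOTS = {"automoderator"}

# Assumed tokenizer: runs of lowercase letters, digits, and apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def word_frequencies(comments, bots=KNOWN_BOTS):
    """Count lowercased words, skipping comments written by known bots."""
    counts = Counter()
    for comment in comments:
        # Compare author names case-insensitively.
        if comment["author"].lower() in bots:
            continue
        counts.update(WORD_RE.findall(comment["body"].lower()))
    return counts


# Tiny made-up sample; real input would be streamed from a comment dump.
sample = [
    {"author": "AutoModerator", "body": "Your post has been removed."},
    {"author": "alice", "body": "The cat sat on the mat."},
    {"author": "bob", "body": "The dog sat too."},
]
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

At a billion comments the same loop would need to read the input in chunks rather than hold it in a list, but the counting logic stays the same.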
yeah i've run into weird data artifacts working with reddit comments for exactly this reason. i think i grouped by comment text to find the bot accounts posting identical text (maybe just first/last N chars to try to catch the templates).
they are also often top posters by count when grouping by subreddit.
looking forward to seeing what you release!
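The two heuristics in this reply could be sketched roughly as below: group comments by the first N characters of their text to surface accounts that repeat a template, and rank authors by comment count within each subreddit. The function names, the `prefix_len` and `min_repeats` thresholds, and the `author`, `body`, and `subreddit` field names are assumptions for illustration, not what was actually run.

```python
from collections import Counter, defaultdict


def find_template_posters(comments, prefix_len=40, min_repeats=3):
    """Map each author to the highest number of times they repeated one prefix."""
    seen = Counter()
    for c in comments:
        # Only the first prefix_len characters are compared, so templates
        # that differ at the end still fall into the same group.
        prefix = c["body"][:prefix_len].strip().lower()
        seen[(c["author"], prefix)] += 1
    suspects = {}
    for (author, _prefix), n in seen.items():
        if n >= min_repeats:
            suspects[author] = max(suspects.get(author, 0), n)
    return suspects


def top_posters(comments, n=1):
    """Return the n most frequent authors for each subreddit."""
    per_sub = defaultdict(Counter)
    for c in comments:
        per_sub[c["subreddit"]][c["author"]] += 1
    return {sub: counts.most_common(n) for sub, counts in per_sub.items()}


# Made-up sample: one account repeating a template, one ordinary user.
template = "I am a bot, and this action was performed automatically. Ref "
sample = [
    {"author": "helper_bot", "subreddit": "test", "body": template + str(i)}
    for i in range(3)
]
sample.append({"author": "alice", "subreddit": "test", "body": "Nice write-up!"})

suspects = find_template_posters(sample)
leaders = top_posters(sample)
print(suspects, leaders)
```

Both checks only produce candidates. A human who posts a lot, or quotes the same line often, would show up too, so the output is a list to review rather than a verdict.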
Also producing a raw frequency list? Or just distribution? Could be useful as a reference list for keyword analysis
how are you identifying bots, and would you be willing to publish a list of bot accounts you detect? would be helpful to many I’m sure :)





