The new #pushshift API, which will soon be out of beta, makes detecting bots on social media a breeze. Comment data from the first week of August was aggregated to find the accounts with the lowest median reply delays to other comments.
#datascience #dataviz #reddit
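A minimal sketch of the aggregation described in the post, assuming each comment is a dict with the fields `id`, `parent_id`, `author` and `created_utc` (epoch seconds), and that `parent_id` holds the bare id of the parent comment. The function name, the record layout and the `min_replies` cutoff are illustrative assumptions, not the Pushshift API's schema or the code behind the original result.

```python
from collections import defaultdict
from statistics import median


def median_reply_delays(comments, min_replies=3):
    """Return (author, median reply delay in seconds) pairs, fastest first."""
    # Index creation times by comment id so each parent can be looked up.
    created = {c["id"]: c["created_utc"] for c in comments}

    delays = defaultdict(list)
    for c in comments:
        parent_time = created.get(c["parent_id"])
        if parent_time is None:
            continue  # parent is a submission or lies outside the sample
        delays[c["author"]].append(c["created_utc"] - parent_time)

    # Require a minimum number of replies so one fast reply is not decisive.
    ranked = [(a, median(d)) for a, d in delays.items() if len(d) >= min_replies]
    return sorted(ranked, key=lambda pair: pair[1])
```

Accounts at the top of the returned list reply fastest and are candidates for closer inspection. A low median delay on its own does not prove that an account is automated.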
Interesting heuristic to detect bots! Have you done any evaluation on it yet?
Thank you Tim! I have not gotten to that stage. My main focus at the moment is getting the new API ready for prime-time, improving the documentation for the Pushshift API and making sure all the checks pass. I do want to get back to bot detection research, but that will probably have to wait for a bit. If you would be willing, I could share the data with you and give you some of the preliminary code I've written for this. To prevent falling into the trap of data wrangling, I am going to split the data into two pieces and apply tests to the second piece.
I would very much appreciate your expertise in this -- perhaps you could swing over to my data science Slack workspace and I can show you the new dataviz endpoints!
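The two-piece split mentioned above could look something like the sketch below. It assumes comment dicts with an `author` field and splits by a hash of the author name, so that all of an account's comments land in the same piece. Splitting by account, the function name and the 50/50 default are assumptions made for illustration; the thread does not say how the split will be done.

```python
import hashlib


def split_by_author(comments, holdout_fraction=0.5):
    """Split comments into (explore, holdout) lists, keeping each author whole."""
    explore, holdout = [], []
    threshold = holdout_fraction * 2**64
    for c in comments:
        # A stable hash of the author name gives the same split on every run.
        digest = hashlib.sha256(c["author"].encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big")
        (holdout if bucket < threshold else explore).append(c)
    return explore, holdout
```

Heuristics and thresholds would be tuned on the first piece only, and the second piece would be touched once, for the final tests.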
Sounds really interesting! I am completely snowed under at the moment, especially this time of semester, so I can't get involved in any more project work at this stage. But happy to chat more and offer any assistance I can with e.g., feature engineering for bot detection
(Off the top of my head) - there are a few other heuristics for classifying bot accounts that might be helpful, such as:
(1) ratio of friends to followers;
(2) average number of posts/comments per unit of time;
(3) mean lexical complexity of text content such as comments
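A rough sketch of how the three heuristics above might be turned into per-account features. All function names are hypothetical, and the type-token ratio (distinct words divided by total words) is used here only as a crude stand-in for lexical complexity; the thread does not name a specific measure.

```python
import re


def friend_follower_ratio(friends, followers):
    """Heuristic (1): friends divided by followers, or None if there are no followers."""
    return friends / followers if followers else None


def comments_per_hour(timestamps):
    """Heuristic (2): average comments per hour over the observed span (epoch seconds)."""
    if len(timestamps) < 2:
        return 0.0
    span_hours = (max(timestamps) - min(timestamps)) / 3600
    return len(timestamps) / span_hours if span_hours > 0 else float("inf")


def type_token_ratio(text):
    """Distinct words divided by total words; 0.0 for text with no words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_type_token_ratio(texts):
    """Heuristic (3), approximated: mean type-token ratio across an account's comments."""
    ratios = [type_token_ratio(t) for t in texts]
    return sum(ratios) / len(ratios) if ratios else 0.0
```

Type-token ratio is sensitive to comment length, so comparing accounts with very different typical comment lengths would need some normalisation.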
Also: maybe you could extract bot labels from self-identified bots (in their username) and community-identified bots (e.g., via reddit.com/r/BotBust/), and use such a labelled dataset to evaluate your bot detection approach?
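A small sketch of the evaluation idea above, assuming self-identified bots can be picked out by a naive username rule (names ending in "bot" or starting with "bot_"). The rule and both function names are illustrative assumptions; community-identified labels would have to be collected separately and merged into the labelled set.

```python
def looks_self_identified(username):
    """Naive rule: the name ends with 'bot' or starts with 'bot_' (case-insensitive)."""
    name = username.lower()
    return name.endswith("bot") or name.startswith("bot_")


def precision_recall(predicted, labelled):
    """Compare predicted bot accounts against a labelled set of bot accounts."""
    predicted, labelled = set(predicted), set(labelled)
    true_positives = len(predicted & labelled)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(labelled) if labelled else 0.0
    return precision, recall


authors = ["AutoModBot", "alice", "bot_helper", "carol"]
labelled_bots = {a for a in authors if looks_self_identified(a)}
print(precision_recall({"AutoModBot", "alice"}, labelled_bots))
```

Because unlabelled accounts may still be bots, precision measured this way is likely an underestimate, and the username rule will also catch human names such as "abbot".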

