As always, I really love the work that you are doing. I wanted to ask you a question. I have a high enough rate limit for the Perspective API to do real-time scoring on the data as it comes in. I am writing the code now for scoring Reddit, but I'm unsure how...
...to pre-process the Reddit comments before scoring the text. I was thinking the two main things would be to remove links from the text and to make sure that any quoted text is also removed (if a child comment author is quoting the parent comment).
Any suggestions on this?
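For anyone following along, here is a minimal sketch of the two steps proposed above (dropping quoted text and removing links). It assumes the comment body is Reddit-style markdown in which quoted lines start with `>`, possibly HTML-escaped as `&gt;`. The function name `preprocess_comment` and both regexes are illustrative, not part of any Reddit or Perspective tooling.

```python
import html
import re

# Bare URLs such as https://example.com/x or www.example.com
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Markdown links such as [label](https://example.com); keeps only the label
MD_LINK_RE = re.compile(r"\[([^\]]*)\]\((?:https?://|www\.)[^)\s]*\)")


def preprocess_comment(body: str) -> str:
    """Drop quoted lines and links from a Reddit-style markdown comment."""
    # Assumption: the body may arrive HTML-escaped ("&gt;" instead of ">").
    text = html.unescape(body)
    kept = []
    for line in text.splitlines():
        stripped = line.lstrip()
        # Lines starting with ">" are treated as quotes; ">!" is assumed to be
        # spoiler markup rather than a quote, so those lines are kept.
        if stripped.startswith(">") and not stripped.startswith(">!"):
            continue
        kept.append(line)
    text = "\n".join(kept)
    text = MD_LINK_RE.sub(r"\1", text)  # keep link text, drop the target
    text = URL_RE.sub("", text)         # drop bare URLs
    return text.strip()
```

Whether quotes should be dropped at all is a judgment call, so it may be worth keeping the removed lines somewhere rather than discarding them.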
Also, removing quoted text is a neat idea, but I wonder if such text contains semantic information relevant to the Perspective API? E.g. if someone uses racial hate speech, then another user quotes it (and agrees with it), then they are diffusing the hate speech...
Great response! I really appreciate your input. I think I will reach out to their dev team and ask them how best to handle these types of situations. As you said, Perspective may account for URLs and do something special for them.
The API has a field for "context," where one can put quoted text or the thing they reply to. A year ago it was ignored, but maybe that changed? Also there was a 3k-character limit on comments: quotes/links could eat up much of that. Removing double whitespace and non-printing characters (NPCs) helps there.
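A rough sketch of the whitespace and non-printing-character cleanup, assuming "non-printing" means Unicode control (`Cc`) and format (`Cf`) characters. The name `squeeze_text` is made up for illustration.

```python
import re
import unicodedata


def squeeze_text(text: str) -> str:
    """Remove non-printing characters and collapse runs of whitespace."""
    # Assumption: "non-printing" = Unicode categories Cc (control) and
    # Cf (format, e.g. zero-width space). Newline and tab are kept here so
    # the whitespace pass below can turn them into a single space.
    # Caveat: Cf also covers the zero-width joiner used in some emoji
    # sequences, so those emoji will be split into their parts.
    cleaned = "".join(
        ch
        for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # Collapse any run of whitespace (spaces, tabs, newlines) to one space.
    return re.sub(r"\s+", " ", cleaned).strip()
```

Collapsing newlines into spaces saves the most characters, but it also erases paragraph breaks, so run it after any step that relies on line structure (such as quote removal).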
Thank you! I'm getting ready to score Reddit data -- I wasn't aware of the length restrictions. How would you handle that? We could either not score the comment or we could truncate it, keeping the text from the beginning.
For my use, I went with scoring the first x allowable characters. You could get fancy and score it in chunks then average them? I found the char limit by reading error messages: it could have changed since my use so do test it.
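Both options mentioned here (score only the first allowable characters, or score in chunks and average) can be sketched as follows. `MAX_CHARS = 3000` is taken from the "3k" figure in this thread and should be verified against current API behavior, including whether the limit counts characters or bytes. `score_fn` is a hypothetical stand-in for whatever function calls the scoring API and returns a float. The average is weighted by chunk length, which is one choice among several; a plain mean would let a short final chunk count as much as a full one.

```python
from typing import Callable, List

# Assumed limit, based on the "3k char" figure in the thread; verify it.
MAX_CHARS = 3000


def truncate(text: str, limit: int = MAX_CHARS) -> str:
    """Option 1: keep only the first `limit` characters."""
    return text[:limit]


def chunk_text(text: str, limit: int = MAX_CHARS) -> List[str]:
    """Split text into consecutive pieces of at most `limit` characters."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]


def score_in_chunks(
    text: str,
    score_fn: Callable[[str], float],  # hypothetical: wraps the API call
    limit: int = MAX_CHARS,
) -> float:
    """Option 2: score each chunk, then take a length-weighted average."""
    chunks = chunk_text(text, limit)
    if not chunks:
        raise ValueError("cannot score empty text")
    weighted_sum = sum(score_fn(chunk) * len(chunk) for chunk in chunks)
    return weighted_sum / len(text)
```

Fixed-width chunks can cut a word or sentence in half, and averaging can dilute one very toxic passage inside a long comment, so taking the maximum chunk score instead of the mean may suit some uses better.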
Most comments are under that limit. But it's Reddit, so people can and literally do type everything Unicode allows, sometimes at the length of a small novel.
Exactly! You also have subreddits like /r/counting where it is nothing but numbers -- among subreddits with over X comments a month, that one would probably have one of the lowest toxicity ratings. (Now I'm wondering if the model does something interesting for 69 or 1488)
Oh there's a whole WORLD of number codes and shibboleth words that I'd bet are not in the models. See, e.g. Cynthia Miller-Idriss' research.
Oh and she has a Twitter!


