We didn't use Reddit data; we used outbound links from Reddit ranked at two stars or higher by users. Part of the reason for thinking about a different release of this model is precisely issues like this (which also occur in larger models). We'll be sharing more here!
-
I've worked on Reddit in relation to harassment and labelled many samples myself. Vast majority samples unutterably banal. Most bad bits very, very low entropy. Mixed in its parts. Overall verdict, batting a better average than some popular sacred texts. E&OE.
-
Hmm, low entropy as in, in a semantic sense? What would higher entropy harassment language look like then? But I guess the point of your tweet was that it would be relatively simple to de-bias the data source? :)