We begin by proposing two baselines for discovering artifacts in WS-like questions: by perturbing the input into non-sensible text, a model that still performs better than random suggests the data contain artifacts.
pic.twitter.com/D69ofBJwHX
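As a rough sketch of the idea (not the paper's exact procedure), a perturbation baseline can shuffle the tokens of each question and re-run the model; the `predict` function and the question schema below are hypothetical stand-ins for any binary-choice model and dataset:

```python
import random

def scramble(text, rng):
    # Perturb the input into non-sensible text by shuffling its tokens.
    tokens = text.split()
    rng.shuffle(tokens)
    return " ".join(tokens)

def artifact_baseline(questions, predict, seed=0):
    # Accuracy of `predict` on scrambled inputs. A score well above
    # chance (0.5 for binary choice) hints at dataset artifacts,
    # since the scrambled text carries no coherent meaning.
    rng = random.Random(seed)
    correct = sum(
        predict(scramble(q["text"], rng), q["options"]) == q["label"]
        for q in questions
    )
    return correct / len(questions)
```

If a model scores near chance here, its original accuracy more plausibly reflects actual understanding rather than surface cues.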
We find that WSC suffers from such artifacts, while Winogrande's test set, which was automatically filtered, suffers much less.
pic.twitter.com/kbRa1Fr3wZ
Next, we revisit accuracy, the commonly used metric for WS-style questions. Since the datasets are not sampled i.i.d. (they contain minimal pairs), accuracy is a poor fit as a metric, and we propose a new evaluation: Group Scoring.
In Group Scoring, a model gets a point only if it answers both questions of a minimal pair correctly. This also reduces the risk of awarding points on artifact questions, which we show exist in WS datasets (more details and explanations in the paper!).
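A minimal sketch of Group Scoring, assuming each example carries a hypothetical `pair_id` linking the two questions of a minimal pair:

```python
from collections import defaultdict

def group_score(examples, predictions):
    # examples: dicts with "pair_id" and "label";
    # predictions: parallel list of model outputs.
    # A pair earns a point only if BOTH of its questions are correct.
    by_pair = defaultdict(list)
    for ex, pred in zip(examples, predictions):
        by_pair[ex["pair_id"]].append(pred == ex["label"])
    return sum(all(v) for v in by_pair.values()) / len(by_pair)
```

Note the contrast with plain accuracy: a model that answers every question the same way scores 0.5 accuracy on minimal pairs but 0.0 under Group Scoring.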
Then, we discuss the nature of the Winograd Schema as a test for commonsense reasoning and argue that such a test should be used for evaluation, not in a supervised setup. Thus, we use the zero-shot setting to evaluate progress on WS.
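One common way to run a WS question zero-shot (a generic sketch, not necessarily the paper's exact setup) is to substitute each candidate for the ambiguous pronoun and let a frozen LM score the resulting sentences; `log_prob` below is a hypothetical stand-in for any pretrained LM's sentence log-probability:

```python
def zero_shot_answer(sentence, pronoun, candidates, log_prob):
    # Replace the ambiguous pronoun with each candidate and pick the
    # substitution the (frozen, untuned) LM finds most probable.
    scored = [
        (log_prob(sentence.replace(pronoun, cand, 1)), cand)
        for cand in candidates
    ]
    return max(scored)[1]
```

No gradient step touches the LM, so the score reflects what the pretrained model already knows rather than patterns learned from WS training data.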
When we do so and combine our other findings, all the models we experimented with perform worse than random!
We conclude that most of the progress on WS comes not from improvements in language models but from training on the Winogrande dataset, which we shouldn't be using for training in the first place, but for evaluation.
I have many new and open questions after working on this project, such as "how do we separate commonsense knowledge from reasoning?" and "how can we evaluate properly when parts of the data appear in the pretraining corpora of LMs?". If you're interested, let's chat!