let’s try it! @openAI want to test your full model on those?
@JeffDean want to try? @etzioni @YejinChoinka @ylecun ?
would love to see any general-purpose model (not directly tailored to these examples) that can approach grade-school performance.
open challenge to all. https://twitter.com/sterlingcrispin/status/1188286346487455744 …
not interested in multiple choice; i want to see completion à la GPT-2, and my son.
It's a good idea for a benchmark, and it relates to this thread about what NLP should focus on after SuperGLUE: https://twitter.com/sleepinyourhat/status/1188234498955186183 …
Any multiple-choice dataset can be evaluated generatively, as we did for Cosmos QA (https://arxiv.org/abs/1909.00277), but generative evaluation is not yet mainstream because the field does not yet know how to evaluate system output automatically.