let’s try it! @OpenAI want to test your full model on those?
@JeffDean want to try? @etzioni @YejinChoinka @ylecun?
would love to see any general-purpose model (not directly tailored to these examples) that can approach grade-school performance.
open challenge to all. https://twitter.com/sterlingcrispin/status/1188286346487455744
Replying to @GaryMarcus, @OpenAI, and others
I've seen benchmarks with these kinds of commonsense questions, but they always have multiple-choice answers. I don't think any system out there is close to being able to deal with the kind of open-answer format you suggest. I'd be curious to hear if I'm wrong.
Replying to @MelMitchell1, @OpenAI, and others
not interested in multiple choice; i want to see completion à la GPT-2 - and my son.
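As a concrete illustration of the completion-style probe Marcus has in mind, here is a minimal sketch using the publicly released GPT-2 checkpoint via the Hugging Face transformers library; the prompt is a hypothetical commonsense example, not one of Marcus's own:

```python
# Sketch of open-ended completion with GPT-2 (not multiple choice):
# the model is given a commonsense prompt and must continue it freely.
# The prompt below is a made-up example for illustration.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "If you drop a glass bottle onto a concrete floor, the bottle will probably"
for candidate in generator(prompt, max_new_tokens=20, num_return_sequences=3, do_sample=True):
    print(candidate["generated_text"])
```

Whether the continuations reflect grade-school common sense has to be judged by a human reader, which is the evaluation problem raised later in the thread.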
Replying to @GaryMarcus, @OpenAI, and others
Melanie Mitchell, quoting Sam Bowman:
It's a good idea for a benchmark, and relates to this thread about what NLP should focus on after SuperGLUE: https://twitter.com/sleepinyourhat/status/1188234498955186183
Replying to @MelMitchell1, @GaryMarcus, and others
Any multiple-choice dataset can be evaluated generatively, as we did for Cosmos QA (https://arxiv.org/abs/1909.00277), but the reason generative evaluation is not yet mainstream is that the field does not yet know how to evaluate system output automatically.
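To make the point concrete, here is a rough, hypothetical sketch of generative evaluation on a multiple-choice-style item (this is illustrative, not the Cosmos QA authors' pipeline); the crude token-overlap comparison at the end is exactly where automatic evaluation becomes unreliable:

```python
# Generative evaluation sketch: the model must produce a free-form answer,
# which is then compared against the gold answer with a rough overlap score.
# Context, question, and gold answer below are invented for illustration.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

context = "Maya left her umbrella at home, and it started to rain on her walk."
question = "Why might Maya be unhappy?"
gold = "because she is getting wet without her umbrella"

prompt = f"{context}\nQuestion: {question}\nAnswer:"
output = generator(prompt, max_new_tokens=15, do_sample=False)[0]["generated_text"]
generated = output[len(prompt):]  # keep only the model's continuation

def token_overlap(prediction: str, reference: str) -> float:
    """Fraction of reference tokens found in the prediction (a very rough proxy)."""
    predicted = set(prediction.lower().split())
    reference_tokens = reference.lower().split()
    return sum(token in predicted for token in reference_tokens) / len(reference_tokens)

print("model answer:", generated.strip())
print("overlap with gold answer:", round(token_overlap(generated, gold), 2))
```

A correct paraphrase can score near zero under such surface metrics, which is why open-answer formats are much harder to benchmark automatically than multiple choice.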
true. but this is where the benchmarks kowtow to the needs of the field, when in fact the field should move to benchmarks that more deeply probe comprehension, even if they don't fit some preconceived testing paradigm.