Now now. I didn't say they understand nothing; you know very well I wouldn't claim that. The "extend this prompt" demos show that the new models extend the span of "local coherence" far beyond what n-gram models would capture. But they also reveal major semantic failures...
Replying to @tdietterich
The real question is whether deep learning alone can maintain representations of the world as it unfolds in a narrative, article, etc. Aside from the highly limited worlds of the FB bAbI tasks, I have not seen any evidence that it can, and I have given principled reasons to think otherwise.
Replying to @GaryMarcus @tdietterich
Even on the limited-world bAbI tasks, what happens when two or more tasks are combined? I would say the first sign of understanding will come when a system can pass any of the bAbI tasks, individually or collectively, with the same accuracy.
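For concreteness, here is a minimal sketch of the evaluation being proposed, assuming the plain-text task files from the bAbI v1.2 release and an arbitrary predict_answer(story, question) function standing in for whatever model is under test. Pooling two tasks' test sets is one simple way to "combine" them, since the model can no longer rely on knowing which task it is being given.

    # Minimal sketch, not a definitive harness. parse_babi() follows the
    # published bAbI file format: numbered lines, where question lines
    # read "question \t answer \t supporting-fact ids".
    def parse_babi(path):
        """Yield (story_lines, question, answer) triples from a bAbI file."""
        story = []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                idx, _, text = line.partition(" ")
                if int(idx) == 1:        # line number 1 starts a new story
                    story = []
                if "\t" in text:         # question line
                    question, answer, _ = text.split("\t")
                    yield (list(story), question.strip(), answer.strip())
                else:                    # plain statement
                    story.append(text)

    def accuracy(examples, predict_answer):
        """predict_answer(story, question) -> answer stands in for any model."""
        examples = list(examples)
        correct = sum(predict_answer(s, q) == a for s, q, a in examples)
        return correct / len(examples)

    # File names as in the bAbI v1.2 distribution.
    t1 = list(parse_babi("tasks_1-20_v1-2/en/qa1_single-supporting-fact_test.txt"))
    t14 = list(parse_babi("tasks_1-20_v1-2/en/qa14_time-reasoning_test.txt"))
    for name, data in [("task 1", t1), ("task 14", t14), ("pooled", t1 + t14)]:
        print(name, accuracy(data, predict_answer))

The thread's prediction is that the pooled number drops for systems tuned to each task's distribution, even when the per-task numbers look strong.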
That said, the bAbI tasks test very limited aspects of only a few senses. In task 14 (time manipulation), for example, it only tests the ability to understand before/after relations or timespans like morning or evening. Understanding time involves far more complex aspects than that.
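To make that concrete, a stylized item in the spirit of task 14 (paraphrased, not verbatim from the dataset) reduces "understanding time" to ordering a handful of events:

    Story:    This morning Fred went to the kitchen.
              This afternoon Fred journeyed to the park.
              Yesterday Fred was at school.
    Question: Where was Fred before the park?
    Answer:   kitchen

Answering requires only sorting three time markers (yesterday < this morning < this afternoon); nothing about durations, recurring events, tense, or causality is tested.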
Replying to @alok_damle @GaryMarcus
I think bAbI is a very limited (i.e., nearly useless) set of tasks. But I would reword your statement to say "There are many more aspects of time to understand than just that."
Replying to @tdietterich @GaryMarcus
Exactly my point. The performance of state-of-the-art AI would fall apart if one combined just two of these "nearly useless" tasks. And even when statistics-based AI passes this time-manipulation task, it doesn't understand even "just that".
The problem is that the focus is always on the benchmarks rather than on the approach the system takes to pass them. If the approach is human-level, then the system can pass the bAbI tasks as well as the one
@GaryMarcus is proposing.
Replying to @alok_damle @GaryMarcus
Agreed. "Benchmarking disease" is particularly bad when there is only one benchmark (e.g., ImageNet's 1000 categories). We need many diverse tasks to prove that AI/ML techniques are general.
Replying to @tdietterich @GaryMarcus
More than diverse, they should be human-level benchmarks. Marking a phrase in a text as the answer (as in SQuAD) is not something we ask our kids to do in exams.
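For reference, a SQuAD answer is literally a substring of the passage plus its character offset. A simplified item in the v1.1 JSON layout (illustrative values, not an actual entry from the dataset) looks like:

    {
      "context": "The Amazon rainforest covers much of the Amazon basin of South America.",
      "question": "What does the Amazon rainforest cover?",
      "answers": [{"text": "much of the Amazon basin", "answer_start": 29}]
    }

A system can score well by matching surface patterns between the question and the sentence containing the span, with no need to represent what the passage describes.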
The ultimate test should be training one model per task and then requiring good performance across multiple benchmarks for that task. Otherwise it's hard to prove the performance isn't inflated by similarity between the train and test distributions.
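A sketch of that protocol, with train_model() and the benchmark data as assumed stand-ins rather than any real library API: train once on a single benchmark's training split, then score on several held-out test sets.

    # Hypothetical sketch: one model per task, many benchmarks for that task.
    def cross_benchmark_eval(train_set, test_sets):
        model = train_model(train_set)            # e.g. trained on SQuAD only
        scores = {}
        for name, examples in test_sets.items():  # e.g. SQuAD, NewsQA, TriviaQA
            correct = sum(model.predict(x["input"]) == x["answer"]
                          for x in examples)
            scores[name] = correct / len(examples)
        return scores

A large gap between the in-distribution score and the others is exactly the inflation-from-train/test-similarity effect described above.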