A few thoughts to follow. This isn't the result of any deep reflection or understanding on my part, just a few first impressions based on a quick read of the paper, some of the background papers and other materials. Caveat emptor.
First, GPT-3 is just plain fun. I mean, look at this proposal for a new religion, based on a prompt by @flantz: https://twitter.com/flantz/status/1284322274313752576 pic.twitter.com/W4Fbz02j8b
Or consider that simply telling it to improve on its first attempt at an essay can plausibly result in better outcomes (the author @nicklovescode works at @openAI): https://twitter.com/nicklovescode/status/1284685741759492096
One thing that stands out: GPT-3 does lots of things remarkably well, and lots of things poorly. Here's a nice example of the latter, from @gwern (https://www.gwern.net/GPT-3#parity): it fails to learn how to determine the parity of a bit string. pic.twitter.com/wRgSOKtTNc
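For concreteness, "parity" here just means whether the number of 1s in the string is even or odd. The task GPT-3 struggles with is a one-liner in code (my illustration, not from the paper):

```python
def parity(bits: str) -> int:
    """Return 0 if the bit string contains an even number of 1s, else 1."""
    return bits.count("1") % 2

print(parity("1011"))  # three 1s, so odd parity: prints 1
print(parity("0000"))  # zero 1s, so even parity: prints 0
```

That triviality is part of what makes the failure interesting: the difficulty is evidently about what GPT-3's training and representation make learnable, not about the task itself.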
Another example: after being shown 10,000 examples of reversed words (e.g. "gradient -> tneidarg") and then asked to reverse some words, it gets almost everything wrong.
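Again, the transformation being learned is trivial to state in code, which is what makes the failure striking. A minimal sketch of generating pairs in the format above (my illustration):

```python
def reverse_word(word: str) -> str:
    """The transformation GPT-3 was shown thousands of examples of."""
    return word[::-1]

# Training-style pairs in the "word -> drow" format from the thread:
pairs = [f"{w} -> {reverse_word(w)}" for w in ["gradient", "descent", "network"]]
print(pairs[0])  # gradient -> tneidarg
```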
It does a bit better on what you might call "easy anagrams" -- anagrams where the first and last letters are held constant. Still, it gets only about 15% correct.
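To make "easy anagram" precise: only the interior letters are scrambled, so the first and last letters anchor the word. A sketch of how such examples might be generated (my construction, not the authors'):

```python
import random

def easy_anagram(word: str, rng: random.Random) -> str:
    """Scramble only the interior letters, keeping first and last fixed."""
    if len(word) <= 3:
        return word  # nothing interior to scramble
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

rng = random.Random(0)
print(easy_anagram("gradient", rng))  # starts with "g", ends with "t"
```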
So: it can do what appear to be some very complicated tasks remarkably well, and fails miserably at certain other tasks that seem _a priori_ much easier.
What's more, it's not always obvious in advance which tasks it will do well on and which poorly. Nor, for that matter, is it always obvious after the fact! I'm not sure how I'd write a checker for "Does this sound like a plausible religion or not?"
It's a kind of brittleness that shows up often in AI systems. For instance, a few years back people got really good at building systems that are fantastic at solving certain image recognition tasks. In certain (narrow) ways even better than humans: http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/ pic.twitter.com/zAl1iKxjSg
That's an amazing breakthrough. But: many early systems turned out to be extremely brittle. You could build adversarial examples, changing just a few pixels to cause the system to incorrectly classify a screwdriver as a St. Bernard dog (etc). https://arxiv.org/abs/1802.08195 pic.twitter.com/LgL1U1Cm8m
(Aside: a more recent paper on adversarial examples that fool both computers and time-limited humans: https://arxiv.org/abs/1802.08195 )
The discovery of adversarial examples was great news: it acted as a prompt to go much deeper into the systems, to understand better what was going on and how they were brittle. I don't know the current state of the art, but such problems are a great stimulus to build better systems.
Similarly, trying to understand _why_ GPT-3 fails at parity or at word reversing seems like a good challenge. And, of course, human beings _also_ have trouble computing parity for bit strings ("hang on, where was I?"), or get taken in by visual illusions.
Almost certainly, using a richer internal representation would help for these particular problems. GPT-3 doesn't use a character-level representation, but rather BPEs (byte-pair encodings) that chunk characters together: https://www.gwern.net/GPT-3#bpes (@gwern). pic.twitter.com/YXsu2Ilgip
It seems plausible that extending it to use both character-level and BPE-level representations could help with problems like parity, anagrams, and word reversal.
One of the most striking things about GPT-3 is that it's not so different from GPT-2. It's much, much bigger, though, and consumed far more resources. They did tweak it in a number of ways: pic.twitter.com/c16wJb5lhg
Also interesting to see the scale of computation used during training, and at runtime: pic.twitter.com/7oniPdVHHQ
Going back to the striking patterns of strengths and weaknesses, I'm reminded of Minsky's "Society of Mind", the notion that there's no magic trick, but rather just a large collection of tricks that need to be put together in a patchwork: pic.twitter.com/vtGpkf5vcd
In that sense, it seems like GPT-3 and similar models may be prototypes for a powerful extra piece of the needed patchwork. We need to understand them much better, but something interesting seems to be going on in them!
It's illuminating to contrast GPT-3 with much older systems like SHRDLU (from 1970): https://en.wikipedia.org/wiki/SHRDLU. If you look closely at the SHRDLU transcript here, it seems much more impressive, in some ways, than GPT-3: it's really doing some sophisticated reasoning. pic.twitter.com/QsI12kQeun
The reason is that SHRDLU genuinely had a sophisticated internal model of the world it described (block world). Its understanding was narrow -- by contrast, GPT-3 is astoundingly responsive across domains -- but quite deep in block world, perhaps deeper in some ways than a human's.
How deep is the understanding underlying GPT-3? One fun thing about reading the paper is realizing: I don't know! In fact, I'm not sure anyone does? Maybe @nottombrown? @AlecRad? One of the other authors?
The reason is that neural nets can be quite opaque, even if you understand their architecture, and how they're trained. That is, they can discover all kinds of structure during training, without the programmer being aware of it.
This happens, for instance, in image classifiers. There are now some great papers which crack open the black box of image classifiers, and start to understand how they work, what representations they've discovered in training.
It turns out that many of the classifiers discover really interesting things about how to see, as they're trained, without the programmers being conscious of those things. @distillpub has quite a few fun papers about this, e.g.: https://distill.pub/2019/activation-atlas/
Returning to language models, I suspect a way to make progress is: (a) break open the black box of language models like GPT-3, trying to understand what internal representations they've discovered; (b) understand the deficiencies of those representations; and (c) fix them in a new architecture.
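One standard tool in the spirit of (a) is a linear probe: fit a simple classifier on a model's hidden activations to test whether some property is linearly encoded in them. A toy sketch with synthetic activations (everything here -- the dimensions, the encoded property, the data -- is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden activations for 200 tokens (dimension 16), where one
# random direction encodes a binary property of each token (e.g. "is a noun").
labels = rng.integers(0, 2, size=200)
direction = rng.normal(size=16)
activations = rng.normal(size=(200, 16)) + np.outer(labels, direction)

# Linear probe: least-squares fit (with an intercept column) from
# activations to centered labels, then threshold at zero.
X = np.hstack([activations, np.ones((200, 1))])
w, *_ = np.linalg.lstsq(X, labels - 0.5, rcond=None)
accuracy = ((X @ w > 0).astype(int) == labels).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

High probe accuracy suggests the property is (linearly) present in the representation; low accuracy is weaker evidence of absence. Real interpretability work is far subtler, but this is the basic move.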
All speculation on my part, vaporthought. Still, the goal would be to support the automatic discovery of deeper representations during unsupervised training, perhaps even models as deep as SHRDLU or deeper.
Small amendment on the training for anagrams etc. (100 examples in the training set, 10,000 in the test set): https://twitter.com/nottombrown/status/1284965736494981120
A comment on my thread above: a few people have been kind enough to say this thread is a "useful take" (or, sometimes, a "not-so-useful take") on GPT-3. I appreciate the intentions and the generous words.
But with the caveat that with any complex generative system you can't really understand much at all after a few hours of playing around. Imagine you picked up a guitar for 3 hours and tried to learn to play. At the end you might well put it down and go "lame, the music was no good".