the text part is blind and the vision part is pre-verbal. The entire effect rests on black box correlations between text and images… labeled images where the labels live in text space and the images live in image space
Hah, I guessed right. This thing is trying to learn language literacy in a way humans don’t even try to. We get literate by being taught the alphabet and the mapping from sounds to visual marks (for non-Chinese scripts), and by learning the “formula” to decode any string we recognize as one.
It’s trying to learn “to read” through brute-force statistics over billions of images of text
I think there may be a fundamental problem here… how do you learn that input patterns in one kind of model map to tokens in another? The word “cat” is a label for the category of cats in images, but also the name of the set of visual renderings of the word itself…
There’s some kind of subtle levels/indirection thing going on here. I recall Dennett, I think, talking about it in some book. The name of a thing is itself a nameable thing. Cat is the name of the category containing 🐱🐈, but the name of the word ‘cat’ is actually grbxl (say)…
The prompt ‘A man with a cat holding a sign saying “cat”’ should be rendered ‘A man with a cat holding a grbxl sign’
But then you can get infinite regress… or pointer chasing
At some point the system also needs to learn that grbxl is the name of the word for cat
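A minimal sketch of the indirection, reusing the made-up name grbxl and adding a second hypothetical one (zorp): every name is itself a nameable thing, so resolving a name is pointer chasing that can in principle regress indefinitely.

```python
# Toy use/mention ladder (all names here are hypothetical). Each entry maps a
# name to what it denotes; a denoted thing can itself be a name, which is what
# makes this pointer chasing.
lexicon = {
    "cat": "<the category containing actual cats>",  # "cat" names the category
    "grbxl": "cat",                                  # "grbxl" names the word "cat"
    "zorp": "grbxl",                                 # "zorp" names the word "grbxl", and so on
}

def resolve(name, lexicon, max_hops=10):
    """Chase name -> referent links until hitting something that isn't a name."""
    hops = 0
    while name in lexicon and hops < max_hops:
        name = lexicon[name]
        hops += 1
    return name

print(resolve("zorp", lexicon))   # -> "<the category containing actual cats>"
print(resolve("grbxl", lexicon))  # -> same destination, one hop fewer
```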
In the current paradigm it will never learn this… it will just learn that the world contains lots of images with a highly correlated presence of cats and grbxls… but grbxls might as well be any other thing cats are often found correlated with, like food saucers
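A toy illustration of that failure mode, with co-occurrence counts invented purely for the example: nothing in the statistics marks the rendered word as a label for the category rather than just another correlated object.

```python
# Hypothetical co-occurrence counts over a pile of captioned cat images.
# From pure correlation, the rendered word "cat" (a grbxl) and a food saucer
# are indistinguishable in kind: both are just things that show up near cats.
cooccurrence_with_cat_images = {
    "rendered word 'cat' (a grbxl)": 9_200,
    "food saucer": 9_800,
    "ball of yarn": 7_500,
}

for thing, count in sorted(cooccurrence_with_cat_images.items(),
                           key=lambda kv: -kv[1]):
    print(f"{thing}: appears alongside cat images {count:,} times")
```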
Deep learning gonna have issues
These are either Gödel-grade objections, or AIs will evolve without ever needing to make maps or distinguish them from territories.
Maybe sufficiently powerful statistics is indistinguishable from signification
Maybe signification is a hack for meatbags that AIs don’t need
tldr: AIs haven’t learned (in any sense) that the visual form of the written word “cat” corresponds to the token “cat” in language models, but HAVE learned that the cat token corresponds to pictures of cats themselves
Not a showstopper but not a trivial issue either
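A hedged sketch of how one might probe this claim with an off-the-shelf text+image model (CLIP via Hugging Face transformers). The model choice, the cat-photo filename, and the prompt wordings are placeholder assumptions; the snippet runs the comparison but doesn’t presume the outcome.

```python
# Probe: does the model match the *rendered word* "cat" to the token "cat"
# the way it matches an actual cat photo? (Sketch only; results vary by model.)
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Image 1: the word "cat" rendered as pixels (the grbxl object).
word_img = Image.new("RGB", (224, 224), "white")
ImageDraw.Draw(word_img).text((90, 105), "cat", fill="black")

# Image 2: an actual photo of a cat (hypothetical local file).
cat_photo = Image.open("photo_of_a_cat.jpg")

texts = ["a photo of a cat", "the word cat written on a sign"]
inputs = processor(text=texts, images=[word_img, cat_photo],
                   return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image  # shape: (2 images, 2 texts)
print(logits)  # row 0 = rendered word, row 1 = cat photo
```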
Speech recognition might have an important bridging role to play here somehow 🤔
The map between text-token space and a continuous sensor field is more direct there, and recorded sound data is dominated by verbal sounds
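A minimal sketch of that more direct map, in the spirit of CTC-style decoding: the continuous signal is sliced into frames, each frame gets a character (or blank) guess, and the token string falls out by collapsing repeats and dropping blanks. The per-frame labels below are invented to keep it self-contained.

```python
# Greedy CTC-style collapse: per-frame character guesses -> token string.
# The frame labels are made up; a real recognizer would produce them from audio.
BLANK = "_"

def ctc_collapse(frame_labels):
    """Merge repeated frame labels, then drop blanks."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return "".join(out)

# Twelve audio frames' worth of (hypothetical) per-frame argmax predictions:
frames = ["_", "c", "c", "_", "a", "a", "a", "_", "t", "t", "_", "_"]
print(ctc_collapse(frames))  # -> "cat"
```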
If the goal is to reproduce human thinking (which may be a bad goal) I don’t think multimodal learning is enough. You need a dose of GOFAI because human symbolic representation systems are… symbolic.
But otoh, it’s worth asking what’s a natural “native” language for AIs?
Quote Tweet
Replying to @vgr
The future is multimodal models. To me, generative image models are more akin to a visual perceptual module in human brains. To approx human cognition, we need to connect modules just as our brains do.
Work on past-tense verbs from the 80's I'm studying: stanford.edu/~jlmcc/papers/
If AIs made up their own symbolic language from raw sensor data, what would it be like? Would it even need concepts like “cat” encoded in an arbitrary 26-character string generation scheme?
No!
Making jokes in human cultural media like cartoons is an unnatural problem for AIs
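Returning to the question above about a made-up symbolic language: a toy sketch of what a “native” symbol scheme might look like is vector-quantization-style codebook indices assigned to raw sensor vectors, owing nothing to a 26-letter alphabet. The codebook and sensor readings here are random placeholders.

```python
# Toy vector quantization: raw sensor vectors map to the nearest entry in a
# codebook, and the resulting indices act as "native symbols" -- discrete and
# arbitrary, with no connection to human orthography. All data is random filler.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))        # 16 candidate "symbols" in an 8-dim sensor space
sensor_readings = rng.normal(size=(5, 8))  # 5 raw sensor vectors

def quantize(x, codebook):
    """Return the index of the nearest codebook vector for each input row."""
    dists = np.linalg.norm(x[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

symbols = quantize(sensor_readings, codebook)
print(symbols)  # five codebook indices: a "sentence" in the AI's own code
```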
Having AIs learn human language at all is a transient hack priority. Like humans learning whale squeak language.
And trying to come up with cartoon squeaks whales would find funny.
Humans use text representation schemes because we have specific limits/constraints
If you gave an AI a rich enough embodiment with raw sensor info and showed it birds and airplanes, it might figure out flight but bypass all the human representation schemes we think must be encountered along the way: symbolic language, math, physics theories, etc.
We can no more teach AIs to fly than we can teach birds to fly. Birds sort of taught us how to fly, but only by existing as a demonstration, not by conveying their evolutionary learning history.
This makes me think language models are a giant yak shave: valuable right now but irrelevant in the long term. If AIs need language they’ll invent their own. If they need math they’ll discover their own.
Learning the human versions of important signification maps will be a temporary crutch, but it won’t speed up their own evolution of comparable native structural phenomena.
It’s not clear to me they ever will though. Faking shallow versions of the human thing might be it.
They have a lot of useful power but it may not be relevant to the capabilities we are considering here. Horses are much more powerful than humans in muscle terms but Clever Hans learning to fake counting didn’t lead to a species of genius horses
We are simultaneously overestimating and underestimating what AIs can do by projecting the arbitrary biases of our own embodiments onto a radically different substrate.
What humans have learned over a million years of evolution with a ~3 lb brain is simply not that relevant here
AIs are not “faster humans” any more than cars are “faster horses”
Data from the evolutionary history of horses is not relevant to cars. The only overlap is in the term “horsepower” and roughly similar sizes of cars and carts due to early substitution effects.
This week’s developments lead me to strongly conclude robotics models (SayCan) >> image models >> text+image models > text models
DALL-E 2 etc. are text++ models that use a subset of image-model power to do a sort of interesting parlor trick.
I’d like to see a language-like thing evolve natively from image models somehow, without much reliance on human language
AND/OR
humanoid robots that share enough embodiment structure that the dependence is more meaningful
Asimovian robots nao


