Conversation

I’m already bored of the frustratingly vague but visually rich DALL-E 2 stuff… want more precise, impoverished stuff. Can someone with access try to generate comics in xkcd style? Like “joke about sidewalk scooters in xkcd style.” Or anything with very few bits.
Notice something? The text is broken in all 3. The text-to-image system uses a text model to drive an image model, but it didn’t recognize that the image model’s training data includes images containing text… so it treated the cartoon text as just another image. I think we broke this.
This reminds me of something I learned recently (claim made by a psychologist character on a TV show, but sounded true) — you can’t read text while dreaming because your text processing part is not active while dreaming.
If anyone wants to try a simpler version of this challenge, prompt the system with ‘A man holding a sign saying “hello world”.’ This might work because signs are a simpler mixed image-text concept than speech bubbles that form a joke. But I suspect you’ll get an unreadable sign.
Basically, the image model didn’t learn text, and the text model didn’t learn images. This is two blind men and an elephant.
Remember, text models are trained on input that’s already encoded as digital text strings. They are not literate and can’t read in an OCR way. This thing only has oral literacy. It hasn’t realized that certain visual patterns correspond to text it already knows in encoded form.
The text part is blind and the vision part is pre-verbal. The entire effect rests on black-box correlations between text and images: labeled images where the labels live in text space and the images live in image space.
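To make the “black-box correlations” point concrete, here is a minimal sketch of the CLIP-style contrastive setup such systems broadly rest on. The encoders and data below are throwaway stubs, not anyone’s actual model; the point is that the only bridge between text space and image space is a similarity score over paired examples.

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(caption: str) -> np.ndarray:
    # Stand-in for the text side: hash words into a fixed-size vector.
    # A real model learns this embedding; the stub just needs to live in "text space".
    vec = np.zeros(64)
    for word in caption.lower().split():
        vec[sum(ord(c) for c in word) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def image_encoder(pixels: np.ndarray) -> np.ndarray:
    # Stand-in for the vision side: it only sees pixel statistics and has no
    # idea that some pixel patterns depict letters.
    flat = pixels.reshape(-1)[:64].astype(float)
    return flat / (np.linalg.norm(flat) + 1e-8)

def contrastive_loss(text_vecs: np.ndarray, image_vecs: np.ndarray) -> float:
    # CLIP-style objective: matched (caption, image) pairs should score higher
    # than mismatched ones. This correlation is the entire bridge between the
    # two spaces; no reading of rendered text is ever involved.
    logits = text_vecs @ image_vecs.T
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())

# Toy "dataset": captions paired with random pixel blobs standing in for photos.
captions = ["a cat on a couch", "a dog in a park", "a sign that says cat"]
images = [rng.random((8, 8, 3)) for _ in captions]

T = np.stack([text_encoder(c) for c in captions])
IMG = np.stack([image_encoder(im) for im in images])
print("contrastive loss on toy batch:", round(contrastive_loss(T, IMG), 3))
```

Swap in real encoders and real data and you get the correlation machine described above; nothing in the objective marks rendered words as special.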
Hah, I guessed right. This thing is trying to learn language literacy in a way humans don’t even try to. We get literate by being taught the alphabet and the mapping from sounds to visual images (for non-Chinese scripts), and by learning the “formula” to decode anything recognized as a string.
Quote Tweet (replying to @vgr): “from the paper” [image]
I think there may be a fundamental problem here… how do you learn that input patterns in one kind of model map to tokens in another? The word “cat” is a label for the category of cats in images, but also the name of the set of visual renderings of the word itself…
There’s some kind of subtle levels/indirection thing going on here. I recall Dennett, I think, talking about it in some book. The name of a thing is itself a nameable thing. Cat is the name of the category containing 🐱🐈, but the name of the word ‘cat’ is actually grbxl (say)…
The prompt ‘A man with a cat holding a sign saying “cat”’ should be rendered ‘A man with a cat holding a grbxl sign.’ But then you can get infinite regress… or pointer chasing. At some point the system also needs to learn that grbxl is the name of the word for cat.
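The levels/indirection point can be made concrete with a toy naming table. ‘grbxl’ is the thread’s made-up name for the word “cat”; ‘qmmpf’ is an equally made-up name one level further up. Chasing name-of pointers gives exactly the regress described:

```python
# Toy illustration of the naming regress. 'grbxl' and 'qmmpf' are made-up
# names, exactly as in the thread; nothing here is a real model.

CAT_CATEGORY = frozenset({"🐱", "🐈"})  # the category of cats (the territory)

name_of = {
    CAT_CATEGORY: "cat",   # "cat" names the category of cats
    "cat": "grbxl",        # "grbxl" names the written word "cat"
    "grbxl": "qmmpf",      # names are themselves nameable, so the chain continues
}

def chase(thing, max_depth=4):
    """Follow name-of pointers until the table runs out or the depth limit hits."""
    chain = [thing]
    while len(chain) <= max_depth and chain[-1] in name_of:
        chain.append(name_of[chain[-1]])
    return chain

print(chase(CAT_CATEGORY))
# -> [frozenset({'🐱', '🐈'}), 'cat', 'grbxl', 'qmmpf']
# The prompt 'a sign saying "cat"' asks for a rendering one level up this
# chain (the word), not the category itself. That's the pointer chasing.
```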
In the current paradigm it will never learn this… it will just learn that the world contains lots of images with highly correlated presence of cats and grbxls… but grbxls might as well be any other thing cats are often found correlated with, like food saucers.
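Here is the same worry as a toy statistic, with entirely hypothetical image tags: a co-occurrence count cannot tell whether “grbxl” (the rendered word) names the cat or is just something that tends to appear near cats, the way a food saucer does.

```python
from collections import Counter
from itertools import combinations

# Hypothetical image tag sets: what a labeler (or a weak detector) reports
# as present in each image. "grbxl" stands for the rendered word "cat".
images = [
    {"cat", "grbxl"},            # a cat photo with the word "cat" in frame
    {"cat", "saucer"},           # a cat next to its food saucer
    {"cat", "grbxl", "saucer"},  # all three together
    {"dog", "park"},
    {"cat", "couch"},
]

pair_counts = Counter()
for tags in images:
    for pair in combinations(sorted(tags), 2):
        pair_counts[pair] += 1

print(pair_counts[("cat", "grbxl")])   # 2
print(pair_counts[("cat", "saucer")])  # 2
# To a purely correlational learner these two numbers are the same kind of
# fact. Nothing in the counts marks ("cat", "grbxl") as a naming relation
# rather than an ordinary co-occurrence.
```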
These are either Gödel-grade objections, or AIs will evolve without ever needing to make maps or distinguish them from territories. Maybe sufficiently powerful statistics is indistinguishable from signification. Maybe signification is a hack for meatbags that AIs don’t need.
tldr: AIs haven’t learned (in any sense) that the visual form of the written word “cat” corresponds to the token “cat” in language models, but HAVE learned that the cat token corresponds to pictures of cats themselves. Not a showstopper, but not a trivial issue either.
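One way you could actually probe the tl;dr, assuming OpenAI’s open-source clip package is installed (the clip.load / clip.tokenize / encode_image / encode_text calls are that package’s API; the rendered-word canvas and the captions are made up for the probe): render the written word “cat” to pixels and see how the image encoder scores it against caption embeddings.

```python
import torch
import clip  # OpenAI's open-source CLIP package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image, ImageDraw

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Render the written word "cat" onto a blank canvas: the "visual form" of the token.
canvas = Image.new("RGB", (224, 224), "white")
ImageDraw.Draw(canvas).text((80, 100), "cat", fill="black")
image_input = preprocess(canvas).unsqueeze(0).to(device)

captions = [
    "a photo of a cat",
    "a photo of a dog",
    "the word cat written on a white background",
]
text_input = clip.tokenize(captions).to(device)

with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_input)

# Cosine similarity between the rendered word and each caption embedding.
sims = torch.nn.functional.cosine_similarity(image_features, text_features)
for caption, sim in zip(captions, sims.tolist()):
    print(f"{sim:.3f}  {caption}")
```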
Speech recognition might have an important bridging role to play here somehow 🤔 The map between text-token space and a continuous sensor field is more direct there, and recorded sound data is dominated by verbal sounds.
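A sketch of what “more direct” means here, with entirely made-up data structures: a captioned image ties a whole string to a whole pixel array, while forced-aligned speech ties each token to the exact stretch of the continuous sensor field that is that token.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CaptionedImage:
    pixels: List[List[float]]  # the whole image
    caption: str               # the whole caption; no word-to-region alignment

@dataclass
class TranscribedAudio:
    samples: List[float]                         # the continuous sensor field
    # Forced alignment: each token comes with the span of samples that *is*
    # that token being spoken, a direct map from token to sensor data.
    aligned_tokens: List[Tuple[str, int, int]]   # (token, start_sample, end_sample)

clip_style_example = CaptionedImage(
    pixels=[[0.0] * 8 for _ in range(8)],
    caption="a man holding a sign saying hello world",
)

asr_style_example = TranscribedAudio(
    samples=[0.0] * 16000,
    aligned_tokens=[("hello", 0, 7000), ("world", 7200, 15800)],
)

# The image example only says "these pixels and this string go together".
# The audio example says "samples 0..7000 ARE the token 'hello'", which is
# closer to the kind of grounding the thread says is missing for text
# rendered inside images.
print(len(asr_style_example.aligned_tokens), "directly grounded tokens")
```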
If the goal is to reproduce human thinking (which may be a bad goal) I don’t think multimodal learning is enough. You need a dose of GOFAI because human symbolic representation systems are… symbolic. But otoh, it’s worth asking what’s a natural “native” language for AIs?
Quote Tweet (replying to @vgr): “The future is multimodal models. To me, generative image models are more akin to a visual perceptual module in human brains. To approx human cognition, we need to connect modules just as our brains do. Work on past-tense verbs from the 80's I'm studying: stanford.edu/~jlmcc/papers/”
If AIs made up their own symbolic language from raw sensor data, what would it be like? Would it even need concepts like “cat” encoded in an arbitrary 26-character string generation scheme? No! Making jokes in human cultural media like cartoons is an unnatural problem for AIs.
Having AIs learn human language at all is a transient hack priority. Like humans learning whale squeak language. And trying to come up with cartoon squeaks whales would find funny. Humans use text representation schemes because we have specific limits/constraints
If you gave an AI a rich enough embodiment with raw sensor info and showed it birds and airplanes, it might figure out flight but bypass all the human representation schemes we think must be encountered along the way: symbolic language and math, physics theories, etc.
We can no more teach AIs to fly than we can teach birds to fly. Birds sort of taught us how to fly but by simply existing as a demonstration, not by conveying their evolutionary learning history.
This makes me think language models are a giant yak shave: valuable right now, but irrelevant in the long term. If AIs need language they’ll invent their own. If they need math they’ll discover their own.
Learning the human versions of important signification maps will be a temporary crutch, but it won’t speed up their own actual evolution of comparable native structural phenomena. It’s not clear to me they ever will evolve those, though. Faking shallow versions of the human thing might be it.
We are simultaneously overestimating and underestimating what AIs can do by projecting the arbitrary biases of our own embodiments onto a radically different substrate. What humans have learned over 1 million years of evolution with a 2lb brain is simply not that relevant here
AIs are not “faster humans” any more than cars are “faster horses.” Data from the evolutionary history of horses is not relevant to cars. The only overlap is in the term “horsepower” and the roughly similar sizes of cars and carts, due to early substitution effects.
This week’s developments lead me to strongly conclude: robotics models (SayCan) >> image models >> text+image models > text models. DALL-E 2 etc. are text++ models that use a subset of image-model power to do a sort of interesting parlor trick.
I’d like to see a language-like thing evolve natively from image models somehow, without much reliance on human language, AND/OR humanoid robots that share enough embodiment structure that the dependence is more meaningful. Asimovian robots nao.