Notice something? The text is unreadable in all 3. The text-to-image system uses a text model to generate an image from an image model, but it didn’t recognize that the image model includes images containing text… so it treated the cartoon text as an image. I think we broke this.
This reminds me of something I learned recently (a claim made by a psychologist character on a TV show, but it sounded true): you can’t read text while dreaming because the part of your brain that processes text is not active while you dream.
If anyone wants to try a simpler version of this challenge, prompt the system with ‘A man holding a sign saying “hello world”’
This might work because signs are a simpler mixed image-text concept than speech bubbles that form a joke.
But I suspect you’ll get an unreadable sign.