With multiple foundation models "talking to each other", we can combine commonsense knowledge across domains to perform multimodal tasks like zero-shot video Q&A or image captioning, with no finetuning needed.
Socratic Models:
website + code: socraticmodels.github.io
paper: arxiv.org/abs/2204.00598
From recalling events, to contextual and temporal reasoning – prompting foundation models to engage in guided Socratic discussions enables a variety of new open-ended video Q&A capabilities.
One way to approach video understanding is to turn it into a reading comprehension problem. This turns a classically hard computer vision task into something that we know large language models are good at.
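The "reading comprehension" framing can be sketched in a few lines: a VLM turns sampled frames into a text log of events, and an LM answers questions over that log. The model calls below are hypothetical placeholders, not the paper's actual API; a real system would call a captioning VLM and a prompted LM here.

```python
# Sketch, assuming stand-in model functions: a VLM describes frames
# into a language-based "story", and an LM is prompted to answer
# questions over that story (reading comprehension).

def vlm_describe(frame_idx):
    # Placeholder for a visual-language model's frame description.
    return f"Frame {frame_idx}: a person picks up a cup."

def lm_answer(context, question):
    # Placeholder for a large language model call; we return the
    # assembled prompt so the structure is visible.
    prompt = f"{context}\nQ: {question}\nA:"
    return prompt

# Build a language world-state history from sampled frames,
# then pose the question as reading comprehension over text.
frames = [0, 30, 60]  # sampled frame indices
log = "\n".join(vlm_describe(f) for f in frames)
answer_prompt = lm_answer(log, "What did the person do?")
print(answer_prompt)
```

The key point is that once the video is text, the vision problem is gone: any sufficiently capable LM can operate on the log.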
A couple more examples – here’s zero-shot image captioning, with the large language model (LM) and visual-language model (VLM) working together. Code is already open-source for this one: colab.research.google.com/drive/1KOlc9nN
And here’s video-to-text retrieval. The Socratic Models framework makes it easy to add new modalities (like speech from audio). In this case we set a new zero-shot SoTA, nearing the best finetuned methods.
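One way to see why adding modalities is easy: candidate captions can be ranked by summing per-modality similarity scores, so a new modality is just one more score term. The scoring functions below are hypothetical word-overlap stand-ins for real VLM (image-text) and ALM (speech-text) scores.

```python
# Sketch, assuming placeholder scorers: rank candidate captions for a
# video by combining visual and audio similarity scores.

def vlm_score(video, caption):
    # Stand-in for a visual-language model score (e.g. frame-caption
    # similarity); here, simple word overlap with a visual description.
    return len(set(video["visual"].split()) & set(caption.split()))

def alm_score(video, caption):
    # Stand-in for an audio-language model score (e.g. overlap with
    # a speech transcript).
    return len(set(video["speech"].split()) & set(caption.split()))

def retrieve(video, candidates):
    # Adding a new modality = adding another score term to the sum.
    return max(candidates,
               key=lambda c: vlm_score(video, c) + alm_score(video, c))

video = {"visual": "a dog catches a frisbee in a park",
         "speech": "good catch buddy"}
candidates = ["a dog catches a frisbee",
              "a cat sleeps on a couch"]
print(retrieve(video, candidates))  # -> "a dog catches a frisbee"
```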
In general, we’re excited about Socratic Models – they offer a new way to tackle multimodal applications with the foundation models we already have today, without additional finetuning or data collection.
This came out of an amazing collaboration between the Robotics and AR teams at Google w/ Vincent Vanhoucke
Is there a performance benefit to making the models repeatedly translate to and from English (compared to them communicating in embedding-ese) or is this done for interpretability/some other reason?