Conversation

One way to approach video understanding is to turn it into a reading comprehension problem. This turns a classically hard computer vision task into something that we know large language models are good at.
2
94
And here’s video-to-text retrieval. The Socratic Models framework makes it easy to add together new modalities (like speech from audio). In this case we can provide a new zero-shot SoTA, nearing the best finetuned methods.
1
42
In general, we’re excited about Socratic Models – they present new ways to think about how we can tackle new multimodal applications with the existing foundation models that we already have today, without additional finetuning or data collection.
Image
2
53
Replying to
Is there a performance benefit to making the models repeatedly translate to and from English (compared to them communicating in embedding-ese) or is this done for interpretability/some other reason?
2
12