
Turns out all you need is an ensemble of "foundation models" to do some cool zero-shot video VQA and captioning! Cool work and team!
Quote Tweet
With multiple foundation models “talking to each other”, we can combine commonsense knowledge across domains to do multimodal tasks like zero-shot video Q&A or image captioning, with no finetuning needed. Socratic Models: website + code: socraticmodels.github.io paper: arxiv.org/abs/2204.00598
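A minimal sketch of the pattern the quote tweet describes: a vision-language model (VLM) renders an image into words, and a language model (LM) reasons over those words, so the models "talk" purely through text, with no finetuning. Both models here are hypothetical stand-ins (the paper uses models such as CLIP and GPT-3); the similarity scores are stubbed for illustration.

```python
def vlm_top_entities(image_scores, vocabulary, k=3):
    """Stand-in for a VLM like CLIP: rank vocabulary words by their
    image-text similarity and return the top-k matches.
    `image_scores` is a stubbed {word: similarity} dict standing in
    for real VLM scoring of an image."""
    ranked = sorted(vocabulary,
                    key=lambda w: image_scores.get(w, 0.0),
                    reverse=True)
    return ranked[:k]

def lm_caption_prompt(entities):
    """Stand-in for the LM side: pack the detected entities into a
    text prompt. A real system would send this to a generative LM
    and use its completion as the caption."""
    return f"I see {', '.join(entities)}. A short caption for this photo:"

# Toy "image": precomputed similarity scores per vocabulary word.
image_scores = {"dog": 0.91, "frisbee": 0.84, "park": 0.77, "car": 0.12}
vocabulary = ["dog", "car", "frisbee", "park", "boat"]

entities = vlm_top_entities(image_scores, vocabulary)
print(lm_caption_prompt(entities))
# → I see dog, frisbee, park. A short caption for this photo:
```

The key design point is that the only interface between the models is natural language, which is why no joint training or finetuning is needed.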
[Embedded video, 0:31]