Turns out that ensembled "foundation models" are all you need for some cool zero-shot video VQA and captioning! Cool work and cool team!
Quoted tweet:
With multiple foundation models "talking to each other," we can combine commonsense across domains to do multimodal tasks like zero-shot video Q&A or image captioning, no finetuning needed.
Socratic Models:
website + code: socraticmodels.github.io
paper: arxiv.org/abs/2204.00598
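
In case the "talking to each other" part sounds abstract: concretely, a vision-language model ranks candidate text descriptions against an image, and the winning descriptions get spliced into a language model's prompt as plain text. Below is a minimal sketch of that handoff, assuming OpenAI's CLIP package; the tiny `vocab`, the prompt wording, and the `lm_complete` stub are illustrative placeholders, not the released Socratic Models code (see socraticmodels.github.io for that).

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical mini-vocabulary of entities; the real system draws on much
# larger category sets.
vocab = ["a dog", "a beach", "a skateboard", "a kitchen", "a sunset"]

def top_entities(image_path, k=3):
    """Rank the vocabulary by CLIP image-text similarity (zero-shot)."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(vocab).to(device)
    with torch.no_grad():
        img = model.encode_image(image)
        txt = model.encode_text(text)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    sims = (img @ txt.T).squeeze(0)
    return [vocab[i] for i in sims.topk(k).indices.tolist()]

def lm_complete(prompt):
    """Placeholder for a large language model call (e.g., GPT-3);
    swap in whatever text-completion API you have access to."""
    raise NotImplementedError

def socratic_caption(image_path):
    # The VLM's output becomes plain text inside the LM's prompt --
    # this language handoff is the models "talking to each other".
    entities = ", ".join(top_entities(image_path))
    prompt = (f"I am an intelligent image captioning bot. "
              f"This image seems to contain: {entities}. "
              f"A short caption for this image:")
    return lm_complete(prompt)
```

The key design point is that language is the interface: no gradients flow between the models, so any VLM and any LM can be composed off the shelf, which is why no finetuning is needed.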

