
C-BeT can take a continuous play demo with multi-modal actions, without special directions or labels, and learn a multi-modal 📷 or 🎞️-conditioned policy that solves long-horizon tasks from image observations! E.g. all these tasks on our play kitchen are learned from a single 4.5-hour-long demo.
Our method is similar to "prompting" practices for GPT — we train transformers that predict multi-modal distributions over actions given sequences of current + future frames. During eval we condition on current + desired env frames, and it just works — 0 finetuning required 🔥
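A minimal sketch of the idea, with numpy standing in for the real model (all names and details here are assumptions, not the paper's code): demo actions are discretized into k bins via k-means, and the policy predicts a categorical distribution over bins (plus a small continuous offset) conditioned on current + goal frames. Sampling a bin, rather than taking an argmax, is what keeps multiple behavior modes alive.

```python
import numpy as np

def discretize_actions(actions, k, iters=20):
    """Toy k-means over demo actions -> k bin centers (the action vocabulary)."""
    # deterministic init: spread initial centers across the sorted actions
    order = np.argsort(actions[:, 0])
    centers = actions[order[np.linspace(0, len(actions) - 1, k).astype(int)]].copy()
    for _ in range(iters):
        # assign each demo action to its nearest center, then recompute means
        d = np.linalg.norm(actions[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = actions[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def sample_action(bin_logits, offsets, centers, rng):
    """Sample an action bin, then add that bin's predicted continuous offset.

    Sampling (not argmax) preserves multi-modality: demos that went left
    and demos that went right remain two distinct, reachable modes."""
    probs = np.exp(bin_logits - bin_logits.max())
    probs /= probs.sum()
    j = rng.choice(len(centers), p=probs)
    return centers[j] + offsets[j]
```

With a bimodal demo set (say, half the demos push left and half push right), the two k-means centers recover the two modes, and a uniform categorical over bins samples both instead of averaging them into an action no demo ever took.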
Pre-trained BYOL features for our visual observations mean the behavior model takes only 4 hours to train. And we don't have to worry much about random visual obstructions, as you can see here: 🪴🐵🧊
Very nice work! One challenge in learning from play data that we saw in LfP is the biased + highly multi-modal action distributions. Great to see that C-BeT is robust enough to model these multi-modal distributions and still generalize!