Excited to share our paper, where we propose E3B, a new algorithm for exploration in environments that vary from episode to episode.
Paper: arxiv.org/abs/2210.05805
Website: e3bagent.github.io
E3B sets a new SOTA on both MiniHack and reward-free exploration in Habitat.
A thread [1/N]
Exploration in standard MDPs is well studied, but what about contextual MDPs (CMDPs), where the environment changes each episode? This general framework captures scenarios such as procedurally generated video games or embodied AI tasks where the agent must generalize across physical spaces. [2/N]
While exploration in CMDPs has recently started receiving attention, we show that existing methods critically rely on an episodic count-based bonus, and fail if this bonus is removed. This also means they fail in complex envs where each state is seen at most once. [3/N]
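To make the failure mode concrete, here is a minimal sketch of an episodic count-based bonus. It assumes the common 1/sqrt(N) form over hashable states, which is an illustrative choice and not necessarily what each existing method uses; the class name `EpisodicCountBonus` is hypothetical.

```python
import math
from collections import Counter


class EpisodicCountBonus:
    """Illustrative episodic bonus: 1 / sqrt(visits to this state in the current episode)."""

    def __init__(self):
        self.counts = Counter()

    def reset(self):
        # Called at the start of every episode: counts do not carry over.
        self.counts.clear()

    def bonus(self, state):
        self.counts[state] += 1
        return 1.0 / math.sqrt(self.counts[state])


counter = EpisodicCountBonus()

# Small discrete env: revisiting "a" is penalized with a shrinking bonus.
revisits = [counter.bonus(s) for s in ["a", "b", "a", "a"]]

# Complex env: every state is seen at most once, so the bonus is constant.
counter.reset()
unique = [counter.bonus(s) for s in ["s0", "s1", "s2", "s3"]]
```

In the second case every bonus equals 1.0, so the count carries no information about which states are actually novel.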
An alternative idea could be to count handcrafted features extracted from states, but this relies heavily on prior knowledge. We show that while this can be effective in some cases, it is difficult to design a feature extractor which works well across many tasks. [4/N]
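A small sketch of what counting handcrafted features looks like, and how the choice of extractor bakes in prior knowledge. The state layout (`agent_xy`, `inventory`) and the function names are hypothetical and only serve the illustration.

```python
import math
from collections import Counter


def position_feature(state):
    # Hypothetical handcrafted extractor: keep only the agent's (x, y) position.
    return state["agent_xy"]


def feature_count_bonuses(states, extract):
    """Count extracted features instead of raw states within one episode."""
    counts = Counter()
    bonuses = []
    for state in states:
        key = extract(state)
        counts[key] += 1
        bonuses.append(1.0 / math.sqrt(counts[key]))
    return bonuses


episode = [
    {"agent_xy": (0, 0), "inventory": ()},
    {"agent_xy": (0, 1), "inventory": ()},
    {"agent_xy": (0, 1), "inventory": ("key",)},  # picked up a key without moving
]
bonuses = feature_count_bonuses(episode, position_feature)
```

Here the third state is genuinely new (the inventory changed), but the position-only extractor treats it as a revisit and lowers its bonus. An extractor that suits a navigation task can be a poor fit for a task about object interaction.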
To address this limitation, we propose Exploration via Elliptical Episodic Bonuses (E3B). E3B uses an elliptical episodic bonus, which generalizes count-based episodic bonuses to continuous state spaces, paired with a feature extractor learned with an inverse dynamics model. [5/N]
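A rough sketch of an elliptical episodic bonus, under stated assumptions: the bonus is taken to be phi^T C^{-1} phi, where C is a ridge term plus the sum of outer products of the features seen so far in the episode, and C^{-1} is updated with the Sherman-Morrison rank-one formula. The class name, the `ridge` value, and the plain-list feature vectors are illustrative; the learned inverse-dynamics encoder that would produce `phi` is not shown. See the paper for the exact formulation.

```python
class EllipticalEpisodicBonus:
    """Sketch: bonus(phi) = phi^T C^{-1} phi, with C = ridge*I + sum of phi phi^T this episode."""

    def __init__(self, dim, ridge=0.1):
        self.dim = dim
        self.ridge = ridge  # assumed regularizer, not a value from the paper
        self.reset()

    def reset(self):
        # Start of episode: C = ridge * I, so C^{-1} = I / ridge.
        self.cov_inv = [
            [1.0 / self.ridge if i == j else 0.0 for j in range(self.dim)]
            for i in range(self.dim)
        ]

    def bonus(self, phi):
        # u = C^{-1} phi, then bonus = phi^T u.
        u = [sum(self.cov_inv[i][j] * phi[j] for j in range(self.dim))
             for i in range(self.dim)]
        value = sum(phi[i] * u[i] for i in range(self.dim))
        # Sherman-Morrison update: C^{-1} <- C^{-1} - u u^T / (1 + phi^T u).
        for i in range(self.dim):
            for j in range(self.dim):
                self.cov_inv[i][j] -= u[i] * u[j] / (1.0 + value)
        return value


ell = EllipticalEpisodicBonus(dim=2)
first = ell.bonus([1.0, 0.0])    # unseen direction: large bonus
repeat = ell.bonus([1.0, 0.0])   # same direction again: much smaller bonus
novel = ell.bonus([0.0, 1.0])    # orthogonal direction: still large
ell.reset()
after_reset = ell.bonus([1.0, 0.0])  # new episode: novelty is restored
```

With one-hot features this reduces to 1 / (n + ridge), where n is the number of earlier visits to that state, which is the sense in which it generalizes an episodic count. With continuous features, nearby but non-identical states also lower each other's bonus.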
E3B sets a new SOTA on 16 challenging sparse-reward tasks from the MiniHack suite. In particular, it does so without requiring any feature engineering or task-specific prior knowledge. [6/N]
We also evaluate E3B for reward-free exploration on Habitat, which provides photorealistic simulations of real indoor environments. Here, E3B outperforms existing methods by a wide margin. [7/N]
Replying to
Great paper! How does it compare to the Never Give Up paper (NGU, later extended in Agent57)? I feel like they are really similar, and I didn't see a reference to it in the paper.
Replying to
Thanks! We didn't compare to NGU, but others have found that it doesn't work well on procgen envs (openreview.net/pdf?id=j3GK3_x). One conceptual difference is that the elliptical bonus automatically normalizes with respect to feature scale, whereas NGU's KNN-based bonus doesn't, which means a few features could dominate.
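The scale point can be checked numerically with a toy sketch. It assumes the phi^T C^{-1} phi form with a small ridge for the elliptical bonus, and uses a plain Euclidean nearest-neighbor distance as a simplified stand-in for a KNN-based bonus (this is not NGU's actual kernel). All function names and numbers are illustrative.

```python
import math


def elliptical_bonus(history, query, ridge=1e-3):
    """query^T C^{-1} query, with C = ridge*I + sum of outer products over history."""
    dim = len(query)
    cov_inv = [[1.0 / ridge if i == j else 0.0 for j in range(dim)]
               for i in range(dim)]
    for phi in history:
        u = [sum(cov_inv[i][j] * phi[j] for j in range(dim)) for i in range(dim)]
        denom = 1.0 + sum(phi[i] * u[i] for i in range(dim))
        for i in range(dim):
            for j in range(dim):
                cov_inv[i][j] -= u[i] * u[j] / denom  # Sherman-Morrison update
    u = [sum(cov_inv[i][j] * query[j] for j in range(dim)) for i in range(dim)]
    return sum(query[i] * u[i] for i in range(dim))


def nearest_distance(history, query):
    # Simplified KNN-style novelty: Euclidean distance to the closest past feature.
    return min(math.dist(h, query) for h in history)


def rescale(vec, factor=100.0):
    # Blow up the first feature only, e.g. a change of units.
    return [vec[0] * factor] + list(vec[1:])


history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.5]
scaled_history = [rescale(h) for h in history]
scaled_query = rescale(query)

bonus_raw = elliptical_bonus(history, query)
bonus_scaled = elliptical_bonus(scaled_history, scaled_query)
dist_raw = nearest_distance(history, query)
dist_scaled = nearest_distance(scaled_history, scaled_query)
```

The elliptical bonus barely moves under the rescaling (it is invariant up to the ridge term), while the nearest-neighbor distance grows by roughly the scale factor and is now determined almost entirely by the first feature.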