Perhaps our perception is robust to adversarial examples simply because the world has always been adversarial during training. We never get a clean bitmap with a label; we get coarse, disjointed, fragmentary, noisy samples from our retinas.
I am not a fan of embodimentalism, but I suspect there is a substantial difference between living in an image database and living in a dynamic, interactive 3D space, and it's not the noise. Learning systems can ultimately converge only on a model of the data they are given.
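To make the first point concrete, here is a loose analogy and nothing more: the synthetic blobs, the logistic-regression "learner", and the noise_std knob below are all my own inventions, not anything claimed above. It simply contrasts a model trained on clean inputs with one trained on noise-corrupted inputs, then checks both under test-time perturbation, a crude stand-in for "coarse, fragmentary, noisy samples".

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "world": two overlapping Gaussian blobs with binary labels.
n = 2000
X = np.vstack([rng.normal(-1.0, 1.0, (n, 2)), rng.normal(1.0, 1.0, (n, 2))])
y = np.concatenate([np.zeros(n), np.ones(n)])

def train(X, y, noise_std=0.0, steps=2000, lr=0.1):
    """Logistic regression by gradient descent; noise_std > 0 injects
    fresh Gaussian input noise at every step (noise-augmented training)."""
    w, b = np.zeros(2), 0.0
    for _ in range(steps):
        Xn = X + rng.normal(0.0, noise_std, X.shape)
        p = 1.0 / (1.0 + np.exp(-(Xn @ w + b)))
        g = p - y
        w -= lr * (Xn.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def accuracy(w, b, X, y, perturb_std):
    """Accuracy after adding Gaussian perturbation to the test inputs."""
    Xp = X + rng.normal(0.0, perturb_std, X.shape)
    return (((Xp @ w + b) > 0) == y).mean()

w_clean, b_clean = train(X, y, noise_std=0.0)   # trained on "clean bitmaps"
w_noisy, b_noisy = train(X, y, noise_std=1.5)   # trained on noisy samples

for s in (0.0, 1.0, 2.0):
    print(f"perturb_std={s}: "
          f"clean-trained {accuracy(w_clean, b_clean, X, y, s):.3f}, "
          f"noise-trained {accuracy(w_noisy, b_noisy, X, y, s):.3f}")
```

For a linear toy problem like this, noise injection mostly acts as regularization, so the gap is small; the sketch only illustrates the training regime being speculated about, not the second point, which is about the structure of the data itself rather than the noise.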