Being able to fit a curve that approximates the latent space relies critically on two factors:
1. The structure of the latent space itself! (a property of the data, not of your model)
2. The availability of a "sufficiently dense" sampling of the latent manifold, i.e. enough data.
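A minimal sketch of this curve-fitting picture, using a toy 1-D manifold (a sine curve in the plane) and a plain polynomial fit as a stand-in for a deep model; the data, noise level, and polynomial degree are my own illustrative choices, not from the thread:

```python
import numpy as np

rng = np.random.default_rng(0)

# The latent coordinate t parameterizes the manifold; the observed data
# are points (x, y) = (t, sin(t)) plus a little noise.
t_train = rng.uniform(0, 2 * np.pi, size=200)  # a reasonably dense sampling
x_train = t_train
y_train = np.sin(t_train) + 0.05 * rng.normal(size=t_train.shape)

# Stand-in for a deep model: a degree-7 polynomial curve fitted to the samples.
curve = np.poly1d(np.polyfit(x_train, y_train, deg=7))

# Interpolated points on the curve are meaningful because the underlying
# manifold is smooth and the training samples cover it densely.
x_new = np.linspace(0, 2 * np.pi, 50)
print("max interpolation error:", np.max(np.abs(curve(x_new) - np.sin(x_new))))
```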
You *cannot* generalize in this way to a problem where the manifold hypothesis does not apply (i.e. a true discrete problem, like finding prime numbers).
In this case, there is no latent manifold to fit to, which means that your curve (i.e. deep learning model) will simply memorize the data -- interpolated points on the curve will be meaningless. Your model will be a very inefficient hashtable that embeds your discrete space.
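As a hedged illustration of the "inefficient hashtable" point, one could set up an experiment like the following; the dataset, bit encoding, and model sizes are hypothetical choices, and the outcome is stated as an expectation in the comments, not a measured result:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def to_bits(n, width=16):
    # Encode the integer as its binary digits -- there is no smooth manifold here.
    return [(n >> i) & 1 for i in range(width)]

numbers = np.arange(2, 20000)
X = np.array([to_bits(int(n)) for n in numbers])
y = np.array([is_prime(int(n)) for n in numbers])

# Train on the first half of the integers, test on the second half.
split = len(numbers) // 2
model = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=200)
model.fit(X[:split], y[:split])

# Expected (hedged, not a measured result): training accuracy far above test
# accuracy, with test accuracy near the "not prime" base rate -- the model
# behaves like an inefficient lookup table over the integers it has seen.
print("train acc:", model.score(X[:split], y[:split]))
print("test acc :", model.score(X[split:], y[split:]))
```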
The second point -- training data density -- is equally important. You will naturally only be able to train on a very sparse sampling *of the encoding space*, but you need to *densely cover the latent space*.
It's only with a sufficiently dense sampling of the latent manifold that it becomes possible to make sense of new inputs by interpolating between past training inputs without having to leverage additional priors. pic.twitter.com/SmRvEN2NXS
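Continuing the toy sketch from above, here is a hedged comparison of sparse vs. dense sampling of the same manifold; the exact error values will vary with the random seed, but the trend is the point:

```python
import numpy as np

rng = np.random.default_rng(1)
x_eval = np.linspace(0, 2 * np.pi, 200)

def fit_and_test(n_samples):
    # Sample the same sine-shaped manifold with n_samples points, fit the
    # same degree-7 curve, and measure how well interpolation holds up.
    x = rng.uniform(0, 2 * np.pi, size=n_samples)
    y = np.sin(x) + 0.05 * rng.normal(size=n_samples)
    curve = np.poly1d(np.polyfit(x, y, deg=7))
    return float(np.mean((curve(x_eval) - np.sin(x_eval)) ** 2))

for n in (10, 30, 100, 1000):
    print(f"{n:5d} samples -> held-out MSE {fit_and_test(n):.4f}")
```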
The practical implication is that the best way to improve a deep learning model is to get more data or better data (overly noisy / inaccurate data will hurt generalization). A denser coverage of the latent manifold leads to a model that generalizes better.
This is why *data augmentation techniques*, like exposing a model to variations in image brightness or rotation angle, are an extremely effective way to improve test-time performance. Data augmentation is all about densifying your latent space coverage (by leveraging visual priors).
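A sketch of what this looks like in practice, assuming a recent tf.keras release that ships the RandomRotation and RandomBrightness preprocessing layers; the architecture below is an arbitrary placeholder:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Each augmented copy is a new training point that, by construction (our visual
# priors say rotation and brightness don't change the label), still lies on the
# same latent manifold as the original image.
augmentation = keras.Sequential([
    layers.RandomRotation(0.1),    # rotate by up to +/- 10% of a full turn
    layers.RandomBrightness(0.2),  # shift brightness by up to +/- 20%
])

model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    augmentation,                  # only active during training
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.summary()
```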
In conclusion: the only things you'll find in a DL model are what you put into it: the priors encoded in its architecture and the data it was trained on. DL models are not magic. They're big curves that fit their training samples, with some constraints on their structure.
Replying to @fchollet
I saw you were discussing this with Yoshua Bengio at the AGI conference, to which he replied that the missing piece is getting rid of the independence assumption in the latent space by assuming some additional 'modularity' prior. Do you have any comments on this?
Replying to @mlpeschl
Yoshua is right! The more priors you inject, the less data you need to obtain a curve that approximates the latent manifold. Strong & accurate priors enable you to "see" further from the stepping stones (data points) you're given.
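One way to make this concrete (my own toy example, not from the thread): fit the same 8 noisy points once with a generic polynomial and once with the "correct" sinusoidal prior, and compare how well each recovers the underlying manifold:

```python
import numpy as np

rng = np.random.default_rng(2)
x_train = rng.uniform(0, 2 * np.pi, size=8)            # very few stepping stones
y_train = np.sin(x_train) + 0.05 * rng.normal(size=8)
x_eval = np.linspace(0, 2 * np.pi, 200)

# Weak prior: a generic degree-7 polynomial (as many coefficients as points).
poly = np.poly1d(np.polyfit(x_train, y_train, deg=7))

# Strong, accurate prior: assume the data is a scaled/shifted sinusoid and fit
# just two coefficients, y ~ a*sin(x) + b, by least squares.
A = np.column_stack([np.sin(x_train), np.ones_like(x_train)])
(a, b), *_ = np.linalg.lstsq(A, y_train, rcond=None)

print("weak prior   held-out MSE:", np.mean((poly(x_eval) - np.sin(x_eval)) ** 2))
print("strong prior held-out MSE:", np.mean((a * np.sin(x_eval) + b - np.sin(x_eval)) ** 2))
```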
Entire subfields of DL -- architecture design and data augmentation -- are about leveraging new/more priors in this way. And such priors are often about modularity! This is why we use "layers" or "convolutions" in DL instead of an amorphous soup of parameters.
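A small sketch of how such a modularity prior shows up in parameter counts, assuming tf.keras (the layer sizes are arbitrary):

```python
from tensorflow import keras
from tensorflow.keras import layers

# An "amorphous soup of parameters": every output unit looks at every pixel.
dense_model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Flatten(),
    layers.Dense(64),
])

# The convolutional prior: the same small, local pattern detector is reused
# at every spatial position (modularity + translation invariance).
conv_model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(64, 3),
])

print("dense params:", dense_model.count_params())  # 3072*64 + 64 = 196,672
print("conv  params:", conv_model.count_params())   # 64*(3*3*3) + 64 = 1,792
```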
Replying to @fchollet
Makes sense. But I suppose the manifold hypothesis persists regardless of the priors we use? Then end-to-end DL will never truly get us to the 'system 2' type of capabilities. I guess the uncertainty is in whether we can find good priors to get enough OOD generalization?
Replying to @mlpeschl
It's complicated -- it's basically a fundamental question about the structure of information in the universe. DL only works with spaces where the manifold hypothesis applies (regardless of priors). The question is how far it really extends.