A common beginner mistake is to misunderstand the meaning of the term "interpolation" in machine learning.
Let's take a look.


Let's start with a very basic example. Consider MNIST digits. Linearly interpolating between two MNIST samples does not produce an MNIST sample, only blurry images: pixel space is not linearly interpolative for digits! pic.twitter.com/JTjGGqwHjr
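To make this concrete, here's a minimal sketch of pixel-space interpolation, assuming the MNIST loader from keras.datasets; the variable names are just for illustration:

```python
import numpy as np
from tensorflow import keras

# Load two MNIST digits (28x28 grayscale arrays).
(x_train, _), _ = keras.datasets.mnist.load_data()
a = x_train[0].astype("float32")
b = x_train[1].astype("float32")

# Linear interpolation in pixel space: (1 - t) * a + t * b.
# Around t = 0.5 this is just a ghostly overlay of both digits,
# not a plausible digit.
steps = [(1 - t) * a + t * b for t in np.linspace(0.0, 1.0, 8)]
```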
However, if you interpolate between two digits *on the latent manifold of the MNIST data*, the mid-point still lies on the manifold of the data, i.e. it's still a plausible digit.
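By contrast, here's what latent-space interpolation looks like in code. This sketch assumes hypothetical `encoder` and `decoder` networks (e.g. the two halves of a trained autoencoder); they are not defined in the thread:

```python
import numpy as np

def interpolate_on_manifold(encoder, decoder, a, b, t):
    """Interpolate between samples a and b in latent space, then decode."""
    za = encoder.predict(a[None, ...])  # project a onto the latent manifold
    zb = encoder.predict(b[None, ...])  # project b onto the latent manifold
    z = (1 - t) * za + t * zb           # linear step *in latent coordinates*
    return decoder.predict(z)[0]        # map back to pixel space
```

Because the linear step happens in latent coordinates, the decoded mid-point is still a plausible sample.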
Here's a very simple way to visualize what's going on, in the trivial case of a 2D encoding space and a 1D latent manifold. For typical ML problems, the encoding space has millions of dimensions and the latent manifold has 2D-1000D (could be anything really). pic.twitter.com/nxTU6BJCbA
But wait, what does that really mean? What's a "manifold"? What does "latent" mean? How do you learn to interpolate on a latent manifold?
Let's dive deeper. But first: if you want to understand these ideas in-depth in a better format than a Twitter thread, grab your copy of Deep Learning with Python, 2nd edition, and read chapter 5 ("fundamentals of ML"). It covers all of this in detail. https://www.manning.com/books/deep-learning-with-python-second-edition?a_aid=keras&a_bid=76564dff
Ok, so, consider MNIST digits, 28x28 black & white images. You could say the "encoding space" of MNIST has 28 * 28 = 784 dimensions. But does that mean MNIST digits represent "high-dimensional" data?
Not quite! The dimensionality of the encoding space is an entirely artificial measure that only reflects how you choose to encode the data -- it has nothing to do with the intrinsic complexity of the data.
For instance, if you take a set of different numbers between 0 and 1, print them on sheets of paper, then take 2000x2000 RGB pictures of those papers, you end up with a dataset with 12M dimensions. But in reality your data is scalar, i.e. 1-dimensional.
You can always add (and usually remove) encoding dimensions without changing the data, simply by picking a different encoding scheme. That is, until you hit the "intrinsic dimensionality" of the data (past which you would be destroying information).
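A toy numpy illustration of this (the encoding schemes here are made up for the example): the same 1-D data can be given 1, 3, or 1024 encoding dimensions without gaining or losing any information.

```python
import numpy as np

x = np.random.rand(1000)  # intrinsically 1-D data: scalars in [0, 1]

# Encoding A: the scalar itself -- 1 dimension.
enc_a = x[:, None]                              # shape (1000, 1)

# Encoding B: a redundant 3-D encoding. No new information:
# the first coordinate alone recovers everything.
enc_b = np.stack([x, x**2, np.sin(x)], axis=1)  # shape (1000, 3)

# Encoding C: "print it and photograph it" -- rasterize each scalar
# as a bar in a 32x32 image. 1024 dimensions, still the same 1-D data.
enc_c = np.zeros((1000, 32, 32))
for i, v in enumerate(x):
    enc_c[i, :, : int(v * 32)] = 1.0
```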
For MNIST, for instance, very few random grids of 28x28 pixels form a valid digit. Valid digits occupy a tiny, microscopic *subspace* within the encoding space. Like a grain of sand in a stadium. You call it the "latent space". This is broadly true for all forms of data.
Further, the valid digits aren't sprinkled at random within the encoding space. The latent space is *highly structured*. So structured, in fact, that for many problems it is *continuous*.
This just means that for any two samples, you can slowly morph one sample into the other without "stepping out" of the latent space. pic.twitter.com/c4bg4q0O4W
For instance, this is true for digits. This is also true for human faces. For tree leaves. For cats. For dogs. For the sounds of the human voice. It's even true for sufficiently structured discrete symbolic spaces, like human language! (face grid image by the awesome @dribnet) pic.twitter.com/kVgu6CDlHM
The fact that this property (latent space = very small subspace + continuous & structured) applies to so many problems is called the *manifold hypothesis*. This concept is central to understanding the nature of generalization in ML.
The manifold hypothesis posits that for many problems, your data samples lie on a low-dimensional manifold embedded in the original encoding space.
A "manifold" is simply a lower-dimensional subspace of some parent space that is locally similar to a linear (Euclidian) space. (i.e. it is continuous and smooth). Like a curved line within a 3D space.
When you're dealing with data that lies on a manifold, you can use *interpolation* to generalize to samples you've never seen before. You do this by using a *small subset* of the latent space to fit a *curve* that approximately matches the latent space. pic.twitter.com/qkcmPHOaWX
Once you have such a curve, you can walk on it to make sense of *samples you've never seen before* (that are interpolated from samples you have seen). This is how a GAN can generate faces that weren't in the training data, or how an MNIST classifier can recognize new digits.
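Here's that idea in miniature, using the trivial 2D-encoding / 1D-manifold setup from earlier (the sine-shaped manifold is an arbitrary choice for the sketch):

```python
import numpy as np

# Training samples: points from a 1-D latent manifold embedded
# in a 2-D encoding space (here, the curve y = sin(3x)).
t = np.random.rand(20)
samples = np.stack([t, np.sin(3 * t)], axis=1)

# Fit a curve (a cubic polynomial) to that small subset of the manifold.
coeffs = np.polyfit(samples[:, 0], samples[:, 1], deg=3)

# "Walk on the curve": evaluate at x-values never seen during training.
# Each output is an interpolated, plausible point of the manifold.
x_new = np.linspace(0.0, 1.0, 100)
y_new = np.polyval(coeffs, x_new)
```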
If you're in a high-dimensional encoding space, this curve is, of course, a high-dimensional curve. But that's because it needs to deal with the encoding space, not because the problem is intrinsically high-dimensional (as mentioned earlier).
Now, how do you learn such a curve? That's where deep learning comes in.
But by this point this thread is LONG and the Keras team sync starts in 30s, so I refer you to DLwP, chapter 5, for how DL models and gradient descent are an awesome way to achieve generalization via interpolation on the latent manifold. https://www.manning.com/books/deep-learning-with-python-second-edition?a_aid=keras&a_bid=76564dff
I'm back, just wanted to add one important note to conclude the thread: deep learning models are basically big curves, fitted via gradient descent, that approximate the latent manifold of a dataset. The *quality of this approximation* determines how well the model will generalize.
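As a toy version of that statement, here's the same curve-fitting exercise done with a small Keras MLP fitted via gradient descent (the architecture and hyperparameters are arbitrary choices for the sketch):

```python
import numpy as np
from tensorflow import keras

# The same toy manifold as before: y = sin(3x), sampled at random points.
x = np.random.rand(256, 1).astype("float32")
y = np.sin(3 * x)

# A deep learning model is a big, flexible parametric curve...
model = keras.Sequential([
    keras.Input(shape=(1,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),
])

# ...fitted to the data via gradient descent.
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=200, verbose=0)

# Generalization = evaluating the fitted curve at unseen inputs.
preds = model.predict(np.linspace(0.0, 1.0, 100)[:, None].astype("float32"))
```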
The ideal model would literally just encode the latent space -- it would be able to perfectly generalize to *any* new sample. An imperfect model will partially deviate from the latent space, leading to possible errors.
Being able to fit a curve that approximates the latent space relies critically on two factors: 1. The structure of the latent space itself (a property of the data, not of your model). 2. The availability of a "sufficiently dense" sampling of the latent manifold, i.e. enough data.
You *cannot* generalize in this way to a problem where the manifold hypothesis does not apply (i.e. a true discrete problem, like finding prime numbers).
In this case, there is no latent manifold to fit, which means that your curve (i.e. deep learning model) will simply memorize the data -- interpolated points on the curve will be meaningless. Your model will be a very inefficient hash table that embeds your discrete space.
The second point -- training data density -- is equally important. You will naturally only be able to train on a very sparse sampling *of the encoding space*, but you need to *densely cover the latent space*.
It's only with a sufficiently dense sampling of the latent manifold that it becomes possible to make sense of new inputs by interpolating between past training inputs without having to leverage additional priors. pic.twitter.com/SmRvEN2NXS
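A quick way to see this with the toy manifold from above (a sketch; exact numbers will vary from run to run):

```python
import numpy as np

def mean_error(n_train, seed=0):
    """Fit a cubic to n_train manifold samples; measure error on a dense grid."""
    rng = np.random.default_rng(seed)
    x = rng.random(n_train)
    coeffs = np.polyfit(x, np.sin(3 * x), deg=3)
    x_test = np.linspace(0.0, 1.0, 1000)
    return np.abs(np.polyval(coeffs, x_test) - np.sin(3 * x_test)).mean()

# Denser sampling of the same manifold -> the fitted curve tracks it
# more closely between training points.
print(mean_error(n_train=5), mean_error(n_train=200))
```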
The practical implication is that the best way to improve a deep learning model is to get more data or better data (overly noisy / inaccurate data will hurt generalization). A denser coverage of the latent manifold leads to a model that generalizes better.
This is why *data augmentation techniques*, like exposing a model to variations in image brightness or rotation angle, are an extremely effective way to improve test-time performance. Data augmentation is all about densifying your latent space coverage (by leveraging visual priors).
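For instance, with the Keras preprocessing layers (available in recent TF/Keras versions), a minimal augmentation pipeline along those lines might look like:

```python
from tensorflow import keras

# Random rotation and brightness variations, applied on the fly during
# training, so each epoch sees slightly different variants of every
# image -- a denser sampling of the latent manifold.
augment = keras.Sequential([
    keras.layers.RandomRotation(0.05),   # rotate by up to 5% of a full turn
    keras.layers.RandomBrightness(0.2),  # shift brightness by up to 20%
])

# Usage (hypothetical `images` batch): augmented = augment(images, training=True)
```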