A common misconception about deep learning is that gradient descent is meant to reach the "global minimum" of the loss while avoiding "local minima". In practice, a deep neural network anywhere close to the global minimum of its training loss would be utterly useless (extremely overfit).
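A toy illustration of this point (my sketch, not from the thread, using a high-degree polynomial as a stand-in for an over-parameterized model): driving the training loss all the way to its global minimum means interpolating the noise, which wrecks test performance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a simple underlying function
x_train = np.linspace(-1, 1, 15)
y_train = np.sin(3 * x_train) + 0.3 * rng.normal(size=x_train.size)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(3 * x_test)

def fit_and_eval(degree):
    # Least-squares polynomial fit of the given degree
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

# Degree 14 can interpolate all 15 points: training loss reaches
# (numerically) its global minimum, but the fit tracks the noise
for degree in (3, 14):
    train_mse, test_mse = fit_and_eval(degree)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The degree-14 fit drives the training MSE essentially to zero while its test MSE gets worse, which is the sense in which "reaching the global minimum" is the opposite of what we want.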
-
An important research direction would be to use the information bottleneck principle to come up with models that have exactly the right amount of memorization capacity for a given task, as well as optimization methods to get to the global optimum.
-
Most "loss functions" are pretty primitive. I mean the way the "loss" is made to depend on the "parameters" you put into your model. Not all parameters are created equal!
-
Why is gradient descent still effective for deep learning? After all, it is a method for finding critical points, yet it seems like the places that generalize well are more or less random points on the loss surface.
-
If driving down the empirical risk does not mean anything in terms of generalization, why is gradient descent superior to random search?
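One partial answer (a toy sketch of my own, not from the thread): whatever we think of the minima it finds, gradient descent exploits local slope information at every step, so its progress per function evaluation scales far better than blind random search, especially in high dimension. A minimal comparison on a simple quadratic bowl:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100  # modest dimensionality; real networks are far larger

def loss(w):
    # Simple convex stand-in for an empirical risk surface
    return float(np.sum(w ** 2))

def grad(w):
    return 2 * w

# Gradient descent: 100 steps, each guided by the local gradient
w = rng.normal(size=dim)
for _ in range(100):
    w -= 0.1 * grad(w)
gd_loss = loss(w)

# Naive random search: the same budget of 100 evaluations,
# but each candidate is drawn blindly with no slope information
best = loss(rng.normal(size=dim))
for _ in range(99):
    best = min(best, loss(rng.normal(size=dim)))

print(f"gradient descent: {gd_loss:.3e}, random search best: {best:.3e}")
```

Even on this trivially easy landscape, random search barely improves on its first guess at 100 dimensions, while gradient descent contracts the loss geometrically; this gap only widens with dimension.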