A common misconception about deep learning is that gradient descent is meant to reach the "global minimum" of the loss while avoiding "local minima". In practice, a deep neural network that got anywhere close to the global minimum of its training loss would be utterly useless: it would be extremely overfit, having effectively memorized the training data.
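To make that claim concrete, here is a minimal NumPy sketch (a hypothetical toy setup, not from the thread): an overparameterized polynomial model with exactly as many parameters as data points, so the global minimum of its squared training loss is exact interpolation of the noise. We jump straight to that minimum in closed form instead of running gradient descent to convergence.

```python
# Hypothetical toy setup (not from the thread): an overparameterized model
# whose global training-loss minimum exactly interpolates noisy data.
import numpy as np

rng = np.random.default_rng(0)

# 10 noisy samples of a simple underlying function.
x_train = np.linspace(-1, 1, 10)
y_train = np.sin(np.pi * x_train) + 0.3 * rng.standard_normal(10)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(np.pi * x_test)  # the clean function we'd like to recover

# Degree-9 polynomial: 10 parameters for 10 points, so the global minimum
# of the squared training loss is exact interpolation. Compute it directly
# via least squares rather than iterating gradient descent.
Phi_train = np.vander(x_train, 10)
Phi_test = np.vander(x_test, 10)
w = np.linalg.lstsq(Phi_train, y_train, rcond=None)[0]

print("train MSE:", np.mean((Phi_train @ w - y_train) ** 2))  # ~0: memorized
print("test  MSE:", np.mean((Phi_test @ w - y_test) ** 2))    # far worse
```

The training loss lands at essentially zero while the test error blows up, which is the tweet's point: stopping short of that global minimum is a feature, not a failure.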
-
If that were the case, then you could keep training a regularized network indefinitely and never overfit, only memorizing the right amount of data! Such a regularization technique would be a major breakthrough (see the other comment about IB).
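For contrast, here is a self-contained sketch of the regularized case (same hypothetical toy problem as above, repeated so this block runs on its own, with an assumed L2 penalty strength): the global minimum of the regularized objective no longer interpolates the noise, so converging to it fully is not the same as memorization.

```python
# Hypothetical toy setup, repeated from the sketch above so this block is
# self-contained: the same interpolation problem, now with an L2 penalty.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 10)
y_train = np.sin(np.pi * x_train) + 0.3 * rng.standard_normal(10)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(np.pi * x_test)
Phi_train, Phi_test = np.vander(x_train, 10), np.vander(x_test, 10)

lam = 1e-2  # assumed penalty strength; the outcome is sensitive to it

# Global minimum of the *regularized* objective
#   ||Phi w - y||^2 + lam * ||w||^2
# in closed form (ridge regression normal equations).
w_reg = np.linalg.solve(
    Phi_train.T @ Phi_train + lam * np.eye(10),
    Phi_train.T @ y_train,
)

print("train MSE:", np.mean((Phi_train @ w_reg - y_train) ** 2))  # > 0 now
print("test  MSE:", np.mean((Phi_test @ w_reg - y_test) ** 2))    # improved
```

The catch, and the tweet's objection: this only "never overfits" if lam happens to suit the data, and a technique that always got that right in advance would indeed be a major breakthrough.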
-
François, why do we use models so big that they need regularization in the first place? Why not just use smaller models and skip the regularization? (btw, I didn't know you had a book about DL, saw it in the other tweet, very nice)