When I first saw this a few days ago I thought “peculiar, wonder if it’s true.” Looking at this graph, I wondered what happened at 8-9k epochs (sudden dive).
Replying to @Meaningness @The_Lagrangian
Skimmed bits of paper, got increasingly suspicious that result is artifact of an artificial task. Went looking for the task… found this. pic.twitter.com/MmN7Lq6U3I
Replying to @Meaningness
they mention later on that they also do the analysis for cases with no symmetries
Replying to @The_Lagrangian @Meaningness
the high signal -> high noise transition in training also seems general afaict; in the couple of DL projects I've tried, NNs do in general seem to keep improving after a clear signal is gone
Replying to @The_Lagrangian
Oh, well *that* is interesting. I would like to see it demonstrated on a real task. This paper has gotten a lot of hype, which doesn’t seem deserved. pic.twitter.com/eiulZpOhkK
Replying to @Meaningness @The_Lagrangian
Now I will drink my first a.m. coffee and stop being such a grump.
Replying to @Meaningness
grump is good! that said I think there are a couple of things here: (1) it's hard to expand this analysis past toy-ish data since you need to be able to define an actual distribution over the inputs, and NNs are usually used in cases where we don't have one (images, for instance)
Replying to @The_Lagrangian @Meaningness
(2) I think the explanation accounts for both why NNs are so successful _and_ for why they're so unsuccessful: they can only compress in the dumbest way possible (by randomly jiggling weights)
Replying to @The_Lagrangian
I see… (2) would be exciting if true, because you could explicitly drive for max mutual info instead. (Hasn’t this been tried? I don’t know the literature, but seems a likely move)
Replying to @Meaningness
not that i know of but I am not that familiar with the lit. hopefully one of my DL researcher followers can chime in! (also- goal is for hidden layers to minimize mutual info with input (compression) while maximizing with output (error reduction))
yeah, min, max, whatevs
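[For context: the objective described above (a hidden layer should minimize mutual information with the input while maximizing it with the output) is the information-bottleneck trade-off, and on toy tasks it's typically estimated by binning activations into a histogram. A minimal sketch, assuming discretized 1-D samples; the bin count and the `beta` weight here are illustrative choices, not values from the paper:]

```python
import numpy as np

def mutual_information_bits(x, y, bins=10):
    """Crude plug-in estimate of I(X;Y) in bits from paired samples, via binning."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                    # empirical joint distribution
    px = pxy.sum(axis=1, keepdims=True)      # marginal over x
    py = pxy.sum(axis=0, keepdims=True)      # marginal over y
    nz = pxy > 0                             # skip zero cells to avoid log(0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

def ib_lagrangian(t, x, y, beta=1.0, bins=10):
    """Information-bottleneck objective for a hidden representation t:
    small I(T;X) means compression, large I(T;Y) means predictive power.
    Lower is better for the trade-off min I(T;X) - beta * I(T;Y)."""
    return (mutual_information_bits(t, x, bins)
            - beta * mutual_information_bits(t, y, bins))
```

[The plug-in estimator is biased upward for independent variables, which is one concrete reason the analysis is hard to extend beyond toy data where the input distribution is known.]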