Re last two retweets: IMO the important advance in DL is not any specific architecture or learning method but the use of autodiff to compose many modules and learn them end to end. Which makes arguments like this seem like a red herring.
Replying to @avibryant
Didn’t model composition (or stacking) exist before autodiff?
Replying to @vitalygordon
The hard part without autodiff is jointly learning the stacked models. Not to say you couldn't derive a procedure to do so but it would be extra work each time.
Replying to @avibryant
Jointly as in one pass as opposed to two separate training runs?
Replying to @vitalygordon @avibryant
Mmm, I think of it more in terms of a compiler optimization like loop unrolling.
Replying to @fdaapproved @vitalygordon
Disclaimer: I'm not an expert here. But what I meant was: previously, if I stacked two models, I'd train the first against some intermediate loss function I selected, then use its trained outputs as inputs to the second (which has some other, final loss function).
What autodiff makes easy is training the stack of model1 -> model2 with respect to the same, final loss function, which will lead to a differently (and better) trained model1 than if you trained them sequentially.
The benefits of this are probably easiest to see with something like image recognition: "How do we optimize our edge detection? I dunno, whatever makes it easier to tell cats from dogs seven layers up the stack. Let the gradients sort that out."
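A minimal sketch of what "jointly" means above, using two toy linear modules and a squared-error loss (all names and numbers are my own illustration, not from the thread). The single final loss is differentiated back through model2 into model1, so model1's weights end up tuned for the final objective rather than an intermediate one. The chain-rule gradients are written out by hand here; an autodiff framework would derive them automatically, which is exactly the per-model "extra work" it saves you.

```python
import numpy as np

# Hypothetical toy example: two stacked linear "modules" trained jointly
# against ONE final loss. Autodiff would generate the gradient lines below
# automatically; they are spelled out to show what "end to end" means.

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))        # inputs
y = x @ rng.normal(size=(4, 1))     # targets from a hidden linear map

W1 = rng.normal(size=(4, 3)) * 0.1  # model1's parameters
W2 = rng.normal(size=(3, 1)) * 0.1  # model2's parameters

lr = 0.1
for _ in range(2000):
    h = x @ W1                      # model1's intermediate output
    pred = h @ W2                   # model2's final output
    err = pred - y
    loss = (err ** 2).mean()        # one final loss for the whole stack

    # Chain rule, by hand (what autodiff automates):
    g_pred = 2 * err / err.size     # dL/dpred
    g_W2 = h.T @ g_pred             # dL/dW2
    g_h = g_pred @ W2.T             # gradient flowing back into model1
    g_W1 = x.T @ g_h                # dL/dW1: model1 is shaped by the FINAL loss

    W1 -= lr * g_W1
    W2 -= lr * g_W2

print(f"final loss: {loss:.4f}")
```

In the sequential scheme from the thread, `W1` would instead be fit against some hand-picked intermediate loss and then frozen; here nothing is frozen, and the gradient decides what the intermediate representation `h` should look like.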