Re last two retweets: IMO the important advance in DL is not any specific architecture or learning method but the use of autodiff to compose many modules and learn them end to end. Which makes arguments like this seem like a red herring.
-
-
Jointly as in one pass as opposed to two separate training runs?
-
Mmm, I think of it more in terms of a compiler optimization like loop unrolling.
-
Disclaimer: I'm not an expert here. But what I meant was: previously if I stacked two models, I'd train one against some intermediate loss function I selected, then use its trained outputs as inputs into another (with some other, final loss function).
-
What autodiff makes easy is training the stack of model1 -> model2 with respect to the same final loss function, which will lead to a differently (and better) trained model1 than if you trained them sequentially.
-
The benefits of this are probably easiest to see with something like image recognition - "how do we optimize our edge detection? I dunno, whichever way makes it easier to tell cats from dogs 7 layers up the stack. Let the gradients sort that out."
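To make the joint-vs-sequential point concrete, here's a minimal sketch (hypothetical toy setup, not anyone's real code): two one-parameter "modules" stacked, with gradients of a single final loss pushed through both via the chain rule - the composition that autodiff frameworks automate at scale.

```python
def forward(w1, w2, x):
    """Two stacked linear 'modules' (stand-ins for model1 -> model2)."""
    h = w1 * x       # module 1's intermediate output
    y_hat = w2 * h   # module 2 consumes module 1's output
    return h, y_hat

def train_end_to_end(x, y, w1=0.1, w2=0.1, lr=0.01, steps=200):
    """Train both modules jointly against one final squared-error loss."""
    for _ in range(steps):
        h, y_hat = forward(w1, w2, x)
        dloss = 2.0 * (y_hat - y)   # d(loss)/d(y_hat)
        dw2 = dloss * h             # gradient for module 2
        dw1 = dloss * w2 * x        # chain rule: module 1 is shaped by the FINAL loss
        w1 -= lr * dw1
        w2 -= lr * dw2
    return w1, w2

# Usage: the product w1 * w2 is driven toward the target output.
w1, w2 = train_end_to_end(x=1.0, y=3.0)
```

Note there was no intermediate loss anywhere: module 1's parameter was updated purely by how it helped the final objective, which is the "let the gradients sort that out" point above.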
End of conversation