David Page

@dcpage3

Machine learning researcher

Joined April 2018

Tweets

  1. Pinned Tweet
    19 Aug 2019

    Ever wanted to train CIFAR10 to 94% in 26 SECONDS on a single GPU?! In the final post of our ResNet series, we open a bag of tricks and drive training time ever closer to zero... Colab: Blog:

  2. Retweeted
    9 Jan
    Replying to:

    The problem though with "you can always add those tricks to get the numbers up" is that *very* often I see papers that don't do data aug, or don't tune hyper-params, etc, then claim their new idea helps. But then I find it's actually just a poor proxy for the things they skipped

  3. Retweeted
    29 Dec 2019
    Replying to:

    If you're interested in training small accurate nets efficiently, then the best in the world is , and he told all his secrets in this amazing series:

  4. Retweeted
    16 Dec 2019

    Fun fact: More data can hurt in linear regression. (aside: this is about the 3rd time that studying deep learning has taught me something about linear regression).

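    A tiny self-contained illustration of the claim (my own sketch, not from the linked work): with the feature dimension held fixed, the test error of minimum-norm least squares can rise sharply as the number of training samples approaches the dimension, then fall again with more data.

```python
# Hypothetical demo (not from the thread): test MSE of least squares as n grows,
# with feature dimension d fixed. Error typically peaks near n = d.
import numpy as np

rng = np.random.default_rng(0)
d = 50                              # number of features (fixed)
w_true = rng.normal(size=d)         # ground-truth weights
noise = 0.5

def test_mse(n, n_test=2000):
    X = rng.normal(size=(n, d))
    y = X @ w_true + noise * rng.normal(size=n)
    # np.linalg.lstsq returns the minimum-norm solution when the system is underdetermined
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    Xt = rng.normal(size=(n_test, d))
    yt = Xt @ w_true + noise * rng.normal(size=n_test)
    return np.mean((Xt @ w_hat - yt) ** 2)

for n in [10, 25, 45, 50, 55, 75, 150, 500]:
    print(f"n = {n:3d}   test MSE = {test_mse(n):8.2f}")
```
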
  5. Retweeted
    9 Oct 2019
    Replying to:

    I've implemented a simple tool for analyzing the full loss Hessian spectrum here: . I checked how the spectrum changes after removing BatchNorm layers for a ResNet-18 architecture.

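    The link above didn't survive the scrape, so here is a minimal sketch (assuming PyTorch; not the tool from the tweet) of how such spectra are usually probed: Hessian-vector products via double backprop, plus power iteration for the top eigenvalue.

```python
# Minimal Hessian-spectrum probe (illustrative, assumes PyTorch; not the linked tool).
import torch
import torch.nn.functional as F

def hvp(loss, params, vec):
    """Hessian-vector product H @ vec via two backward passes."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(torch.dot(flat_grad, vec), params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hv]).detach()

def top_eigenvalue(loss, params, iters=50):
    """Estimate the largest Hessian eigenvalue by power iteration."""
    n = sum(p.numel() for p in params)
    v = torch.randn(n)
    v /= v.norm()
    eig = 0.0
    for _ in range(iters):
        hv = hvp(loss, params, v)
        eig = torch.dot(hv, v).item()   # Rayleigh quotient (v is unit norm)
        v = hv / (hv.norm() + 1e-12)
    return eig

# Hypothetical usage with a toy model and batch:
model = torch.nn.Linear(10, 2)
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
loss = F.cross_entropy(model(x), y)
print(top_eigenvalue(loss, list(model.parameters())))
```
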
  6. 4 Oct 2019

    This looks promising! Just tried a simple version - backprop the top 40% of each batch by loss - and it knocks 10% off a highly tuned CIFAR10 training time (94% in 24s on 1 GPU as of now!)

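    A rough sketch of what I take the trick to be (a minimal guess, not the author's code): run the forward pass on the full batch, then backprop only the highest-loss 40% of examples.

```python
# Illustrative "backprop the hardest 40%" step (my reconstruction, assumes PyTorch).
import torch
import torch.nn.functional as F

def selective_backprop_step(model, opt, x, y, keep_frac=0.4):
    opt.zero_grad()
    logits = model(x)                                   # forward on the full batch
    per_example = F.cross_entropy(logits, y, reduction="none")
    k = max(1, int(keep_frac * x.size(0)))
    hardest, _ = per_example.topk(k)                    # top 40% of examples by loss
    loss = hardest.mean()
    loss.backward()                                     # backprop only through those
    opt.step()
    return loss.item()

# Hypothetical usage:
model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
print(selective_backprop_step(model, opt, x, y))
```
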
  7. Retweeted
    13 Sep 2019
    Replying to:

    While the BN paper didn't define ICS formally, as long as one understands the proposed mechanism, it's easy to plug in a precise definition of ICS that makes the argument work out (as David did in the blog post).

  8. Retweeted
    12 Sep 2019
    Replying to:

    Actually, there were a few 1998 papers by Nicol Schraudolph on various variable centering schemes for multilayer nets

  9. Retweeted
    12 Sep 2019

    Precisely. (I've since been told by my random matrix theory colleagues at Courant that the distribution of eigenvalues of a random covariance matrix can be obtained in a much simpler manner than with the replica symmetry breaking calculations used for this paper).

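    Presumably the simpler route is the classical Marchenko-Pastur law. A quick empirical check (my addition, purely illustrative) compares the spectrum of a sampled covariance matrix against that density.

```python
# Illustrative comparison (my addition): eigenvalues of X^T X / n vs the Marchenko-Pastur density.
import numpy as np

rng = np.random.default_rng(0)
n, d = 4000, 1000                          # samples, dimension
X = rng.normal(size=(n, d))
eigs = np.linalg.eigvalsh(X.T @ X / n)     # empirical covariance spectrum

q = d / n                                  # aspect ratio
lo, hi = (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2

def mp_density(x):
    """Marchenko-Pastur density for unit-variance entries and ratio q."""
    return np.sqrt(np.maximum((hi - x) * (x - lo), 0.0)) / (2 * np.pi * q * x)

hist, edges = np.histogram(eigs, bins=30, range=(lo, hi), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
for c, h in zip(centers[::6], hist[::6]):
    print(f"eig ~ {c:.2f}: empirical {h:.3f} vs MP {mp_density(c):.3f}")
```
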
  10. Retweeted
    11 Sep 2019

    This is the best distillation of recent (and old!) research on batchnorm I've seen. There is so much to learn about training mechanics by studying this thread and the links it contains.

  11. 11 Sep 2019

    Thanks to for encouraging me to write this up and feedback on an early draft!

  12. 11 Sep 2019
  13. 11 Sep 2019

    So we have given precise experimental meaning to the statement that 'internal covariate shift' limits LRs and that BN works by preventing this... matching the intuition of the original paper!

  14. 11 Sep 2019

    The same directions span the outlying subspace of the Hessian and limit the learning rate. This is unsurprising - a large change in output distribution wreaks havoc with predictions (as with the ‘cat’ neuron above.)

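    The standard quadratic-model argument behind this (my paraphrase, not a quote from the thread): along a Hessian eigendirection with eigenvalue λ, gradient descent scales the error by (1 - ηλ) each step, so the outlying eigenvalues cap the usable learning rate.

```latex
% Quadratic-model sketch (notation assumed: step size \eta, Hessian eigenvalue \lambda).
% Along a Hessian eigendirection, gradient descent gives
\theta_{t+1} = (1 - \eta\lambda)\,\theta_t ,
% which is stable only when |1 - \eta\lambda| < 1, i.e.
\eta < \frac{2}{\lambda_{\max}} .
```
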
  15. 11 Sep 2019

    The biggest impacts come from synchronised changes in distribution throughout the network - giving a precise definition of problematic 'internal covariate shift'!

  16. 11 Sep 2019

    So how to relate outlying eigenvalues to changes in hidden layer distributions? Insight comes from identifying directions in parameter space that produce large changes in output distribution.

  17. 11 Sep 2019

    These results are exciting because they give quantitative insight into how BN aids optimisation! Outlying eigenvalues of the Hessian are removed in a way reminiscent of the old analyses…

  18. 11 Sep 2019

    Recent papers have studied the Hessian of the loss for deep nets experimentally: ( et al), (Papyan). ( et al) compare what happens with and without BN.

  19. 11 Sep 2019

    ...until the Batch Norm paper which proved that this works, allowing training at much higher LRs! According to the old wisdom, this comes from better conditioning of the Hessian.

  20. 11 Sep 2019

    It was understood that it should be helpful to center and normalise hidden layers - not just inputs - but the technique wasn't fully developed…

  21. 11 Sep 2019

    Since this affects all the training examples, it will dominate effects that vary from sample to sample (in the absence of strong input correlations.)
