Pinned Tweet
Ever wanted to train CIFAR10 to 94% in 26 SECONDS on a single GPU?! In the final post of our ResNet series, we open a bag of tricks and drive training time ever closer to zero... Colab: https://colab.research.google.com/github/davidcpage/cifar10-fast/blob/master/bag_of_tricks.ipynb Blog: https://myrtle.ai/how-to-train-your-resnet-8-bag-of-tricks/ pic.twitter.com/3C0DKV3AAP
David Page Retweeted
The problem though with "you can always add those tricks to get the numbers up" is that *very* often I see papers that don't do data aug, or don't tune hyper-params, etc, then claim their new idea helps. But then I find it's actually just a poor proxy for the things they skipped
David Page Retweeted
If you're interested in training small accurate nets efficiently, then the best in the world is @dcpage3, and he told all his secrets in this amazing series: https://myrtle.ai/how-to-train-your-resnet-1-baseline/
David Page Retweeted
Fun fact: More data can hurt in linear regression. https://arxiv.org/abs/1912.07242 (aside: this is about the 3rd time that studying deep learning has taught me something about linear regression).
David Page Retweeted
I've implemented a simple tool for analyzing the full loss Hessian spectrum here: https://github.com/LeviViana/torchessian . I checked how the spectrum changes after removing BatchNorm layers from a ResNet-18 architecture.
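The core trick behind such tools can be sketched without the library: estimate top Hessian eigenvalues from Hessian-vector products alone, never forming the full Hessian. A toy numpy illustration on a quadratic loss (hypothetical, not the torchessian API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy loss: L(w) = 0.5 * ||X w - y||^2 / n, so the Hessian is X^T X / n
X = rng.normal(size=(200, 10))
y = rng.normal(size=200)
n = len(y)

def grad(w):
    return X.T @ (X @ w - y) / n

def hvp(w, v, eps=1e-4):
    # Hessian-vector product via a finite difference of the gradient
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

def top_eigenvalue(w, iters=100):
    # Power iteration using only Hessian-vector products
    v = rng.normal(size=w.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(w, v)
        v = hv / np.linalg.norm(hv)
    return v @ hvp(w, v)

w = rng.normal(size=10)
lam = top_eigenvalue(w)
exact = np.linalg.eigvalsh(X.T @ X / n).max()  # closed form for this quadratic
print(lam, exact)
```

For deep nets the same power iteration (or Lanczos, for the full spectrum) is run with autograd-based Hessian-vector products instead of finite differences.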
This looks promising! Just tried a simple version - backprop the top 40% of each batch by loss - and it knocks 10% off a highly tuned CIFAR10 training time (94% in 24s on 1 GPU as of now!) https://twitter.com/arXiv_Daily/status/1179865684848726016
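A minimal sketch of the selection step - keeping only the highest-loss fraction of each batch for the backward pass - might look like this (illustrative numpy; the function name and the surrounding training loop are assumptions, not the author's code):

```python
import numpy as np

def select_hardest(per_example_loss, frac=0.4):
    """Indices of the highest-loss fraction of the batch; only these get backprop."""
    batch = len(per_example_loss)
    k = max(1, int(frac * batch))
    return np.argsort(per_example_loss)[-k:]

# Example: a batch of 10 per-example losses
losses = np.array([0.1, 2.3, 0.05, 1.1, 0.7, 0.2, 3.0, 0.4, 0.9, 0.15])
idx = select_hardest(losses, frac=0.4)
print(sorted(idx.tolist()))  # → [1, 3, 6, 8], the four hardest examples
```

The forward pass still runs on the full batch (to score every example), but the backward pass - the expensive part - only touches the selected subset.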
David Page Retweeted
While the BN paper didn't define ICS formally, as long as one understands the proposed mechanism, it's easy to plug in a precise definition of ICS that makes the argument work out (as David did in the blog post).
David Page Retweeted
Actually, there were a few 1998 papers by Nicol Schraudolph on various variable centering schemes for multilayer nets: https://nic.schraudolph.org/bib2html/sort_date.html
David Page Retweeted
Precisely. (I've since been told by my random matrix theory colleagues at Courant that the distribution of eigenvalues of a random covariance matrix can be obtained in a much simpler manner than with the replica symmetry breaking calculations used for this paper.) https://twitter.com/dcpage3/status/1171867628316635137
David Page Retweeted
This is the best distillation of recent (and old!) research on batchnorm I've seen. There is so much to learn about training mechanics by studying this thread and the links it contains. https://twitter.com/dcpage3/status/1171867587417952260
Thanks to @jeremyphoward for encouraging me to write this up and for feedback on an early draft!
So we have given precise experimental meaning to the statement that 'internal covariate shift' limits LRs and that BN works by preventing this... ...matching the intuition of the original paper!
The same directions span the outlying subspace of the Hessian and limit the learning rate. This is unsurprising - a large change in output distribution wreaks havoc with predictions (as with the ‘cat’ neuron above.) pic.twitter.com/3TFLKyWFtD
The biggest impacts come from synchronised changes in distribution throughout the network - giving a precise definition of problematic 'internal covariate shift'! pic.twitter.com/Rx53r4u5Fc
So how to relate outlying eigenvalues to changes in hidden layer distributions? Insight comes from identifying directions in parameter space that produce large changes in output distribution. pic.twitter.com/n8SBfShvn6
These results are exciting because they give quantitative insight into how BN aids optimisation! Outlying eigenvalues of the Hessian are removed in a way reminiscent of the old analyses… pic.twitter.com/B2W9XBm92h
Recent papers have studied the Hessian of the loss for deep nets experimentally: (@leventsagun et al) http://arxiv.org/abs/1611.07476 , http://arxiv.org/abs/1706.04454 ; (Papyan) http://arxiv.org/abs/1811.07062 . (@_ghorbani et al) http://arxiv.org/abs/1901.10159 compare what happens with and without BN.
...until the Batch Norm paper which proved that this works, allowing training at much higher LRs! According to the old wisdom, this comes from better conditioning of the Hessian. pic.twitter.com/lj3BkUHasE
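The conditioning argument is easiest to see in linear regression, where the Hessian is exactly X^T X / n: centering and normalising the features typically shrinks its condition number dramatically. A toy numpy sketch (an illustration of the old wisdom, not an experiment from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)

# Features with wildly different means and scales -> ill-conditioned Hessian
X = rng.normal(size=(500, 5)) * np.array([1.0, 10.0, 0.1, 5.0, 0.01]) + 3.0
H_raw = X.T @ X / len(X)          # Hessian of 0.5 * ||X w - y||^2 / n

# Center and normalise each feature, as batch norm does per channel
Xn = (X - X.mean(axis=0)) / X.std(axis=0)
H_norm = Xn.T @ Xn / len(Xn)

# Normalisation collapses the condition number by orders of magnitude
print(np.linalg.cond(H_raw), np.linalg.cond(H_norm))
```

A better-conditioned Hessian means gradient descent tolerates a much larger learning rate before diverging along the stiffest direction.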
It was understood that it should be helpful to center and normalise hidden layers - not just inputs - but the technique wasn't fully developed…
Since this affects all the training examples, it will dominate effects that vary from sample to sample (in the absence of strong input correlations.)