(2/4) Most ML people deal with silent errors and slow feedback loops via the "ratchet" approach:
1) Start with a known-working model
2) Record learning curves on a small task (~1 min to train)
3) Make a tiny code change
4) Inspect curves
5) Run full training after ~5 tiny changes
(3/4) Downside of the ratchet approach: some designs can't be reached via small incremental changes. It's also hard to know *which* tiny code changes to make. This is where understanding under/overfitting, regularization, etc. is useful. See @josh_tobin_'s talk: https://www.youtube.com/watch?v=GwGTwPcG0YM
(4/4) Within the ratchet approach, I want more tools and best practices for making feedback loops shorter and for making errors louder. Below is a short list of development speed hacks that I have found useful.
ML dev speed hack #0 - Overfit a single batch - Before doing anything else, verify that your model can memorize the labels for a single batch and quickly bring the loss to zero - This is fast to run, and if the model can't do this, then you know it is broken
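A minimal sketch of this check, assuming a generic `model`, `loss_fn`, and `loader` already exist in your training script (the names are placeholders):

```python
import torch

def overfit_one_batch(model, loss_fn, loader, steps=500, lr=1e-3, tol=1e-3):
    # Freeze a single batch and train on it repeatedly; a healthy model
    # should be able to memorize it and drive the loss to ~0.
    x, y = next(iter(loader))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    # Loud failure if the model can't memorize one batch.
    assert loss.item() < tol, f"failed to overfit one batch (loss={loss.item():.4f})"
```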
ML dev speed hack #1 - PyTorch over TF - Time to first step is faster b/c no static graph compilation - Easier to get loud errors via assertions within the code - Easier to drop into debugger and inspect tensors (TF2.0 may solve some of these problems but is still raw)
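A tiny illustration of why eager execution shortens the loop: each line runs immediately, so an assert fires at the exact spot of the problem and you can inspect values on the fly:

```python
import torch

x = torch.randn(4, 3)
w = torch.randn(3, 2)
h = x @ w                                    # runs eagerly, no graph compilation step
assert not torch.isnan(h).any(), "NaN in h"  # loud error exactly where it happens
print(h.shape, h.mean().item())              # or drop into a debugger here and poke at h
```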
ML dev speed hack #2 - Assert tensor shapes - Wrong shapes due to silent broadcasting or reduction are an extreme hot spot for silent errors; asserting on shapes (in torch or TF) makes them loud - If you're ever tempted to write shapes in a comment, make an assert instead
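For example (shapes and names made up for illustration), replacing a shape comment with an assert on a batched matmul:

```python
import torch

batch, seq, hidden = 32, 128, 512
x = torch.randn(batch, seq, hidden)
attn = torch.randn(batch, seq, seq)

out = torch.bmm(attn, x)
# Instead of the comment "# out: (batch, seq, hidden)", assert it,
# so a silent broadcast or reduction bug fails loudly:
assert out.shape == (batch, seq, hidden), out.shape
```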
ML dev speed hack #3 - Add an ML test to CI - If more than one entrypoint or more than one person is working on the codebase, add a test that runs training for N steps and then checks the loss - If you only have one person and one entrypoint, an ML test in CI is probably overkill
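A sketch of what such a test could look like (pytest style); the toy linear model and random data here are stand-ins for your real entrypoint:

```python
import torch
import torch.nn.functional as F

def test_loss_decreases():
    torch.manual_seed(0)
    model = torch.nn.Linear(10, 2)                        # stand-in for your real model
    x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    start = F.cross_entropy(model(x), y).item()
    for _ in range(50):                                   # "run for N steps"
        opt.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
    assert loss.item() < start, (start, loss.item())      # loss should have gone down
```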
ML dev speed hack #4 - Use ipdb.set_trace() - It's hard to make an ML job take less than 10 seconds to start, which is too slow to maintain flow - Using the ipdb workflow lets you zero in on a bug and play with tensors with a fast feedback loop
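For example, pausing mid-script to play with tensors (assumes ipdb is pip-installed):

```python
import torch
import ipdb

x = torch.randn(8, 16)
ipdb.set_trace()  # execution pauses here: inspect x.shape, try x.mean(), x @ x.T, ...
y = torch.relu(x)
```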
ML dev speed hack #5 - Use nvvp to debug throughput - ML throughput (step time) is one place where we have the tools to make errors loud and feedback fast - You can use torch.cuda.nvtx.range_push to annotate the nvvp timeline to be more readable
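A sketch of annotating a loop with nvtx ranges (toy model for illustration); running the script under something like `nvprof -o profile.nvvp python train.py` and opening the result in nvvp should then show named forward/backward ranges on the timeline:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

for step in range(10):
    torch.cuda.nvtx.range_push("forward")   # named range on the nvvp timeline
    out = model(x)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    out.sum().backward()
    torch.cuda.nvtx.range_pop()
```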
Curious what other folks recommend for speeding up ML development feedback loops and for making errors louder.
cc tweeps who build stuff quickly:
@karpathy, @catherineols, @jeremyphoward, @Thom_Wolf, @hardmaru, @goodfellow_ian, @soumithchintala, @D_Berthelot_ML, @josh_tobin_, @mcleavey, and @AlecRad.