Overall I think our parallelism model is sound (per-device forward and backward passes on sub-batches + CPU concat, keep params on CPU)
But it would need auditing, in particular to check whether we correctly place the grads computation on each device (now we leave it to TF)
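A minimal sketch of what such an audit would check, assuming a TF 1.x graph where the model's variables are created with tf.get_variable and pinned to /cpu:0: the batch is split across GPUs, each tower builds its loss and gradients inside its own tf.device scope (so the backward ops are explicitly placed on that GPU rather than left to TF's placer), and the tower gradients are averaged back on the CPU where the params live. The function and names are illustrative, not the actual multi_gpu_model code.

```python
import tensorflow as tf

def data_parallel_grads(build_loss, x, y, gpu_ids):
    """Split the batch over GPUs, compute loss + grads per GPU, average on CPU.

    build_loss(x, y) is assumed to build the model with tf.get_variable and
    variables pinned to /cpu:0, returning a scalar loss. Illustrative only.
    """
    x_splits = tf.split(x, len(gpu_ids))
    y_splits = tf.split(y, len(gpu_ids))
    tower_grads = []
    for i, gpu in enumerate(gpu_ids):
        with tf.device('/gpu:%d' % gpu), tf.name_scope('tower_%d' % i):
            loss = build_loss(x_splits[i], y_splits[i])
            # Backward ops are created inside this device scope, so the grad
            # computation is explicitly placed on this GPU.
            grads = tf.gradients(loss, tf.trainable_variables())
            tower_grads.append(grads)
        # Reuse the (CPU-resident) variables for the remaining towers.
        tf.get_variable_scope().reuse_variables()
    with tf.device('/cpu:0'):
        # Average per-tower gradients on the CPU, next to the parameters.
        avg_grads = [tf.reduce_mean(tf.stack(gs), axis=0)
                     for gs in zip(*tower_grads)]
    return list(zip(avg_grads, tf.trainable_variables()))
```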
Replying to @fchollet @bzamecnik
His NVIDIA DevBox is PCIe x16, Azure is PCIe x8, custom builds are often even worse. Scaling is also impacted by small batch sizes (64 -> 32).
Overall we focus on small batch sizes, as I see huge sample sizes in many real-world applications, including internally at @RossumAi.
We have significant overhead (shuttling the params from CPU to GPU at every step; split + concat). We need time(process(sub_batch)) >> overhead.
Reducing the overhead isn't really doable, so to get a speedup you need to keep your per-sub-batch processing time high (large models or large batches).
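A back-of-the-envelope illustration of the condition above (time(process(sub_batch)) >> overhead), with made-up numbers: a fixed per-step overhead quickly eats the speedup once the per-GPU compute time gets small.

```python
def effective_speedup(t_batch_1gpu, overhead, n_gpus):
    """Speedup over a single GPU given a fixed per-step overhead (same units)."""
    t_sub = t_batch_1gpu / n_gpus          # ideal per-GPU compute time
    return t_batch_1gpu / (t_sub + overhead)

# Illustrative numbers only: 4 GPUs, 5 ms of per-step overhead.
for t in (200.0, 50.0, 10.0):              # ms per full batch on 1 GPU
    print(t, effective_speedup(t, overhead=5.0, n_gpus=4))
# large model/batch (200 ms): ~3.6x; small (10 ms): only ~1.3x
```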
We have ideas to reduce the overhead - https://github.com/rossumai/keras-multi-gpu/blob/master/blog/docs/conclusion.md Most of the overhead is in transferring the samples; first step: StagingArea https://gist.github.com/bzamecnik/f76e480edf98e95ab263fd1a123af7a5
This is just async prefetch to 1 GPU. Then it needs a StagingArea placed at each GPU. We also need async loading into TF (Dataset API / queues).
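Roughly what a per-GPU StagingArea could look like in TF 1.x (tf.contrib.staging), assuming the host-side next_x/next_y tensors come from a tf.data iterator or an input queue: while the currently staged sub-batch is being processed on the GPU, the put op copies the next one over. Names and structure are illustrative only, not the final design.

```python
import tensorflow as tf

def staged_sub_batch(next_x, next_y, gpu):
    """Stage the next sub-batch on a given GPU so the host->device copy
    overlaps with compute on the current one. TF 1.x, illustrative names."""
    with tf.device('/gpu:%d' % gpu):
        area = tf.contrib.staging.StagingArea(
            dtypes=[next_x.dtype, next_y.dtype],
            shapes=[next_x.shape, next_y.shape])
        put_op = area.put([next_x, next_y])  # enqueue the *next* sub-batch
        x, y = area.get()                    # already resident on this GPU
    return x, y, put_op

# Per training step: sess.run([train_op, put_op]) so the copy of the next
# sub-batch overlaps with compute on the current one; run put_op once before
# the first step to prime the pipeline.
```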
Haven't looked at it in too much detail for the time being, but is there a way to package such optimizations as a PR for `multi_gpu_model`?
If we manage to get it right and see performance improvement, yeah, it would be good to make it convenient.
Cool, looking forward to it :)