In spaCy 2 we switched over to neural network models, so the bottleneck in spaCy comes down to matrix multiplication. Most Python libraries delegate CPU matrix multiplication to numpy, which then delegates it to a low-level library. Which library? Well, that depends. 2/10
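(If you're curious which backend your own install delegates to, numpy can tell you; the exact output format varies between numpy versions.)

```python
import numpy as np

# Prints numpy's build configuration, including which BLAS/LAPACK
# implementation it was linked against (MKL, OpenBLAS, Accelerate, ...).
np.show_config()
```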
There are three main libraries a default numpy install might delegate to. All have different problems:

* Intel MKL: May not perform well on non-Intel CPUs.
* OpenBLAS: Often misdetects my CPU, leading to poor performance.
* Accelerate (for OSX): Crashes if executed from a subprocess.

3/10
Aside from the variation in problems, all of these matrix multiplication libraries will eagerly launch a tonne of threads. Most people see this as a good thing. It makes people happy to see their CPU working hard. But all these threads probably aren't helping you. 4/10
When we used OpenBLAS for matrix multiplication, people kept reporting terrible performance, even though they were running on a 96-core machine and 96 child threads were being launched. The solution, OMP_NUM_THREADS=2, was not exactly obvious. 5/10
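(A sketch of how to see the contention for yourself: the third-party threadpoolctl package, which is an assumption here and not something mentioned in the thread, can cap the BLAS thread pool at runtime, so you can time the same multiplication under different thread counts.)

```python
import time
import numpy as np
from threadpoolctl import threadpool_limits  # pip install threadpoolctl

# A smallish matmul, roughly the scale of per-batch work in an NLP model.
a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)

for n_threads in (1, 2, 4, None):  # None = the library default (often all cores)
    with threadpool_limits(limits=n_threads, user_api="blas"):
        start = time.perf_counter()
        for _ in range(2000):
            a @ b
        elapsed = time.perf_counter() - start
    print(f"threads={n_threads}: {elapsed:.3f}s")
```

On a many-core box you may well find the unlimited default is the slowest line in that output.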
The problem was that OpenBLAS -- like most other matrix multiplication libraries -- launched far too many threads for our relatively small workloads. This just caused a bunch of contention and switching costs, killing performance. 6/10
Piping lots of data through a statistical model is an embarrassingly parallel workload. It's completely backwards to launch lots of threads for the *matrix multiplications*. That's the *lowest* level of computation! You want to parallelise at the *highest* level! 7/10
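(A minimal sketch of what highest-level parallelism looks like with the standard library; the batch contents and the process_batch body are placeholders, not spaCy's API. The key trick is pinning each worker to one BLAS thread before numpy is imported.)

```python
import os

# Cap the BLAS thread pools *before* numpy (or anything built on it)
# gets imported, so each worker process stays single-threaded internally.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

from multiprocessing import Pool

def process_batch(texts):
    # Placeholder for the real per-document work, e.g. running an
    # NLP pipeline over the batch and returning the annotations.
    return [len(text.split()) for text in texts]

if __name__ == "__main__":
    batches = [["one doc", "another doc"], ["a third doc", "a fourth doc"]]
    with Pool(processes=4) as pool:
        results = pool.map(process_batch, batches)
    print(results)
```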
Let's say you want to pipe 1 billion documents through spaCy. Great. Spin up 1,000 worker CPUs, give them 1 million documents each, and you'll be done in a few minutes. There's zero advantage to having an individual worker launch threads. You shouldn't want that. 8/10
The place where the single-threading sucks at the moment is training. I hope a multi-processing solution won't be too hard to implement. I've also got some ideas for a software transactional memory strategy I've been meaning to try. 9/10
With the new models, spaCy is running at around 8000 words per second on an n1-standard-1 machine on Google Compute Engine. This is a bit short of our target of 10k words per second, but still works out to more than 28m words parsed per $0.01, which ain't bad. 10/10
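(The arithmetic behind that last figure, assuming a machine price of roughly $0.01 per hour; that price is an assumption for a preemptible n1-standard-1 at the time, not something stated in the thread.)

```python
words_per_second = 8000
seconds_per_hour = 3600
usd_per_hour = 0.01  # assumed preemptible n1-standard-1 price

words_per_hour = words_per_second * seconds_per_hour   # 28,800,000
words_per_cent = words_per_hour * (0.01 / usd_per_hour)
print(f"{words_per_cent / 1e6:.1f}M words per $0.01")  # ~28.8M
```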
End of conversation
New conversation
I was really looking forward to this fix.
I was trying to POS-tag approx. 750K documents in order to pre-train POS-tag embeddings. I had planned to run this preprocessing step in parallel, but the multithreading configuration was a major bottleneck for the CPU threads.
If you need to run the previous version for some reason, the multi-threading can be disabled by setting the environment variables OMP_NUM_THREADS=1 and MKL_NUM_THREADS=1. If you're using a Jupyter notebook, you can edit os.environ before importing numpy.
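(A minimal version of that workaround, e.g. as the first cell of a notebook; the variables must be set before numpy is first imported, or they have no effect.)

```python
import os

# Must run before numpy is imported anywhere in this process.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np  # BLAS now starts with a single thread
```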
End of conversation
New conversation
Getting OMP_NUM_THREADS=1 tattooed on my neck. Lesser academics will join my gang at conferences for protection.
Thanks for your interesting question! Let's take this offline... then I'm taking you offline, punk.
End of conversation
New conversation
@Safranf check this out!
