Hmm, from the source it looks like a non-vectorized, single-threaded, not particularly cache-friendly loop. How come it is faster than BLAS? How do you compare/measure?
-
-
-
OpenBLAS shows around 25 GFLOP/s. I haven't tried MKL. The secret of this piece of code is that all loop sizes are known at compile time.
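To illustrate the compile-time-sizes point, here is a minimal sketch, not the actual code under discussion. The names `ROWS`, `INNER`, `COLS`, and `matmul_fixed` are made up for this example, and the sizes are arbitrary. Because every trip count is a constant, the compiler is free to fully unroll and vectorize the inner loops.

```c
/* Hypothetical sizes, fixed at compile time (not taken from the original code). */
enum { ROWS = 8, INNER = 16, COLS = 16 };

/* C = A * B for row-major matrices: A is ROWS x INNER, B is INNER x COLS,
 * C is ROWS x COLS. All loop bounds are constants, so the compiler knows
 * every trip count and can unroll/vectorize without runtime size checks.
 * restrict promises the compiler that the three arrays do not overlap. */
static void matmul_fixed(const float *restrict A,
                         const float *restrict B,
                         float *restrict C)
{
    for (int i = 0; i < ROWS; i++) {
        /* Clear the output row before accumulating into it. */
        for (int j = 0; j < COLS; j++)
            C[i * COLS + j] = 0.0f;

        /* The innermost loop walks B and C contiguously,
         * which is the access pattern that vectorizes well. */
        for (int k = 0; k < INNER; k++) {
            const float a = A[i * INNER + k];
            for (int j = 0; j < COLS; j++)
                C[i * COLS + j] += a * B[k * COLS + j];
        }
    }
}
```

Built with something like `gcc -O3 -march=native`, a loop nest of this shape is a candidate for auto-vectorization; whether it actually beats a BLAS call depends on the sizes and the machine.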
-
-
You are using a batch size of 8 though: at that size, if you use dgemm with maxed-out threads, the overhead kills you. Neat code though, and a cool pedagogical tool.
-
Possibly. I've noticed that the overhead of a single call to sgemm/dgemm is at least one order of magnitude higher. The nice thing about this gcc-optimized version is that even very small matrices (16x16 or so) are already quite fast. With batch=1 I get over 50% of peak FLOPS (i5, AVX2).
-
-
Looks fab *but* wouldn't splitting all this massive code into smaller functions make it more readable?