Periodic reminder: never use -O3 unless you've already inspected the assembly generated at -O2 or -Os, are solving a specific issue, and are committing to keep verifying it for each new compiler. https://twitter.com/iblueconnection/status/1201485834828091393
Replying to @rygorous @stephentyrone
Any intuition (or anecdotes) for why -fno-vectorize? Is it that if something is worth vectorizing then library authors usually have? And the optimization just causes noise?
Replying to @chewedwire @stephentyrone
Every time I turn it on our code gets slower and 20+ kb larger, then I find out why, file a bunch of bugs, and turn it off again.
Replying to @rygorous @chewedwire
Compilers are just (bogglingly, infuriatingly) not very good at vectorizing. Ragged counts and alignments are handled very inefficiently, loops tend to be overly unrolled, and any sort of horizontal data motion brings the world to a halt. Compiler people keep telling me that it's a solved problem, but autovectorization only works in practice on trivial examples, where you can write your own implementation in a few minutes that ends up going 20% faster anyway. I have never identified a good reason for it, but the academic community seems to believe that it's "solved", and in industry we mostly have kernels that do the critical workloads already, or it's just stupid all-in-lane homogeneous arithmetic so really dumb compilers are fine, and so it doesn't improve.
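For context, the "write your own implementation in a few minutes" case might look like this — a hypothetical SSE sketch (not code from the thread) of a float sum that handles the two things the thread says compilers do badly: the ragged tail and the horizontal reduction:

```cpp
#include <immintrin.h>
#include <stddef.h>

// Hand-vectorized sum over a float array: 4-wide SSE body, explicit
// horizontal reduction, and a scalar loop for the ragged tail.
static float sum_f32(const float *p, size_t n) {
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));
    // Horizontal reduction — the "horizontal data motion" step.
    __m128 t = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(2, 3, 0, 1)); // lanes 1,0,3,2
    acc = _mm_add_ps(acc, t);
    t = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(1, 0, 3, 2));        // lanes 2,3,0,1
    acc = _mm_add_ps(acc, t);
    float total = _mm_cvtss_f32(acc);
    for (; i < n; i++)   // scalar cleanup for ragged counts
        total += p[i];
    return total;
}
```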
How about overtly parallel language semantics in the style of CUDA/ispc?
Replying to @jckarter @stephentyrone
Yeah, I personally found the best way to use vector instructions is to create a good SIMD library with GLSL-style shuffle syntax, etc. from the beginning and then use it everywhere.
If I do that it’s surprising how often my vector code ends up not only faster than the equivalent scalar code but also clearer.
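A minimal sketch of what such a library wrapper could look like (names and API are illustrative, not from any particular library): a `vec4` with overloaded arithmetic and a compile-time `shuffle` standing in for GLSL-style swizzles.

```cpp
#include <immintrin.h>

// Illustrative GLSL-flavoured SIMD wrapper: arithmetic operators plus a
// compile-time lane shuffle, so vector code reads much like scalar code.
struct vec4 {
    __m128 v;
    vec4() : v(_mm_setzero_ps()) {}
    explicit vec4(__m128 x) : v(x) {}
    vec4(float a, float b, float c, float d) : v(_mm_setr_ps(a, b, c, d)) {}
    float operator[](int i) const { float t[4]; _mm_storeu_ps(t, v); return t[i]; }
};

inline vec4 operator+(vec4 a, vec4 b) { return vec4(_mm_add_ps(a.v, b.v)); }
inline vec4 operator*(vec4 a, vec4 b) { return vec4(_mm_mul_ps(a.v, b.v)); }

// shuffle<X,Y,Z,W>(a) picks source lanes X,Y,Z,W, like GLSL's a.yxwz etc.
template <int X, int Y, int Z, int W>
inline vec4 shuffle(vec4 a) {
    return vec4(_mm_shuffle_ps(a.v, a.v, _MM_SHUFFLE(W, Z, Y, X)));
}

// Example use: a horizontal dot product written with the wrapper.
inline float dot(vec4 a, vec4 b) {
    vec4 p = a * b;
    p = p + shuffle<1, 0, 3, 2>(p);
    p = p + shuffle<2, 3, 0, 1>(p);
    return p[0];
}
```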
FWIW floats are the easy case still; most of my SIMD work is on (narrow) ints. Autovect is almost completely useless with that, but it's also annoyingly library-resistant if you target multiple archs, because there are substantial divergences (both in what exists and what's fast).
The other thing that happens a lot with integer SIMD is changing data type widths repeatedly during a computation, and ISPC/CUDA etc. really can't express that either. Mostly you end up with 32 bits for everything, which is not great.
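The width-changing pattern being described looks something like this hypothetical SSE2 kernel (an assumed example, not from the thread): scaling 16 u8 pixels by a Q8 fixed-point factor passes through three element widths in a handful of instructions.

```cpp
#include <immintrin.h>
#include <stdint.h>

// Scale 16 unsigned bytes by scale_q8/256: widen u8 -> u16, multiply and
// shift in 16 bits, then narrow back to u8 with saturation.
static __m128i scale_u8_q8(__m128i px, uint16_t scale_q8) {
    __m128i zero = _mm_setzero_si128();
    __m128i s    = _mm_set1_epi16((short)scale_q8);
    // widen: interleave with zero to get u16 lanes (low and high halves)
    __m128i lo = _mm_unpacklo_epi8(px, zero);
    __m128i hi = _mm_unpackhi_epi8(px, zero);
    // multiply in 16 bits, shift right 8 to drop the Q8 fraction
    lo = _mm_srli_epi16(_mm_mullo_epi16(lo, s), 8);
    hi = _mm_srli_epi16(_mm_mullo_epi16(hi, s), 8);
    // narrow back to u8 with unsigned saturation
    return _mm_packus_epi16(lo, hi);
}
```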