musl's strlen() is a work of art except for the hopefully-harmless UB http://git.musl-libc.org/cgit/musl/tree/src/string/strlen.c
Replying to @johnregehr
@johnregehr The aliasing? I plan to fix it with an __attribute__((__may_alias__)) type conditional on __GNUC__, fallback to naive if !GNUC.
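
For context, musl's strlen() scans word-at-a-time after an alignment loop, which is where the aliasing UB lives. A minimal sketch of how the fix described above could look, assuming the usual HASZERO word-scan pattern; the #ifdef/fallback layout is inferred from the tweet, not copied from musl:

    #include <string.h>
    #include <stdint.h>
    #include <limits.h>

    #define ALIGN (sizeof(size_t))
    #define ONES ((size_t)-1/UCHAR_MAX)
    #define HIGHS (ONES * (UCHAR_MAX/2+1))
    #define HASZERO(x) (((x)-ONES) & ~(x) & HIGHS)

    size_t strlen(const char *s)
    {
        const char *a = s;
        /* Check bytes until s is word-aligned. */
        for (; (uintptr_t)s % ALIGN; s++) if (!*s) return s-a;
    #ifdef __GNUC__
        /* may_alias lets these word loads legally alias the char data,
         * removing the strict-aliasing UB from the word-at-a-time scan. */
        typedef size_t __attribute__((__may_alias__)) word;
        const word *w;
        for (w = (const void *)s; !HASZERO(*w); w++);
        s = (const void *)w;
    #endif
        /* Naive byte scan: the whole job without __GNUC__, just the tail with it. */
        for (; *s; s++);
        return s-a;
    }
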
Replying to @RichFelker
@johnregehr Actually, for 32-bit words, a naive 16x unrolled byte-based strlen blows away everything else in performance.
Replying to @RichFelker
@johnregehr It consists of 100% correctly-predicted branches until end-of-string, and achieves ~1 cycle/byte even on my slow Atoms.
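
A sketch of the kind of hand-unrolled, byte-based strlen being described (the 16x factor is from the tweet above; the exact shape and name are illustrative). The point is that every byte test uses a fixed offset, so it needs only a compare and a well-predicted branch, with the pointer bumped once per 16 bytes:

    #include <stddef.h>

    size_t strlen_unrolled16(const char *s)
    {
        const char *a = s;
        for (;;) {
            /* Fixed offsets: each test is one compare + one (almost always
             * not-taken) branch; only one pointer update per 16 bytes. */
            if (!s[0])  return s - a;
            if (!s[1])  return s - a + 1;
            if (!s[2])  return s - a + 2;
            if (!s[3])  return s - a + 3;
            if (!s[4])  return s - a + 4;
            if (!s[5])  return s - a + 5;
            if (!s[6])  return s - a + 6;
            if (!s[7])  return s - a + 7;
            if (!s[8])  return s - a + 8;
            if (!s[9])  return s - a + 9;
            if (!s[10]) return s - a + 10;
            if (!s[11]) return s - a + 11;
            if (!s[12]) return s - a + 12;
            if (!s[13]) return s - a + 13;
            if (!s[14]) return s - a + 14;
            if (!s[15]) return s - a + 15;
            s += 16;
        }
    }
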
Replying to @johnregehr
@johnregehr The bad news: compilers won't generate this unrolled code even if you ask them to. You have to write it by hand...
Replying to @RichFelker
@RichFelker gcc 5.3 gives me a nice 8-way unroll with -funroll-all-loops -O3
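
For reference, the naive byte loop the compiler is being asked to unroll here would be something like the following (a guess at the test case, not necessarily the exact code used):

    #include <stddef.h>

    size_t strlen_naive(const char *s)
    {
        const char *a = s;
        while (*s) s++;
        return s - a;
    }
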
Replying to @johnregehr
@johnregehr What's your idea of a "nice unroll"? I get cmpb;leaq;je per iteration. Instead it should just do cmpb;je.
Replying to @RichFelker
@johnregehr This change makes a 2x performance difference, from "slowest reasonable strlen" to "fastest strlen".
Replying to @RichFelker
@RichFelker ok, I was figuring a modern core could run those extra insns in parallel
@johnregehr All 3 in parallel? The cmpb and je (if correctly predicted) do run in parallel and that's how we get 1 cycle/byte.