Conversation

The cost of -fstack-protector-all is in a higher range than 2% to 5%. There are lots of papers with estimates of the performance cost of SSP in various non-IO-bound micro / macro benchmarks. I'm talking about SSP by itself, without other mitigations like ShadowCallStack or SafeStack enabled.
The added cost isn't quite that much when type-based CFI and ShadowCallStack/SafeStack are already in use, because the cost of indirect calls and returns has already increased. SSP is far less useful in that scenario though. It has a terrible cost vs. benefit trade-off and barely mitigates any modern vulnerabilities.
Building with -fsanitize=bounds,object-size -fsanitize-trap=bounds,object-size -D_FORTIFY_SOURCE=2 detects a substantial portion of stack buffer overflows before they happen, whereas SSP only weakly mitigates linear overflows specifically. SCS doesn't cost much more than SSP but works much better.
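To make the flags above concrete, here is a rough sketch of the two kinds of code they instrument. The helper names and sizes are made up for illustration, and the build line assumes Clang with optimization enabled (`_FORTIFY_SOURCE` needs it); treat it as a starting point, not a recommended configuration.

```c
/* Assumed build line (flags copied from the post, file name hypothetical):
 *   clang -O2 -fsanitize=bounds,object-size \
 *         -fsanitize-trap=bounds,object-size -D_FORTIFY_SOURCE=2 demo.c
 */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLOT_COUNT 8
#define NAME_LEN 16

/* Hypothetical helper: indexes a fixed-size local array.
 * With -fsanitize=bounds the compiler knows the array has SLOT_COUNT
 * elements and can emit a check on idx before the access. With
 * -fsanitize-trap=bounds a failed check executes a trap instruction
 * instead of calling into a sanitizer runtime. */
int read_slot(size_t idx)
{
    int slots[SLOT_COUNT] = {10, 11, 12, 13, 14, 15, 16, 17};
    return slots[idx];
}

/* Hypothetical helper: copies a string into a fixed-size local buffer.
 * With -D_FORTIFY_SOURCE=2 and optimization on, glibc can route strcpy
 * to a checked variant that aborts when the copy would exceed the
 * destination size known at compile time. */
size_t copy_name(const char *src)
{
    char name[NAME_LEN];
    assert(src != NULL);
    strcpy(name, src);
    return strlen(name);
}
```

Calling `read_slot(8)` or passing `copy_name` a string of 16 or more characters is the kind of bug these checks stop at the faulting access, rather than at function return the way a canary does.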
I agree there are stronger & cheaper mitigations, but I’m a little surprised that a few extra pushes, pops, & branches add up to 2-5%. Some people on Twitter claim correctly predicted branches are free (which is dubious). There are lots of CPUs without ARM PAC or Intel CET around.
The cost is primarily reading and writing the canary on the stack. It also reads the expected value from global or thread-local storage, so there are 2 reads and a write, and then a branch based on the 2 reads. The branch is set up so prediction is trivial, but it still wastes CPU resources.
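A hand-written model of that canary sequence, as a sketch only. The guard variable, the struct layout and the function name are all assumptions for illustration. A real compiler places the canary itself and calls a failure handler that aborts, rather than returning an error code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_LEN 16

/* Stand-in for the guard value the runtime keeps in global or
 * thread-local storage (the value here is arbitrary). */
static volatile uint64_t demo_guard = 0x00c0ffee12345678ULL;

/* Hypothetical function with the canary logic written out by hand.
 * The struct forces the canary to sit directly after the buffer so a
 * linear overflow of buf reaches it, which is the case SSP targets. */
int guarded_fill(size_t n)
{
    struct {
        unsigned char buf[BUF_LEN];
        uint64_t canary;
    } frame;

    /* Prologue: load the guard and write a copy into the frame. */
    frame.canary = demo_guard;

    /* Function body: n > BUF_LEN simulates a linear overflow that
     * runs past buf into the canary (kept inside the struct so the
     * model itself has defined behavior). */
    assert(n <= sizeof frame);
    memset(&frame, 0x41, n);

    /* Epilogue: read the stack copy, read the guard again, branch. */
    if (frame.canary != demo_guard) {
        return -1; /* real SSP would abort here instead */
    }
    return 0;
}
```

`guarded_fill(16)` stays within the buffer and returns 0, while `guarded_fill(24)` overwrites the canary and returns -1. Every protected function pays for the store in the prologue and the two loads plus compare in the epilogue even when nothing is wrong.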
That's a lot of overhead to add to every function that uses an array or takes the address of a local (determined after IR optimizations). ShadowCallStack is a replacement for the return address on the stack. It still adds a write because the return address is left on the regular stack for unwinding.
ShadowCallStack has a broader threat model: it protects return addresses in general by keeping them in an isolated memory region, which can be given a separate random base that's far harder to leak. It's cheaper than SSP for a given function, but strong SSP (-fstack-protector-strong) gets to skip many functions.
ShadowCallStack's only redundancy is that it still writes out the standard return address. There are no extra branches and no extra reads. The standard return address is written but never actually read unless there's unwinding, where performance doesn't matter at all.
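The shadow stack behavior described here can be modeled in a few lines. This is a toy model under stated assumptions, not how the compiler implements it: the array, the frame struct and all function names are hypothetical, and return addresses are plain integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHADOW_DEPTH 64

/* Stand-in for the isolated shadow stack region. */
static uintptr_t shadow_stack[SHADOW_DEPTH];
static size_t shadow_top = 0;

/* Stand-in for a regular stack frame that an overflow can reach. */
struct model_frame {
    uintptr_t saved_return;
};

/* Function entry: the return address is written twice, once to the
 * regular frame (kept for unwinders) and once to the shadow stack. */
static void model_call(struct model_frame *f, uintptr_t ret)
{
    assert(shadow_top < SHADOW_DEPTH);
    f->saved_return = ret;
    shadow_stack[shadow_top++] = ret;
}

/* Function exit: one read, from the shadow stack only. The copy in
 * the regular frame is never consulted and there is no compare. */
static uintptr_t model_return(const struct model_frame *f)
{
    (void)f;
    assert(shadow_top > 0);
    return shadow_stack[--shadow_top];
}

/* Simulates an attacker overwriting the on-stack return address
 * between entry and exit, then reports where the function returns. */
uintptr_t return_after_smash(uintptr_t ret, uintptr_t attacker)
{
    struct model_frame f;
    model_call(&f, ret);
    f.saved_return = attacker;
    return model_return(&f);
}
```

`return_after_smash(0x1000, 0xdead)` still yields `0x1000`, because the overwritten copy is simply ignored. Compared with the canary approach there is nothing to check on the way out, only a second store on the way in.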
Hardware shadow stack support like CET is ideal and I think it could easily perform better than not having structured function calls. It probably wouldn't in practice unless they designed the architecture to do function calls that way from the beginning and then built around it.
Not really interested in trying experimental userspace CET SS patches and then figuring out if the support included in GCC/Clang and glibc even works properly. I know how Clang/GCC SSP, Clang CFI and Clang SCS perform on various CPUs but no clue about CET or ARM PAC/BTI/MTE.