If you put me in charge of objc_msgSend, I think I’d have the compiler scatter a couple of polymorphic inline caches into the binary at each call site (using indirect jumps to satisfy W^X). I think it’d reduce insn count from 11 to ~6, and it would use the BTB better.
-
Specifically, a call to objc_msgSend would have this fast path:
    cbz self, cache_miss        ; nil check
    ldp typecache, targetcache  ; compiler generates a slot
    ldr isa, self->isa
    cmp typecache, isa
    bne cache_miss
    br targetcache
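To make the control flow of that fast path concrete, here is a toy model in Python. It is only a sketch of the idea, not how the Objective-C runtime or any compiler actually works: the `CallSite` class, its `hits`/`misses` counters, and the use of `getattr` as a stand-in for the full method lookup are all assumptions made for illustration. Unlike the assembly above, the model handles nil directly instead of routing it through the miss path.

```python
class CallSite:
    """Toy model of one call site with a single-entry inline cache."""

    def __init__(self, selector):
        self.selector = selector
        self.cached_type = None    # plays the role of the "typecache" slot
        self.cached_target = None  # plays the role of the "targetcache" slot
        self.hits = 0
        self.misses = 0

    def send(self, receiver, *args):
        if receiver is None:           # nil check (simplified: return at once)
            return None
        cls = type(receiver)           # ldr isa, self->isa
        if cls is self.cached_type:    # cmp typecache, isa
            self.hits += 1
            return self.cached_target(receiver, *args)  # br targetcache
        return self._cache_miss(receiver, cls, args)    # bne cache_miss

    def _cache_miss(self, receiver, cls, args):
        # Slow path: do the full lookup (modelled with getattr), then refill
        # the per-site slot so the next send with the same class hits.
        self.misses += 1
        target = getattr(cls, self.selector)
        self.cached_type, self.cached_target = cls, target
        return target(receiver, *args)
```

A site that always sees the same class misses once and then stays on the fast path; a site that alternates between two classes misses on every send, which is the weakness the replies go on to discuss.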
-
Two problems: One, many call sites really want a two-entry cache (mutable and non-mutable class, for example). Two, it costs too much dirty memory to do this everywhere so you need some way to choose at compile time where to apply it.
-
Yeah, you can chain them to make polymorphic ICs (all JS engines do this). Can you use PGO to determine which call sites are hot?
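A rough Python sketch of what chaining might look like, again as a toy model rather than anything a real runtime does. The class name `PolymorphicCallSite`, the `max_entries` limit of two (picked to match the mutable/non-mutable case mentioned above), and `getattr` as the slow-path lookup are assumptions for illustration.

```python
class PolymorphicCallSite:
    """Toy model of a call site whose inline cache holds a few entries."""

    def __init__(self, selector, max_entries=2):
        self.selector = selector
        self.max_entries = max_entries
        self.entries = []  # chain of (class, target) pairs, checked in order
        self.misses = 0

    def send(self, receiver, *args):
        if receiver is None:
            return None
        cls = type(receiver)
        # Walk the chain: each entry is one compare-and-branch in a real IC.
        for cached_type, target in self.entries:
            if cached_type is cls:
                return target(receiver, *args)
        # Miss: fall back to the full lookup and extend the chain if the
        # site has not yet reached its entry limit.
        self.misses += 1
        target = getattr(cls, self.selector)
        if len(self.entries) < self.max_entries:
            self.entries.append((cls, target))
        return target(receiver, *args)
```

With two entries, a site that alternates between two classes misses only twice in total. Every extra entry costs another slot of writable memory per call site, which is the dirty-memory concern raised above.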
-
It’s hard to believe that devirtualization is never worth it. objc_msgSend is pretty much always a BTB miss and that’s gotta hurt…
-
That’s not really true on newer architectures. Branch prediction with history in practice predicts msgSend fairly well
-
Also I didn't see anything about BTB history in https://xania.org/201602/bpu-part-three … and follow-up posts: it seems like just a plain old address cache...
-
I don’t know much about the Intel side or the specific algorithms on any architecture. If I remember correctly, Apple’s ARM CPU designers were once worried about the high mispredict rates they observed in objc_msgSend, but then they solved it with no software changes.
-
My understanding is that history was always good enough, and the misprediction problem was just that objc_msgSend introduced so many indirect branches that it was blowing out their cache.
-
Thinking about it, I guess what Apple’s ARM CPU is probably doing is keying the BTB off not just PC like Intel does, but the (LR, PC) pair. That would give the behavior you describe, because LR is different for each objc_msgSend call.
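A toy simulation of that guess, in Python. It models a last-target predictor as an unbounded dictionary with no aliasing or history, so it says nothing about how any real CPU behaves; the addresses, the `mispredicts` helper, and the two-call-site trace are all made up for illustration.

```python
def mispredicts(trace, key_fn):
    """Count mispredictions of a last-target predictor indexed by key_fn."""
    btb = {}
    misses = 0
    for lr, pc, target in trace:
        key = key_fn(lr, pc)
        if btb.get(key) != target:  # cold entry or stale target
            misses += 1
        btb[key] = target           # remember the most recent target
    return misses


MSGSEND_BR = 0x1000  # hypothetical address of the `br` inside objc_msgSend

# Two call sites (distinct LR values), each always reaching the same method,
# executed alternately 50 times each.
trace = [
    (0x2000, MSGSEND_BR, 0xA000),
    (0x3000, MSGSEND_BR, 0xB000),
] * 50

pc_only = mispredicts(trace, lambda lr, pc: pc)        # keyed on PC alone
lr_and_pc = mispredicts(trace, lambda lr, pc: (lr, pc))  # keyed on (LR, PC)
```

In this model the PC-only table mispredicts all 100 branches, because both call sites share one entry and keep overwriting it, while the (LR, PC) table mispredicts only the 2 cold ones.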
-
This is especially easy to do on ARM because you have a link register. Anyway, I think you could get the same effect on x86 w/o increasing mem usage by mmap-ing in objc_msgSend at different addresses and having the compiler emit calls to a random one at each site…VIPT abuse ;)
-
The lower-tech trick is to split up lookup and dispatch, so the branch itself always has a unique address. Even that had marginal or inconsistent benefit on x86 in tests, from what I hear.
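Why a unique branch address would help a predictor keyed on PC alone can be sketched with a small Python model. This is an idealized last-target table with made-up addresses, so it cannot explain the marginal results reported on real x86 hardware; `pc_only_mispredicts`, `SITES`, and `SHARED_BR` are hypothetical names.

```python
def pc_only_mispredicts(trace):
    """Count misses for a last-target predictor indexed by branch PC alone."""
    table, misses = {}, 0
    for pc, target in trace:
        if table.get(pc) != target:
            misses += 1
        table[pc] = target
    return misses


# (call-site address, method address) pairs; both values are made up.
SITES = [(0x2000, 0xA000), (0x3000, 0xB000)]
SHARED_BR = 0x1000  # hypothetical address of the one `br` inside objc_msgSend

# Combined lookup and dispatch: every send jumps from the same shared branch.
combined = [(SHARED_BR, target) for _site, target in SITES] * 50

# Split: the lookup returns the target, and each call site then performs its
# own indirect call, so the branch PC differs per site.
split = [(site, target) for site, target in SITES] * 50

combined_misses = pc_only_mispredicts(combined)
split_misses = pc_only_mispredicts(split)
```

In the model, the combined form mispredicts every one of the 100 branches and the split form only the 2 cold ones.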