There are fancier approaches, but generic hash-based load balancing works fine.
Create an outer hash table with as many buckets as there are cores. Each bucket holds its own inner hash table, guarded by a separate lock. The inner tables reuse the hash already computed for the outer table by shifting out the bits that were used to pick the bucket.
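A rough sketch of what I mean, in Python purely for illustration. The names (`StripedMap`, `inner_slots`) are made up, the stripe count is rounded up to a power of two so the low bits can select the stripe, and the inner tables are fixed-size chained buckets with no resizing, which a real implementation would need:

```python
import os
import threading


class StripedMap:
    """Hypothetical lock-striped map: one lock-guarded inner table per stripe."""

    def __init__(self, stripes=None, inner_slots=64):
        n = stripes or os.cpu_count() or 1
        # Assumption: round the stripe count up to a power of two so the
        # low bits of the hash can be used directly as the stripe index.
        self._bits = (n - 1).bit_length()
        self._mask = (1 << self._bits) - 1
        self._inner_slots = inner_slots
        count = 1 << self._bits
        self._locks = [threading.Lock() for _ in range(count)]
        # Each inner table is a fixed list of chains of (key, value) pairs.
        self._tables = [[[] for _ in range(inner_slots)] for _ in range(count)]

    def _locate(self, key):
        # Hash once; treat the result as an unsigned 64-bit value.
        h = hash(key) & 0xFFFFFFFFFFFFFFFF
        stripe = h & self._mask                       # low bits pick the stripe
        slot = (h >> self._bits) % self._inner_slots  # remaining bits pick the slot
        return stripe, slot

    def put(self, key, value):
        stripe, slot = self._locate(key)
        with self._locks[stripe]:  # only this stripe is locked
            chain = self._tables[stripe][slot]
            for i, (k, _) in enumerate(chain):
                if k == key:
                    chain[i] = (key, value)
                    return
            chain.append((key, value))

    def get(self, key, default=None):
        stripe, slot = self._locate(key)
        with self._locks[stripe]:
            for k, v in self._tables[stripe][slot]:
                if k == key:
                    return v
        return default
```

Threads touching keys that land in different stripes never contend on the same lock, and the hash is computed only once per operation.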
I don't think that scheme is relevant to this use case, though, since I don't think you really want more than one lock per object. The goal seems to be simply providing safety rather than a good way to do concurrency with fine-grained locking.
I think Python's issue is mostly CPython extensions.