The biggest problem I've had so far with getting io_uring IO to compete with traditional sync IO is the case where many processes are constantly waiting for one IO to finish. That's the typical case for WAL when the changes made are all small.
The case I can't quite get to compete when done asynchronously is lots of concurrent *tiny* commits that all fit on a single page. There's no good way I have found, so far, for many different processes to wait for that one IO without doing it essentially synchronously.
Does Postgres have the concept of group commit, à la SQL Server?
Yes - that's actually kind of what causes the problem here - lots of commits that are flushed in one IO, which causes a lot of connections to wait for that one IO (be it directly, or via some other notification mechanism).
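Roughly, the synchronous group-commit path has this shape - a simplified sketch, not the actual xlog.c code; LWLockAcquireOrWait(), LWLockRelease() and WALWriteLock are real, the other names are invented for illustration:

    /* Simplified sketch of the sync group-commit shape; names other than
     * the WALWriteLock lwlock calls are made up for illustration. */
    void
    FlushUpTo(XLogRecPtr my_lsn)
    {
        /* Someone else's flush may already have covered our commit record. */
        if (GetFlushedUpTo() >= my_lsn)
            return;

        for (;;)
        {
            if (LWLockAcquireOrWait(WALWriteLock, LW_EXCLUSIVE))
            {
                /* We got the lock: one flush covers our commit and every
                 * commit written behind us - that's the group commit. */
                XLogRecPtr  upto = GetWrittenUpTo();

                pg_fsync_up_to(upto);
                AdvanceFlushedUpTo(upto);
                LWLockRelease(WALWriteLock);
                return;
            }

            /* Lock wasn't free: we slept until the holder released it.
             * Its flush may already have covered us; if not, retry. */
            if (GetFlushedUpTo() >= my_lsn)
                return;
        }
    }

The point being: many backends end up parked on that one lock/flush, which is exactly the "many waiters for one IO" pattern an asynchronous submit/complete split has to reproduce without adding latency.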
For devices doing less than ~15k QD1 durable write IOPS it's not a problem. But beyond that the async paths are currently losing out :(
My torture workload does lots of tiny transactions, each 56 bytes in WAL terms. The synchronous path does ~340k TPS at ~26k IOPS; the asynchronous path does ~230k TPS at ~13k IOPS.
With slightly bigger transactions, or slightly higher IO latency, the async path easily wins.
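For a rough sense of the batching those numbers imply (just my arithmetic on the figures above): ~340k TPS over ~26k IOPS is about 13 commits per WAL flush on the sync path, while ~230k TPS over ~13k IOPS is close to 18 commits per flush on the async path. So the async path actually batches more commits per IO; presumably each batch just completes more slowly, and that latency dominates when every transaction is this tiny.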
The wider context here is that those 56-byte transactions were designed to be as small as possible in order to represent the worst possible case, which is actually quite unrealistic -- right?
Right. I can still see a smaller slowdown with one-narrow-insert-per-tx. As soon as I go a bit wider than that, the issue first vanishes and then, going bigger still, gets replaced by the benefits of multiple IOs in flight.
But I think there's a real issue that needs to be addressed - local storage latency is going to shrink further, and we don't want to make it even harder to take advantage of that. If this were a 5% regression in this extreme case, I wouldn't blink, but ...
I also see the overhead causing problems in cases where AIO wins. I.e., in a number of scenarios there's significant overhead from the current "wait for IO to complete" logic in the AIO patchset, but the wins are bigger than that overhead. Obviously we're still missing out on larger gains, though.
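To make "wait for IO to complete" concrete: a backend that needs an IO someone else has in flight has to block on shared state and be woken once the completion is reaped. Schematically - invented names, the actual patchset API differs, though the ConditionVariable primitives are real PostgreSQL ones:

    /* Sketch only: SharedAioHandle, AIO_COMPLETED and
     * WAIT_EVENT_AIO_COMPLETION are invented for illustration. */
    static void
    WaitForSharedIO(SharedAioHandle *io)
    {
        ConditionVariablePrepareToSleep(&io->cv);
        while (io->state != AIO_COMPLETED)
            ConditionVariableSleep(&io->cv, WAIT_EVENT_AIO_COMPLETION);
        ConditionVariableCancelSleep();
    }

    /* Whoever reaps the completion wakes every waiter - the wakeup work
     * is proportional to the number of backends parked on this IO. */
    static void
    MarkSharedIOComplete(SharedAioHandle *io)
    {
        io->state = AIO_COMPLETED;
        ConditionVariableBroadcast(&io->cv);
    }

Per IO that's cheap, but at ~13k flushes a second with a dozen-plus waiters each, the extra wakeups and context switches presumably add up.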
Perhaps it will only be necessary to characterize and understand the problem in the near term. ISTM that the point isn't even the magnitude of the slowdown. ISTM that this is really about having confidence in the long term viability and adaptability of the initial design.
Perhaps I'm too much of a perfectionist, but I don't think I'd want to merge a patch with known regressions of that size around core subsystems. Too likely that there are some more realistic workloads that regress as well.
I think this can be addressed, even if it means improvements to the primitives!
One problem with getting to a more efficient solution is that lwlock.c's overhead for waking other backends is O(N), and that LWLockAcquireOrWait(), whose use is necessary for some things, has terrible thundering-herd problems.
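Concretely, as I read it: in the flush pattern sketched upthread, when the flusher releases WALWriteLock, every backend parked in LWLockAcquireOrWait() gets woken at once. Each of them is scheduled, re-checks the flushed-up-to position, finds its commit was already covered, and returns - so one flush completion turns into N wakeups and N context switches, nearly all doing no useful work. That's the O(N) cost that gets painful at these commit rates.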