Yes - that's actually kind of what causes the problem here - lots of commits that are flushed in one IO, which causes a lot of connections to wait for that one IO (be it directly, or via another notification operation).
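To make the shape of that concrete, here's a minimal sketch in C with pthreads - not PostgreSQL code, and every name in it is made up for illustration. One connection becomes the flusher and issues the single durable write covering a whole batch of commit records; every other committer blocks until that one IO finishes:

    #include <pthread.h>
    #include <stdint.h>

    static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  flush_done = PTHREAD_COND_INITIALIZER;
    static uint64_t flushed_upto = 0;   /* highest WAL position known durable */
    static int      flush_in_progress = 0;

    /* Stand-in for the one durable write (fdatasync(), io_uring, ...). */
    static void
    durable_write_upto(uint64_t pos)
    {
        (void) pos;
    }

    /* Called by every committing connection with its commit record's position. */
    void
    commit_record_flush(uint64_t my_pos)
    {
        pthread_mutex_lock(&flush_lock);
        while (flushed_upto < my_pos)
        {
            if (!flush_in_progress)
            {
                /* Become the flusher; this one IO covers every commit <= my_pos. */
                flush_in_progress = 1;
                pthread_mutex_unlock(&flush_lock);
                durable_write_upto(my_pos);
                pthread_mutex_lock(&flush_lock);
                flush_in_progress = 0;
                if (flushed_upto < my_pos)
                    flushed_upto = my_pos;
                /* Wake everyone who piggybacked on this IO. */
                pthread_cond_broadcast(&flush_done);
            }
            else
            {
                /* Everyone else waits on that single in-flight IO. */
                pthread_cond_wait(&flush_done, &flush_lock);
            }
        }
        pthread_mutex_unlock(&flush_lock);
    }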
For devices doing less than ~15k QD1 durable write IOPS it's not a problem. But after that the async paths are currently losing out :(
My torture workload does lots of tiny transactions, each just 56 bytes in WAL terms. The synchronous path does ~340k TPS with ~26k IOPS; the asynchronous path does ~230k TPS with ~13k IOPS.
With slightly bigger transactions, or slightly higher IO latency, the async path easily wins.
The wider context here is that those 56-byte transactions were designed to be as small as possible in order to represent the worst possible case, which is actually quite unrealistic -- right?
Right. I can still see a smaller slowdown with one-narrow-insert-per-tx. As soon as I go a bit wider than that, the issue first vanishes, and then, going bigger, gets replaced by the benefits of multiple IOs in flight.
But I think there's a real issue that needs to be addressed - local storage latency is going to shrink further, and we don't want to make it even harder to take advantage of that. If this were a 5% regression in this extreme case, I'd not blink, but ...
I also see the overhead causing problems in the cases where AIO wins. I.e. there's a significant overhead in a number of scenarios due to the current "wait for IO to complete" logic in the AIO patchset, but the wins are bigger. But obviously we're missing out on larger gains.
Perhaps it will only be necessary to characterize and understand the problem in the near term. ISTM that the point isn't even the magnitude of the slowdown. ISTM that this is really about having confidence in the long term viability and adaptability of the initial design.
Perhaps I'm too much of a perfectionist, but I don't think I'd want to merge a patch with known regressions of that size around core subsystems. Too likely that there are some regressed, more realistic, workloads.
I think this can be addressed, possibly even with improvements to the primitives!
One problem with getting to a more efficient solution is that lwlock.c's overhead in waking other backends is O(N), and that LWLockAcquireOrWait(), while necessary for some things, has terrible thundering-herd problems.
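For context, the pattern behind that herd is (roughly, paraphrased from memory rather than quoted from the source) the WAL-flush wait loop: a backend that can't get WALWriteLock calls LWLockAcquireOrWait(), and when the current holder releases the lock after its write, all of those sleepers are woken at once so they can recheck whether their record is already flushed - one release pays an O(N) wakeup, and most of the woken backends find nothing to do. The two helpers below are hypothetical stand-ins:

    for (;;)
    {
        if (my_record_already_flushed())        /* hypothetical helper */
            break;

        /*
         * If the lock is held, sleep until the holder releases it rather than
         * queueing for exclusive acquisition.  On release, every backend
         * sleeping here is woken (O(N)); each rechecks the flush position and
         * typically goes straight back to waiting.
         */
        if (!LWLockAcquireOrWait(WALWriteLock, LW_EXCLUSIVE))
            continue;

        /* We got the lock: write and flush WAL up to the requested point. */
        write_and_flush_wal();                  /* hypothetical helper */
        LWLockRelease(WALWriteLock);
        break;
    }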
I intuited that there might be something like that, which was kind of my point. There might have been a factor like it involving a smaller regression, but with similar implications (it's less likely, but possible). The deeper point is about theoretical ideal behavior vs. actual behavior.
Well, everything "we" implement in userspace is going to be a bit less efficient than io_uring supporting granular wakeups. I mainly tried to explain why I would want those. Didn't want to imply we/I cannot narrow (or close) the gap between async / sync paths in userspace.
Fascinating. I need to learn more about pg internals. What’s the best place to start in your opinion?


