Very fascinating result here: doing a "for line in sys.stdin" in Python is much slower than sys.stdin.buffer.read, and I'm not sure why. The I/O bottleneck is completely removed when sending chunks of ndjson to multiprocessing workers instead of reading line by line.
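For reference, the line-by-line baseline being compared against is presumably just a loop of this shape; a minimal sketch, with the per-line work left as a placeholder:

import sys

count = 0
for line in sys.stdin:   # text-mode, line-by-line iteration over stdin
    count += 1           # placeholder for the real per-line work
print(count)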
For instance ...
import sys

buf = b''
while True:
    data = sys.stdin.buffer.read(4096 * 64)   # read up to 256 KiB at a time
    if not data:                               # EOF; buf may still hold a final partial line
        break
    pos = data.rfind(b'\n')                    # last newline in this block
    if pos == -1:                              # no complete line yet, keep accumulating
        buf += data
        continue
    chunk = buf + data[:pos + 1]               # complete ndjson lines, ready to hand off
    buf = data[pos + 1:]                       # partial tail, carried into the next read
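To tie this to the multiprocessing part of the tweet, a hedged sketch of handing each assembled chunk to a pool of workers could look like the following; process_chunk is a hypothetical worker, not something from the original code:

import sys
import multiprocessing as mp

def process_chunk(chunk):
    # Hypothetical worker: here it just counts the ndjson records in the chunk.
    return chunk.count(b'\n')

def main():
    pending = []
    with mp.Pool() as pool:
        buf = b''
        while True:
            data = sys.stdin.buffer.read(4096 * 64)
            if not data:
                break
            pos = data.rfind(b'\n')
            if pos == -1:
                buf += data
                continue
            # Hand the complete lines to a worker; keep the partial tail locally.
            pending.append(pool.apply_async(process_chunk, (buf + data[:pos + 1],)))
            buf = data[pos + 1:]
        total = sum(r.get() for r in pending)   # collect before the pool is torn down
    print(total)

if __name__ == '__main__':
    main()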
Assuming that reading stdin hits disk somewhere, in this approach you read 256K of data upfront from "slow media" and then find the line boundaries in memory within that 256K 'data' buffer. "for line in sys.stdin" has to go back to slow media for _each_ line. I assume your code ...
.. above is incomplete, because I'd expect a loop after reading 256KB from stdin in which you keep looking for subsequent lines in 'data' and keep dispatching them to your pipelines. You stop only when what's left is a partial ndjson line, then you "refill" from stdin. HTH.
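A hedged sketch of that refill pattern, assuming a hypothetical handle_line for the per-record dispatch:

import sys

def handle_line(line):
    # Hypothetical dispatch of one ndjson record to a pipeline.
    pass

buf = b''
while True:
    data = sys.stdin.buffer.read(4096 * 64)
    if not data:
        break
    buf += data
    *lines, buf = buf.split(b'\n')   # last element is the partial tail (or b'')
    for line in lines:               # complete lines, handled entirely in memory
        handle_line(line)
if buf:                              # a final record with no trailing newline
    handle_line(buf)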

