Conversation

Decompressing and processing 100k tweets a second: a challenge. First stage is figuring out the main bottleneck.

    time zstd -qcd tweets.zst > /dev/null

Using Zstandard on one file, this decompresses 11.47 GB (16,869,703 tweets) in 112.68 seconds (~101.7 MB/s, or 149,713 tweets/s).
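
Roughly the same measurement can be reproduced from Python with the python-zstandard package; a minimal sketch, assuming the package is installed (pip install zstandard) and using the file name from above:

    import sys
    import time

    import zstandard  # python-zstandard, assumed installed

    def throughput(path, chunk_size=1 << 20):
        # Stream-decompress the file, discard the bytes, and report MB/s.
        dctx = zstandard.ZstdDecompressor()
        total = 0
        start = time.perf_counter()
        with open(path, "rb") as fh, dctx.stream_reader(fh) as reader:
            while chunk := reader.read(chunk_size):
                total += len(chunk)
        elapsed = time.perf_counter() - start
        print(f"{total / 1e6:,.0f} MB in {elapsed:.2f} s = {total / 1e6 / elapsed:.1f} MB/s")

    if __name__ == "__main__":
        throughput(sys.argv[1])  # e.g. tweets.zst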
    time zstd -qcd FH_2020-05-28_20.ndjson.zst | wc -l

Piping into wc, we get a time of:

    real    2m52.810s
    user    2m6.414s
    sys     1m46.441s

That's 172.81 seconds, so 16,869,703 tweets / 172.81 s ≈ 97,619 tweets per second. Probably can't get this much speed doing any real processing with the tweets, so might have to go for 75k a second.
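
For comparison, the wc -l stage can be mimicked in Python without touching individual lines at all: read binary stdin in large chunks and count newline bytes. A sketch (the 1 MiB chunk size is a guess, not tuned):

    import sys

    # Count newline bytes on binary stdin, one large chunk at a time.
    # Usage: zstd -qcd FH_2020-05-28_20.ndjson.zst | python count_lines.py
    def count_lines(chunk_size=1 << 20):
        buf = sys.stdin.buffer
        total = 0
        while chunk := buf.read(chunk_size):
            total += chunk.count(b"\n")
        print(total)

    if __name__ == "__main__":
        count_lines()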
A Python script that just reads from sys.stdin and does nothing with the data shows a lot of slowdown:

    import sys

    for line in sys.stdin:
        pass

    real    6m20.229s
    user    8m2.420s
    sys     1m51.007s

That's 380.23 seconds, or 44,367 tweets a second.
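
One plausible explanation, not profiled here: iterating sys.stdin goes through Python's text layer, which decodes every byte to str and applies universal-newline handling per line. Iterating the underlying binary buffer skips the decode, so a variant worth timing against the loop above is:

    import sys

    # Same do-nothing loop, but over raw bytes: each line is a bytes
    # object and no UTF-8 decoding happens. Often markedly faster.
    for line in sys.stdin.buffer:
        pass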