The challenge: decompress and process 100k tweets per second. The first stage is figuring out the main bottleneck.
time zstd -qcd tweets.zst > /dev/null
Using Zstandard on one file, it decompresses 11.47 GB (16,869,703 tweets) in 112.68 seconds (~101.7 MB/s, or 149,713 tweets/s).
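For comparison, here is a minimal sketch of measuring the same tweets-per-second figure in-process, assuming the python-zstandard package (pip install zstandard); the file name is illustrative:

import time
import zstandard

# Stream-decompress the file and count newline-delimited tweets.
def bench(path):
    dctx = zstandard.ZstdDecompressor()
    lines = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        for chunk in dctx.read_to_iter(f):
            lines += chunk.count(b"\n")
    elapsed = time.perf_counter() - start
    print(f"{lines} tweets in {elapsed:.1f}s ({lines / elapsed:,.0f} tweets/s)")

bench("tweets.zst")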
time zstd -qcd FH_2020-05-28_20.ndjson.zst | wc -l
Piping the decompressed output into wc -l, we get:
real    2m52.810s
user    2m6.414s
sys     1m46.441s
(97,619 tweets per second).
We probably can't sustain that speed while doing any real processing on the tweets, so we might have to settle for 75k per second.
A Python script that just reads from sys.stdin and does nothing with the data shows a big slowdown:
import sys

for line in sys.stdin:
    pass
real    6m20.229s
user    8m2.420s
sys     1m51.007s
(44,367 tweets a second)
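One guess is that the cost is per-line iteration and UTF-8 decoding rather than the pipe itself. Here's a sketch that reads sys.stdin.buffer in large binary chunks and only counts newlines (chunk size is arbitrary):

import sys

# Consume stdin as raw bytes in 1 MiB chunks and count newlines,
# skipping text decoding and per-line string creation.
count = 0
while True:
    chunk = sys.stdin.buffer.read(1 << 20)
    if not chunk:
        break
    count += chunk.count(b"\n")
print(count)

Run it the same way as the wc -l test, e.g. zstd -qcd FH_2020-05-28_20.ndjson.zst | python3 count_lines.py (the script name is hypothetical).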
Things to figure out:
1) Why is zstandard decompressing so slowly for this particular file? (Does long-distance matching slow down decompression? See the sketch after this list.)
2) Why the large increase in time for Python sys.stdin?
3) Benchmark Go.
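For 1), here is a sketch that inspects the frame header with python-zstandard; a very large window size would be consistent with long-distance matching having been enabled at compression time (assuming the first frame is representative of the file):

import zstandard

# Read enough bytes to cover the zstd magic number plus the maximum
# frame header, then report the frame parameters.
with open("FH_2020-05-28_20.ndjson.zst", "rb") as f:
    header = f.read(18)
params = zstandard.get_frame_parameters(header)
print("content size:", params.content_size)
print("window size:", params.window_size)
print("has checksum:", params.has_checksum)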
Stay tuned.
1) Actually, zstandard is going much faster than 100 MB/s: it's reading the original compressed data at that speed, but writing out decompressed data much faster (> 1 GB/s).
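A rough way to check that, again assuming python-zstandard (file name illustrative), is to compare compressed bytes consumed with decompressed bytes produced:

import time
import zstandard

# Decompress the whole file, tracking input (compressed) and output
# (decompressed) byte counts over wall-clock time.
with open("tweets.zst", "rb") as f:
    reader = zstandard.ZstdDecompressor().stream_reader(f)
    out_bytes = 0
    start = time.perf_counter()
    while True:
        chunk = reader.read(1 << 20)
        if not chunk:
            break
        out_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    in_bytes = f.tell()  # compressed bytes read from the file
print(f"in:  {in_bytes / elapsed / 1e6:.1f} MB/s (compressed)")
print(f"out: {out_bytes / elapsed / 1e6:.1f} MB/s (decompressed)")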
