
Challenge: decompress and process 100k tweets a second. First stage is figuring out the main bottleneck.

    time zstd -qcd tweets.zst > /dev/null

Using Zstandard on one file: 11.47 GB (16,869,703 tweets) decompressed in 112.68 seconds, i.e. ~101.7 MB/s, or 149,713 tweets/s.
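For comparison, a minimal sketch of the same measurement from Python, assuming the zstandard package (pip install zstandard); the filename is just the placeholder from the command above:

    import time
    import zstandard  # assumption: the python-zstandard bindings, not the CLI

    path = "tweets.zst"  # placeholder name from the command above

    # max_window_size is raised in case the file was compressed with --long
    # (long-distance matching), which needs a larger decompression window.
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)

    start = time.perf_counter()
    total = 0
    with open(path, "rb") as fh, dctx.stream_reader(fh) as reader:
        while True:
            chunk = reader.read(8 * 2**20)  # 8 MiB per read
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    print(f"{total / 1e6 / elapsed:.1f} MB/s over {elapsed:.1f}s")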
    time zstd -qcd FH_2020-05-28_20.ndjson.zst | wc -l

Piping into wc -l, we get:

    real    2m52.810s
    user    2m6.414s
    sys     1m46.441s

That's 97,619 tweets per second. We probably can't hold that speed once we do any real processing on the tweets, so the target might have to drop to 75k a second.
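For reference, a naive Python stand-in for wc -l (a sketch of the kind of loop behind the sys.stdin slowdown mentioned below; count.py is a hypothetical script name):

    import sys

    # Feed with: zstd -qcd FH_2020-05-28_20.ndjson.zst | python3 count.py
    # Text-mode sys.stdin decodes UTF-8 and splits lines as it iterates.
    count = 0
    for _ in sys.stdin:
        count += 1
    print(count)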
Things to figure out: 1) Why is Zstandard decompressing so slowly for this particular file (does long-distance matching slow down decompression?) 2) Why the large increase in time for Python's sys.stdin? 3) Benchmark Go. Stay tuned.
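One guess on 2): text-mode sys.stdin pays for UTF-8 decoding and per-line splitting. A sketch that skips both by reading the raw byte stream in large chunks and counting newlines (fast_count.py is a hypothetical name):

    import sys

    # Binary stream: no decoding, no per-line object allocation.
    count = 0
    read = sys.stdin.buffer.read
    while True:
        chunk = read(8 * 2**20)  # 8 MiB per read
        if not chunk:
            break
        count += chunk.count(b"\n")
    print(count)

Usage: zstd -qcd FH_2020-05-28_20.ndjson.zst | python3 fast_count.py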