Pushshift is recompressing all Reddit monthly dumps into a common compression scheme (zst). This should be completed within a few days. Some files were compressed as xz, bz2, etc., which was a bit annoying.
Going forward I will use zstandard for all compression.
Replying to
(It takes quite a while to compress 10+ terabytes with high compression)
Replying to
That's great, zstd is so much faster to decompress!
Bit late, but would you be open to storing N concatenated zst files, where each is a single subreddit's entries?
People could continue using it as one big file, or could build an index to fetch just the subreddits they care about.
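A minimal sketch of the concatenation idea above, under stated assumptions. It uses Python's standard-library gzip as a stand-in for zstd, since gzip members can be concatenated and still decompress as one stream in much the same way. The function names, the index layout (subreddit mapped to offset and length), and the newline-delimited JSON record format are all hypothetical, not anything Pushshift has published.

```python
import gzip
import io
import json


def build_archive(records_by_subreddit):
    """Compress each subreddit separately and concatenate the results.

    Returns (archive_bytes, index), where index maps a subreddit name to
    the (offset, length) of its compressed member inside the archive.
    gzip is a stand-in here; the thread is about zstd.
    """
    buf = io.BytesIO()
    index = {}
    for subreddit, records in sorted(records_by_subreddit.items()):
        payload = "".join(json.dumps(r) + "\n" for r in records).encode("utf-8")
        member = gzip.compress(payload)
        index[subreddit] = (buf.tell(), len(member))
        buf.write(member)
    return buf.getvalue(), index


def read_all(archive):
    """Treat the archive as one big file: decompress every member in order."""
    text = gzip.decompress(archive).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]


def read_subreddit(archive, index, subreddit):
    """Use the index to decompress only one subreddit's member."""
    offset, length = index[subreddit]
    member = archive[offset:offset + length]
    text = gzip.decompress(member).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]


# Hypothetical sample data, for illustration only.
sample = {
    "askscience": [{"subreddit": "askscience", "body": "first"}],
    "datasets": [
        {"subreddit": "datasets", "body": "second"},
        {"subreddit": "datasets", "body": "third"},
    ],
}
archive, index = build_archive(sample)
```

With a real dump the archive would live on disk, so fetching one subreddit would be a seek to the offset and a read of the stored length rather than a slice of an in-memory bytes object.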
Replying to
That's an interesting idea. Longer term I am planning to make a Go program that accesses binary files where each segment is a dict-compressed reddit comment or submission. Then I can create an index so that one could just grab all comments for a specific subreddit ...
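The segment-plus-index layout described above could look roughly like the sketch below. This is not the planned Go program. It is written in Python and swaps in zlib's preset-dictionary support (`zdict`) for zstd dictionary compression, because zlib ships with the standard library. The dictionary contents, the function names, and the index shape (subreddit mapped to a list of offset and length pairs) are all assumptions made for illustration.

```python
import json
import zlib
from collections import defaultdict

# Hypothetical shared dictionary: bytes that most records have in common.
# A real dictionary would be trained on sample records.
SHARED_DICT = b'{"subreddit": "", "author": "", "body": ""}'


def compress_record(record):
    """Compress one record on its own, primed with the shared dictionary."""
    compressor = zlib.compressobj(level=9, zdict=SHARED_DICT)
    raw = json.dumps(record).encode("utf-8")
    return compressor.compress(raw) + compressor.flush()


def decompress_record(segment):
    """Decompress one segment; the same dictionary must be supplied."""
    decompressor = zlib.decompressobj(zdict=SHARED_DICT)
    return json.loads(decompressor.decompress(segment) + decompressor.flush())


def build_segments(records):
    """Concatenate per-record segments and index them by subreddit.

    Returns (blob, index), where index maps a subreddit name to a list of
    (offset, length) pairs, one per record.
    """
    blob = bytearray()
    index = defaultdict(list)
    for record in records:
        segment = compress_record(record)
        index[record["subreddit"]].append((len(blob), len(segment)))
        blob += segment
    return bytes(blob), dict(index)


def fetch_subreddit(blob, index, subreddit):
    """Decompress only the segments that belong to one subreddit."""
    return [
        decompress_record(blob[offset:offset + length])
        for offset, length in index.get(subreddit, [])
    ]


# Hypothetical sample records, for illustration only.
records = [
    {"subreddit": "askscience", "author": "a", "body": "one"},
    {"subreddit": "datasets", "author": "b", "body": "two"},
    {"subreddit": "askscience", "author": "c", "body": "three"},
]
blob, index = build_segments(records)
```

Compressing each record separately is what makes random access possible, and the shared dictionary is there to recover some of the ratio that small independent segments would otherwise lose.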

