Pushshift is recompressing all Reddit monthly dumps into a common compression scheme (zst). This should be completed within a few days. Some files were compressed as xz, bz2, etc., which was a bit annoying. Going forward I will use Zstandard for all compression.
That's great, zstd is so much faster to decompress! A bit late, but would you be open to storing N concatenated zst files, where each holds a single subreddit's entries? People could continue using it as one big file, or could build an index to fetch just the subreddits they care about.
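A rough sketch of that layout, for illustration only. It uses gzip from Python's standard library as a stand-in for zstd so the example has no dependencies; concatenated gzip members, like concatenated zstd frames, still decompress as one stream. The subreddit names, records, and helper functions are all made up.

```python
import gzip
import json

# Hypothetical sample records, grouped by subreddit.
records = {
    "askscience": [{"id": "a1", "body": "Why is the sky blue?"}],
    "golang": [
        {"id": "g1", "body": "Generics are here."},
        {"id": "g2", "body": "gofmt all the things."},
    ],
}

def build_archive(records):
    """Compress each subreddit separately, concatenate, and record byte ranges."""
    blob = bytearray()
    index = {}
    for name, rows in sorted(records.items()):
        payload = "".join(json.dumps(r) + "\n" for r in rows).encode("utf-8")
        member = gzip.compress(payload)
        index[name] = (len(blob), len(member))  # (offset, length) in the archive
        blob += member
    return bytes(blob), index

def read_subreddit(blob, index, name):
    """Decompress only the byte range that belongs to one subreddit."""
    offset, length = index[name]
    data = gzip.decompress(blob[offset:offset + length])
    return [json.loads(line) for line in data.decode("utf-8").splitlines()]

archive, index = build_archive(records)

# The whole archive can still be read as one big stream...
all_rows = gzip.decompress(archive).decode("utf-8").splitlines()

# ...or a single subreddit can be fetched through the index.
golang_rows = read_subreddit(archive, index, "golang")
```

With real dumps the index would live in a separate small file, so a reader could seek straight to the byte range it needs.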
That's an interesting idea. Longer term I am planning to make a Go program that accesses binary files where each segment is a dictionary-compressed Reddit comment or submission. Then I can create an index so that one could just grab all comments for a specific subreddit ...
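One way the per-record idea could look, as a sketch under assumptions and not the planned Go program or its file format. Here zlib's preset dictionary (`zdict`) stands in for a zstd dictionary, and the dictionary contents, field names, and helper functions are hypothetical.

```python
import json
import zlib
from collections import defaultdict

# Hypothetical shared dictionary: bytes that recur in almost every record.
SHARED_DICT = b'{"subreddit": "", "author": "", "body": ""}'

def compress_record(record):
    """Compress one record on its own, primed with the shared dictionary."""
    c = zlib.compressobj(zdict=SHARED_DICT)
    return c.compress(json.dumps(record).encode("utf-8")) + c.flush()

def decompress_record(segment):
    """Decompress one segment; the same dictionary must be supplied."""
    d = zlib.decompressobj(zdict=SHARED_DICT)
    return json.loads(d.decompress(segment) + d.flush())

def build_file(comments):
    """Append each compressed record and index its byte range by subreddit."""
    blob = bytearray()
    index = defaultdict(list)
    for comment in comments:
        segment = compress_record(comment)
        index[comment["subreddit"]].append((len(blob), len(segment)))
        blob += segment
    return bytes(blob), dict(index)

def comments_for(blob, index, subreddit):
    """Fetch and decode only the segments listed for one subreddit."""
    return [decompress_record(blob[o:o + n]) for o, n in index.get(subreddit, [])]

comments = [
    {"subreddit": "golang", "author": "alice", "body": "Channels are neat."},
    {"subreddit": "askscience", "author": "bob", "body": "How do magnets work?"},
    {"subreddit": "golang", "author": "carol", "body": "Use gofmt."},
]
blob, index = build_file(comments)
golang_comments = comments_for(blob, index, "golang")
```

A shared dictionary matters here because single comments are too short to compress well in isolation; the dictionary supplies the repeated structure up front.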