Pushshift is recompressing all Reddit monthly dumps into a common compression scheme (zst). This should be completed within a few days. Some files were compressed as xz, others as bz2, etc., so it was a bit annoying.
Going forward I will use zstandard for all compression.
Replying to
That's great, zstd is so much faster to decompress!
Bit late, but would you be open to storing N concatenated zst files, where each is a single subreddit's entries?
People could continue using it as one big file, or could build an index to fetch just the subreddits they care about
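A minimal sketch of what that layout could look like, assuming Python with the zstandard package; the file names, index format, and function names here are illustrative, not anything Pushshift actually ships:

```python
import json
import zstandard as zstd

def write_concatenated(subreddit_records, out_path, index_path):
    """subreddit_records: dict of subreddit name -> bytes (e.g. its NDJSON lines)."""
    cctx = zstd.ZstdCompressor(level=19)
    index = {}
    offset = 0
    with open(out_path, "wb") as out:
        for name, raw in subreddit_records.items():
            frame = cctx.compress(raw)  # one complete zstd frame per subreddit
            out.write(frame)
            index[name] = {"offset": offset, "length": len(frame)}
            offset += len(frame)
    with open(index_path, "w") as f:
        json.dump(index, f)

def read_subreddit(name, data_path, index_path):
    """Seek to one subreddit's frame and decompress just that frame."""
    with open(index_path) as f:
        entry = json.load(f)[name]
    with open(data_path, "rb") as f:
        f.seek(entry["offset"])
        frame = f.read(entry["length"])
    return zstd.ZstdDecompressor().decompress(frame)
```

Because each subreddit is a complete zstd frame, the concatenated file still decompresses end to end with stock zstd tools, while the index lets you seek straight to one frame.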
Replying to
That's an interesting idea. Longer term I am planning to make a Go program that accesses binary files where each segment is a dictionary-compressed Reddit comment or submission. Then I can create an index so that one could just grab all comments for a specific subreddit ...
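The author plans to do this in Go; purely as an illustration of the idea, here is a hypothetical Python sketch of that layout, where every record is compressed individually against a trained zstd dictionary and a separate index maps subreddit -> (offset, length) pairs. All names and the dictionary size are assumptions, not the actual design:

```python
import json
import zstandard as zstd

def build_record_file(records, data_path, dict_path, index_path, dict_size=112_640):
    """records: list of (subreddit, json_bytes) pairs, one per comment/submission."""
    records = list(records)
    # Train a shared dictionary on the records themselves (needs a decent sample count).
    d = zstd.train_dictionary(dict_size, [body for _, body in records])
    cctx = zstd.ZstdCompressor(dict_data=d, level=19)
    index, offset = {}, 0
    with open(data_path, "wb") as out:
        for subreddit, body in records:
            frame = cctx.compress(body)  # one dict-compressed segment per record
            out.write(frame)
            index.setdefault(subreddit, []).append((offset, len(frame)))
            offset += len(frame)
    with open(dict_path, "wb") as f:
        f.write(d.as_bytes())
    with open(index_path, "w") as f:
        json.dump(index, f)
```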
I've been playing with this and it looks like it can grab the data and decompress at a rate of 50 Mbps, so I could make an API where people can download specific things (authors, subreddits, etc.) ....
Replying to
Yeah -- the goal would be to eventually have all sorts of social media objects available in that format. On an SSD / NVMe, the index + record lookups are extremely fast. Dictionary compression per object is still decent (~18% of original size).
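A companion read-side sketch for those index + record lookups, using the same assumed layout and names as the hypothetical example above (one seek and one read per record, then a dictionary decompress):

```python
import json
import zstandard as zstd

def read_subreddit_records(subreddit, data_path, dict_path, index_path):
    """Return every decompressed record for one subreddit via index lookups."""
    with open(dict_path, "rb") as f:
        d = zstd.ZstdCompressionDict(f.read())
    dctx = zstd.ZstdDecompressor(dict_data=d)
    with open(index_path) as f:
        index = json.load(f)
    out = []
    with open(data_path, "rb") as data:
        for offset, length in index.get(subreddit, []):
            data.seek(offset)  # index + record lookup: jump straight to the segment
            out.append(dctx.decompress(data.read(length)))
    return out
```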
Technically you could do this sort of thing with PostgreSQL, but this would be far faster because there's much less overhead.
(It's basically a fancy dictionary-compressed key/value store with supporting indexes.)