Can I get some feedback from people who use Pushshift Reddit data -- should Reddit dumps go into daily files, or should I keep the monthly format? There are pros and cons to both, but I wanted to see what the community thought.
Monthly keeps them more manageable for me. I think it is easier to drill down to daily trends as is.
The one big benefit of daily dumps is that a daily dump, when compressed, would generally fall beneath 512MB in size, which means Cloudflare would cache the files, making them much faster to download (if they remain in cache, which they should).
Personally, I prefer the monthly dumps, since I sometimes need to decompress them before working with them, and having only a handful of files makes that easier to do by hand. The flip side is that daily (or monthly) dumps would be easier to work with from a disk space perspective. Hmm 🤔
Can I vote 1000x for daily dumps? They're better in every way: smaller, better caching, quicker updates, easier to process in parallel on modest hardware. If you can't iterate 1-31 to download each, you're probably not able to do meaningful data science.
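For anyone wondering what "iterate 1-31" might look like in practice, here is a rough Python sketch that builds one URL per day of a month and fetches them with a small thread pool. The base URL, the `RC_YYYY-MM-DD.zst` file naming, and the helper names `daily_dump_urls` and `download_month` are all assumptions made up for illustration; nothing in this thread says what a daily layout would actually look like.

```python
import calendar
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from datetime import date

# ASSUMPTION: placeholder base URL and naming scheme, not a real endpoint.
BASE_URL = "https://example.invalid/reddit/comments/daily"


def daily_dump_urls(year, month, base_url=BASE_URL, ext="zst"):
    """Return one hypothetical dump URL per calendar day of the given month."""
    days_in_month = calendar.monthrange(year, month)[1]  # handles 28/29/30/31
    return [
        f"{base_url}/RC_{date(year, month, day).isoformat()}.{ext}"
        for day in range(1, days_in_month + 1)
    ]


def download_month(year, month, dest_dir, workers=4):
    """Download every daily file for a month in parallel; return local paths."""
    os.makedirs(dest_dir, exist_ok=True)

    def fetch(url):
        # Save each file under its own name inside dest_dir.
        path = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        urllib.request.urlretrieve(url, path)
        return path

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, daily_dump_urls(year, month)))
```

Calling `download_month(2020, 2, "dumps")` would attempt 29 downloads, four at a time. Threads are enough here because the work is network-bound; decompressing and parsing each day's file afterwards could be spread across processes instead.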




