Happy Friday! Today I will be dumping the first-quarter 2020 comments to the file repo (files.pushshift.io). It took a while to compress the data, but considerably less time than before thanks to my new AMD 5950X processor (I swear I'm not plugging AMD!). This weekend
I should be finishing up the second and third quarters of 2020, and by early next week I'll have the fourth quarter of 2020 and the first quarter of 2021. This will get the file repositories caught up on Reddit data for researchers.
The file format will be the same as in previous files; however, we are switching to a daily dump pipeline. I will also be providing a simple Python script to make it easier for researchers to download days in bulk from the file repositories.
The filenames will take the formats:
RC_YYYY-MM-DD for comments and
RS_YYYY-MM-DD for submissions.
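For anyone who wants to line up a date range ahead of time, here is a minimal sketch that builds the daily names from the two formats above. It is not the download script mentioned earlier. The function name `daily_dump_names` and its parameters are made up for illustration, and since the file extension and directory layout on the repo aren't stated here, it only produces the base names.

```python
from datetime import date, timedelta


def daily_dump_names(start, end, kinds=("RC", "RS")):
    """Return base filenames for every day from start to end, inclusive.

    Hypothetical helper: "RC" is the comments prefix and "RS" the
    submissions prefix, per the formats above. No file extension or
    URL is added because neither is specified in the announcement.
    """
    names = []
    day = start
    while day <= end:
        for prefix in kinds:
            # date.isoformat() yields YYYY-MM-DD, matching the stated pattern
            names.append(f"{prefix}_{day.isoformat()}")
        day += timedelta(days=1)
    return names


if __name__ == "__main__":
    # List the comment and submission base names for the first week of 2020
    for name in daily_dump_names(date(2020, 1, 1), date(2020, 1, 7)):
        print(name)
```

Pass `kinds=("RC",)` if you only need comments.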
Let me know if you have any questions!
Replying to
Do you have any info on what might cause small differences in counts between historical downloads from the Pushshift API and the archived files?