Update on the Pushshift ingest for real-time Reddit data. I got an alert a while ago that comments had stopped ingesting. The reason is a huge increase in spam on Reddit over the past 6 hours. Over 1.2 million comments were made to one subreddit alone.
Here's a look ...
This increase did not overload my ability to ingest; however, a different problem has surfaced. Before I go into more detail, here's a 5-minute window view of the past 48 hours of comment activity on Reddit. You can see the huge spam spike occurring towards the end.
So this brings us to the question of why the ingest stopped ingesting comments. Some quick background: Reddit's API lets you ask for up to 100 objects at a time, and a submission or a comment each count as one object here. Comment ids are sequential, so my ingest asks for 100 ids at a time, sequentially, and moves forward.
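For anyone curious what one of those batched requests looks like, here's a minimal sketch. It assumes Reddit's public /api/info endpoint and t1_ comment fullnames in base 36; the helper name, user agent, and use of the requests library are my own illustration, not the actual ingest code.

```python
import requests

USER_AGENT = "pushshift-ingest-sketch/0.1"  # hypothetical user agent

BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base36(n: int) -> str:
    """Encode an integer as lowercase base 36 (the format Reddit uses for ids)."""
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = BASE36[r] + out
    return out or "0"

def fetch_comments(ids):
    """Ask Reddit's /api/info endpoint for up to 100 specific comment ids.

    Removed spam comments simply don't come back, so the response can hold
    anywhere from zero up to len(ids) comments.
    """
    fullnames = ",".join(f"t1_{to_base36(i)}" for i in ids)
    resp = requests.get(
        "https://www.reddit.com/api/info.json",
        params={"id": fullnames},
        headers={"User-Agent": USER_AGENT},
        timeout=30,
    )
    resp.raise_for_status()
    return [child["data"] for child in resp.json()["data"]["children"]]
```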
Reddit's system will eventually remove spam comments completely from the API. Usually this isn't a problem, because these "holes" in the sequential ids are never larger than 100. However, if they are, the logic used by my ingest fails terribly. The ingest only knows where it left off by the max id it has seen. It then asks for 100 more ids. If it asks for those 100 ids and gets none back, it doesn't know whether the ids were removed or Reddit's API has stalled, so it keeps asking for those 100 new ids while it slowly walks the min id of the range up by one. This logic usually handles the extremely rare situation where there is a hole in Reddit's data because of spam removal. In this latest situation, holes are scattered like landmines across the comment ids. The ingest will slowly walk forward, but depending on how much data was removed on Reddit's end, the only options are to risk data loss and jump ahead, or to remain behind until it walks through this swiss cheese of comment deletions.
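Roughly, that gap-walking loop behaves like this, reusing the hypothetical fetch_comments helper from the sketch above. This is a simplification of the behaviour described in this thread, not the real ingest, which also handles rate limits, retries, and persistence.

```python
import time

def ingest_loop(store, start_id, batch_size=100):
    """Simplified sketch of the gap-walking behaviour described above."""
    max_seen = start_id
    while True:
        # Ask for the next `batch_size` sequential ids past the last max id seen.
        window = range(max_seen + 1, max_seen + 1 + batch_size)
        comments = fetch_comments(window)
        if comments:
            store(comments)
            max_seen = max(int(c["id"], 36) for c in comments)
        else:
            # Empty window: were these ids removed, or has the API not caught up?
            # The ingest can't tell, so it waits and then walks the bottom of
            # the window forward by a single id before asking again.
            time.sleep(5)
            max_seen += 1
```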
More sophisticated ingest logic could be employed here, but as of right now, that's why comments are behind by an hour or two.
Hope that clarifies what's going on.
The next version of the ingest would employ a technique where it checks for ids 2, 4, 8, 16, 32, 64, etc. ahead of the last known max id, so it would know whether it can ignore the gaps completely. Currently the API returns comments in order and I'd like to preserve that.
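A rough sketch of that probing idea, again reusing the hypothetical fetch_comments helper from above. This is my reading of the technique as described here, not the actual next-version code.

```python
def probe_ahead(max_seen, max_exp=12):
    """Check ids 2, 4, 8, ..., 2**max_exp ahead of the last known max id.

    If any probed id is a live comment, Reddit has already issued ids past
    that point, so an empty window below it is a genuine hole (removed spam)
    that can safely be skipped. If every probe misses, the API probably just
    hasn't caught up yet and the ingest should keep waiting.
    Returns the highest live id found, or None.
    """
    probe_ids = [max_seen + 2 ** k for k in range(1, max_exp + 1)]
    live = fetch_comments(probe_ids)  # a single /api/info call covers all probes
    if not live:
        return None
    return max(int(c["id"], 36) for c in live)
```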
