Update on the new Reddit ingest pipeline. The new ingest code is currently being tested for any esoteric bugs but so far things are looking really good.
Here are the targets for the new pipeline:
1) Max delay for indexing of new Reddit data will be targeted at 60 seconds.
That means ideally that new comments and submissions will be available for search within 60 seconds of their creation on Reddit.
2) A median time of less than 15 seconds for newly indexed information. The goal is to have new objects available for search in around 15 seconds over 99% of the time.
3) Reduce latency of all searches to below 250ms.
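A minimal sketch of how the two ingest targets above could be checked against measured delays. The function name, the result keys, and the idea of collecting creation-to-searchable delays as a list of seconds are assumptions for illustration, not part of the actual pipeline. The 250ms search latency target is not covered here.

```python
from statistics import median

# Targets taken from the list above; every name here is hypothetical.
MAX_DELAY_S = 60.0      # target 1: max delay before new data is searchable
MEDIAN_DELAY_S = 15.0   # target 2: median delay below 15 seconds
FAST_SHARE = 0.99       # target 2: searchable within ~15 s over 99% of the time


def check_ingest_targets(delays_s):
    """Compare observed creation-to-searchable delays (seconds) to the targets."""
    if not delays_s:
        raise ValueError("need at least one delay sample")
    # Fraction of objects that became searchable within the 15 second goal.
    fast = sum(1 for d in delays_s if d <= MEDIAN_DELAY_S) / len(delays_s)
    return {
        "max_ok": max(delays_s) <= MAX_DELAY_S,
        "median_ok": median(delays_s) < MEDIAN_DELAY_S,
        "fast_share_ok": fast > FAST_SHARE,
    }
```

Reporting the three checks separately makes it visible when the median looks fine but a slow tail still breaks the 60 second ceiling.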
I will continue testing for any edge-case bugs but my feeling is that this will be ready for production in 2-3 weeks.
Thank you!
Nice work! Any changes to scores, or are they still only updated once, 24 hours after posting?
Good question. I haven't thought that far ahead honestly but if you have some ideas, let me know!
Nah, it's a tricky thing to solve.
Usually, in the systems I've seen, it's a queue of things to check later, with the time of the next check based on the state they're in.
State 0, check in X minutes.
State 1, check in 2 hours.
etc until you hit the max state.
But this means your queue is huge and when asking for things to check, you'll get a big list of things in all sorts of states.
But the benefit is you'll get a "curve" of progression. And usually scores don't change much a day later.
You can even "optimize": if the score is still 0 (or has changed only a little) after a couple of states, remove it from the queue.
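The state-based queue described above could be sketched roughly like this. It is only an illustration: the class and method names, the delay values (the thread gives only "X minutes" and "2 hours"), and the pruning thresholds are all assumptions. A heap ordered by due time is one way to avoid getting back a big list of things in all sorts of states, since only the items that are actually due are returned.

```python
import heapq
import itertools

# Hypothetical schedule: seconds to wait before the next check, indexed by state.
# These values are placeholders, not numbers from the thread.
RECHECK_DELAYS_S = [10 * 60, 2 * 60 * 60, 24 * 60 * 60]
PRUNE_AFTER_STATE = 2   # assumption: how many states count as "a couple"
MIN_SCORE_CHANGE = 1    # assumption: anything below this is "a small change"


class RecheckQueue:
    """Min-heap of (due_time, tiebreak, item_id, state, last_score)."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # keeps ordering stable for equal due times

    def add(self, item_id, now, score=0, state=0):
        due = now + RECHECK_DELAYS_S[state]
        heapq.heappush(self._heap, (due, next(self._tie), item_id, state, score))

    def pop_due(self, now):
        """Return (item_id, state, last_score) for every item due at `now`."""
        due_items = []
        while self._heap and self._heap[0][0] <= now:
            _, _, item_id, state, score = heapq.heappop(self._heap)
            due_items.append((item_id, state, score))
        return due_items

    def record_check(self, item_id, state, old_score, new_score, now):
        """Move the item to its next state, or drop it. Returns True if requeued."""
        next_state = state + 1
        if next_state >= len(RECHECK_DELAYS_S):
            return False  # max state reached, stop checking
        unchanged = abs(new_score - old_score) < MIN_SCORE_CHANGE
        if unchanged and next_state >= PRUNE_AFTER_STATE:
            return False  # score barely moved after a couple of states
        self.add(item_id, now, new_score, next_state)
        return True

    def __len__(self):
        return len(self._heap)
```

A caller would add each new item at state 0, call `pop_due` on a timer, fetch the current score for each returned item, and pass the result to `record_check`.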
These are all great suggestions. There is another huge aspect to the ingest: IO speed, and making sure I don't oversaturate the underlying storage. That isn't much of a concern for Reddit, but for things like Twitter it becomes a real concern.
With the volume of Twitter, Reddit seems like a dev environment in comparison.

