Conversation
You will need to switch from legacy SQL to standard SQL on the Google BQ dashboard. Then just try a simple select:
SELECT * FROM `pushshift.rt_reddit.comments` LIMIT 10
The dashboard should be located here: bigquery.cloud.google.com
Replying to
That's great news. The feed should be very close to real-time (usually a 1-3 second delay at most).
Replying to
So is the idea to provide recent comments only, or will you be pulling in the backlog too and letting people utilise partitions?
Replying to
I am starting the real-time ingest now, so going forward people can see what's going on with Reddit on an hourly or daily basis (the table is partitioned by day). There are also the monthly data tables, which include more accurate scores (since they have had time to settle).
This isn't meant for historical research -- more along the lines of hourly or daily recaps and analysis going forward.
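An hourly recap of the kind described above might look roughly like the query below. This is only a sketch: it assumes the table has a TIMESTAMP column named `created_utc`, which the thread does not confirm, and whether the filter also prunes daily partitions depends on how the table is actually partitioned.

```sql
-- Hourly comment counts for the last 24 hours (standard SQL).
-- Assumption: `created_utc` is a TIMESTAMP column; the real column name may differ.
SELECT
  TIMESTAMP_TRUNC(created_utc, HOUR) AS hour_start,
  COUNT(*) AS comment_count
FROM `pushshift.rt_reddit.comments`
WHERE created_utc >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
GROUP BY hour_start
ORDER BY hour_start
```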
Replying to
Why would I use this over, say, the PushShift API? Because I prefer BQ?
Replying to
#bigquery would be far faster if you wanted to do complicated analysis involving regex, etc. With this system, you could very quickly find all comments matching a very specific regex (something the API does not support). So this provides a lot more power in certain areas.
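As a rough illustration of that kind of regex search, the query below uses BigQuery's `REGEXP_CONTAINS` function. The column names `body` and `subreddit` are assumptions about the table's schema, and the pattern is just a placeholder example.

```sql
-- Find comments whose text matches a regular expression (standard SQL, RE2 syntax).
-- Assumption: `body` holds the comment text and `subreddit` the subreddit name.
SELECT subreddit, body
FROM `pushshift.rt_reddit.comments`
WHERE REGEXP_CONTAINS(body, r'(?i)\bbig\s?query\b')  -- case-insensitive "bigquery" or "big query"
LIMIT 100
```

The filter runs inside BigQuery in a single pass, so there is no separate download-then-filter step on the client.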
Replying to
Oh, good point. Instead of doing it as a two-step process of extracting and then filtering a bunch of comments.
I have some very detailed SQL queries that show amazing summaries that I will be sharing soon (still working on the Submission ingest into BigQuery). I'll also post a write-up of all the differences between the API and BQ -- but this is really nice to have.

