If any software engineers who enjoy a little challenge have some time, we'd like to ingest the entire forum history of officer_dot_com. I've started by including some basic code examples to get going. If you would like to help tackle this, I'd be happy to jump on a call with you.
Here is the public repo with a very basic example of grabbing metadata for each forum (there is more metadata that needs to be added but this should get you started).
github.com/pushshift/offi
The first step is to parse all the available metadata for each forum (which is already started in the script under the method "get_forum_data"). Once we get all the metadata for all forums, we iterate through each forum to get all threads. Then we iterate through the threads to get the actual forum posts. So to recap:
1) Finish the get_forum_data method to add all available metadata (Chrome console is your friend here).
2) Create get_thread_data method.
3) Create get_post_data method.
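The three steps above amount to a nested loop: forums, then threads, then posts. Here is a minimal sketch of that shape. It is not the repo's actual code: the `crawl` function, the `id` keys, and the choice to pass the three methods in as callables are all assumptions made for illustration, and the real methods would do the HTTP requests and parsing.

```python
from typing import Callable, Dict, Iterable, Iterator

# Hypothetical record type: each forum, thread, or post is a plain dict.
Record = Dict[str, object]


def crawl(
    get_forum_data: Callable[[], Iterable[Record]],
    get_thread_data: Callable[[Record], Iterable[Record]],
    get_post_data: Callable[[Record], Iterable[Record]],
) -> Iterator[Record]:
    """Yield every post, tagged with the forum and thread it came from.

    Assumes (for illustration only) that forum and thread records
    carry an "id" key.
    """
    for forum in get_forum_data():                # step 1: forum metadata
        for thread in get_thread_data(forum):     # step 2: threads per forum
            for post in get_post_data(thread):    # step 3: posts per thread
                yield {
                    "forum_id": forum["id"],
                    "thread_id": thread["id"],
                    **post,
                }
```

Because `crawl` is a generator, posts can be written out as they arrive instead of being held in memory, which probably matters for a forum this size.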
My estimate is that if you have experience with scraping, this can probably get knocked out in 2-4 hours. I can supply whatever resources you need.
I love that you're crowdsourcing this. If I wasn't absolutely slogged with work, I'd be all over this challenge like orange on an orange.
Yeah I know exactly what you mean! I really want to get this data set and we will eventually -- I just don't have the cycles right now :(
You might want to give this forum a look. There are some interesting posts in there.
Have you found someone to scrape this for you yet? I'm ~1/6 of the way through the 5 million posts, assuming I didn't royally screw things up.
This is what I'm collecting:
[Two images; descriptions not available]