25.5 million #Gab posts are now available. I have cleaned up the structure. The file format is ndjson. This is the first (and largest) dump, with additional data coming soon. Working on creating searchable ES indexes.
Location: files.pushshift.io/misc/GAB_posts
#datascience #bigdata #datasets
Replying to
Great work! Very interested. Is there any documentation I can read to understand how the data were collected? Streaming or retrospective scraping? I'm particularly interested in whether the dataset includes deleted posts.
PS: ndjson suggests streaming collection, but I wanted to check.
Replying to
Some of the posts were collected historically, but we are ingesting in real time. I'll be releasing the first "corpus" soon, which will have a better explanation, but you can compare the created time with the retrieved time to get a sense of how quickly each post was captured.
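As a rough illustration of that check, here is a minimal sketch that compares the two timestamps per post. The filename and the field names created_at (ISO 8601 string) and retrieved_on (Unix epoch) are assumptions; verify the actual keys in the dump before running.

```python
import json
from datetime import datetime, timezone

# Sketch: estimate capture latency per post by comparing the creation
# timestamp to the retrieval timestamp. Field names "created_at" and
# "retrieved_on" are assumed and may differ in the actual dump.
with open("GAB_posts.ndjson", encoding="utf-8") as f:
    for line in f:
        post = json.loads(line)
        created = datetime.fromisoformat(post["created_at"].replace("Z", "+00:00"))
        retrieved = datetime.fromtimestamp(post["retrieved_on"], tz=timezone.utc)
        lag = (retrieved - created).total_seconds()
        print(post.get("id"), f"captured after {lag:.0f} seconds")
```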
Replying to
Awesome. This was something I've been meaning to look at, but I was put off because I had no API access, unlike with Twitter. I'm having issues, though. When parsing and flattening, does ndjson expand in memory the way .jsonl files do?
Replying to
ndjson is essentially just JSON blobs separated by newlines, so you can read the file in Python line by line and json.loads each line. If you're going to do some deep analysis, I would wait until the corpus gets released in a bit.
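For example, a minimal sketch of that line-by-line pattern (the filename is a placeholder for wherever the dump is saved locally):

```python
import json

# Sketch: each line is an independent JSON object, so only one post is
# held in memory at a time and memory use stays flat regardless of the
# size of the dump.
with open("GAB_posts.ndjson", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:          # skip any blank lines
            continue
        post = json.loads(line)
        # work with one post at a time, e.g. inspect its keys
        print(sorted(post.keys()))
        break                 # remove this to process the whole file
```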
Replying to
Thanks! It's a shame that I'm a Python noob, so I'd rather stay in R unless it's absolutely necessary to jump ship.
Master , can I read this large ndjson file using a combination of `readLines` and `ndjson::stream_in`, exporting each entry rather than trying to read the whole file at once?
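A minimal sketch of the entry-by-entry export pattern described in that question; it uses Python for consistency with the earlier example rather than the R approach asked about (the same line-by-line idea can be reproduced in R with `readLines` plus `jsonlite::fromJSON` on each element). The filenames and the extracted fields ("id", "created_at", "body") are placeholders, not confirmed keys in the dump.

```python
import csv
import json

# Sketch: flatten and write out each entry as it is read, so the whole
# dump never has to fit in memory. Filenames and field names are
# placeholders; adjust them to the keys actually present in the data.
with open("GAB_posts.ndjson", encoding="utf-8") as src, \
     open("gab_posts_flat.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    writer.writerow(["id", "created_at", "body"])
    for line in src:
        if not line.strip():
            continue
        post = json.loads(line)
        writer.writerow([post.get("id"), post.get("created_at"), post.get("body")])
```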
And yes, as you can tell, I am quite excited about this data! I'd definitely like to read detailed docs about the compilation of the dataset, if possible. Reviewers would ask for it if I were to use this dataset to publish findings.

