25.5 million #Gab posts are now available. I have cleaned up the structure. The file format is ndjson. This is the first dump (the largest), with additional data coming soon. Working on creating searchable ES indexes.
Location: files.pushshift.io/misc/GAB_posts
#datascience #bigdata #datasets
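
For anyone who wants to poke at the dump before the searchable indexes are ready, here is a minimal R sketch that streams the compressed file one line at a time instead of loading it all at once. The filename is an assumption; use whatever the directory above actually serves.

```r
# Minimal sketch: stream the xz-compressed ndjson dump line by line.
# "GAB_posts.ndjson.xz" is a placeholder filename.
library(jsonlite)

con <- xzfile("GAB_posts.ndjson.xz", open = "r")
while (length(line <- readLines(con, n = 1L)) > 0) {
  post <- fromJSON(line)   # one JSON object per line
  # ... process `post` here ...
}
close(con)
```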
Replying to
Great work! Very interested. Are there any docs I can read to understand how the data were collected? Streaming or retrospective scraping? I'm particularly interested in whether the dataset includes deleted posts.
PS: ndjson suggests streaming collection, but I wanted to check.
Replying to
Some of the posts were collected historically, but we are also ingesting in real time. I'll be releasing the first "corpus" soon, which will have a better explanation, but you can look at the created time vs. the retrieved time to get a sense of how quickly each post was captured (see the sketch below).
We'd also like to rescan to find deletions, etc.
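
A minimal R sketch of that created vs. retrieved check, sampling the lag over the first 10,000 posts. The field names `created_at` and `retrieved_on` (ISO-8601 string and Unix epoch, respectively) are assumptions; check the actual keys in the dump before relying on this.

```r
# Sketch: estimate how quickly posts were captured after creation.
# Assumes created_at is an ISO-8601 string and retrieved_on is epoch seconds.
library(jsonlite)

con  <- file("GAB_posts.ndjson", open = "r")
lags <- numeric(0)
while (length(line <- readLines(con, n = 1L)) > 0 && length(lags) < 10000) {
  post      <- fromJSON(line)
  created   <- as.POSIXct(post$created_at, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
  retrieved <- as.POSIXct(post$retrieved_on, origin = "1970-01-01", tz = "UTC")
  lags      <- c(lags, as.numeric(difftime(retrieved, created, units = "secs")))
}
close(con)
summary(lags)   # capture delay (seconds) over the sampled posts
```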
Replying to
Awesome. This was something I've been meaning to look at, but I was put off because I had no API access, unlike with Twitter. Having issues, though: when parsing and flattening, does ndjson expand in memory the way .jsonl files do?
After extracting the .xz file, I get a ~90 GB .ndjson file. I'm trying to use ndjson::stream_in() in R on a node with >200 GB of RAM, but I get memory limit errors. However, ndjson::validate() returns `TRUE`, indicating the ndjson file itself is fine.
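
One way around that (a sketch, not the dataset's official tooling) is to avoid materializing the whole flattened table: jsonlite::stream_in() accepts a handler that is called once per page, so only one page sits in memory at a time. The column names in `keep` and the output filename are hypothetical.

```r
# Sketch: process the 90 GB ndjson in 10,000-line pages and write a
# column subset to CSV, so the full flattened table never lives in RAM.
library(jsonlite)

keep  <- c("id", "created_at", "body")   # hypothetical column subset
out   <- file("gab_subset.csv", open = "w")
first <- TRUE

stream_in(
  file("GAB_posts.ndjson"),
  handler = function(df) {
    cols <- intersect(keep, names(df))
    write.table(df[, cols, drop = FALSE], out, sep = ",",
                row.names = FALSE, col.names = first)
    first <<- FALSE                      # write the header only once
  },
  pagesize = 10000
)
close(out)
```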

