25.5 million #Gab posts are now available. I have cleaned up the structure. The file format is ndjson. This is the first (and largest) dump, with additional data coming soon. I am working on creating searchable Elasticsearch (ES) indexes.
Location: files.pushshift.io/misc/GAB_posts
#datascience #bigdata #datasets
Conversation
Replying to
Great work, mate! Sadly, this dataset is likely to be hugely useful for reflecting on society :S
As an aside, this is the first time I've heard of NDJSON - I'm used to JSONL, and I'm reassured that the two seem largely equivalent :)
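For anyone new to the format: both names describe the same thing, one self-contained JSON object per line. A minimal Python sketch of the idea (the field names here are made up purely for illustration):

    import json

    # NDJSON / JSONL: one standalone JSON object per line.
    sample = '{"id": 1, "body": "first post"}\n{"id": 2, "body": "second post"}'
    for line in sample.splitlines():
        post = json.loads(line)  # each line parses on its own
        print(post["id"], post["body"])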
Do you know of any fast JSON parsers, benchmarks, or other info on how to work with terabytes of NDJSON? Hints appreciated :)
Replying to
You can parse out whatever fields you want with Python. "jq" is a program on Linux / macOS that makes working with ndjson very easy. Ping me personally and I'll help.
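As a starting point, a minimal Python sketch that streams the dump line by line, so memory use stays flat even on very large files. The local file name and the field names ("id", "created_at") are assumptions rather than documented keys; a command like head -n 1 GAB_posts.ndjson | jq 'keys' will show the real ones:

    import json

    # Stream the dump one line at a time instead of loading it whole.
    # File name and field names below are assumptions about the schema.
    with open("GAB_posts.ndjson", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines defensively
            post = json.loads(line)
            print(post.get("id"), post.get("created_at"))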
Replying to
This looks very helpful. I see the file is 3.9 GB, but I'm being told it will take "2 days" to download. I currently have a 146.8 Mbps download speed, so I'm wondering if there is throttling on your end. Thoughts?
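If a transfer stalls, resuming rather than restarting from zero can help. A sketch assuming the server honors HTTP Range requests; the exact download URL and local file name are assumptions based on the location given above:

    import os
    import requests

    # Resume a partial download via an HTTP Range request (assuming the
    # server supports it; a 200 instead of 206 means it does not).
    url = "https://files.pushshift.io/misc/GAB_posts"
    dest = "GAB_posts.ndjson"
    start = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": "bytes=%d-" % start} if start else {}
    with requests.get(url, headers=headers, stream=True, timeout=60) as r:
        r.raise_for_status()
        mode = "ab" if start and r.status_code == 206 else "wb"
        with open(dest, mode) as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)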
Replying to
We're filling in the rest of the data now, but the currently available file is the entire history up through August 29, 2018.
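One way to sanity-check that cutoff is to scan for the latest timestamp in the file. A sketch assuming a "created_at" field holding ISO-8601 strings (which compare correctly as text); both the field name and the format are guesses:

    import json

    # Find the most recent timestamp in the dump; "created_at" and
    # ISO-8601 formatting are assumptions about the schema.
    latest = ""
    with open("GAB_posts.ndjson", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            ts = json.loads(line).get("created_at", "")
            if ts > latest:
                latest = ts
    print("latest post:", latest)  # expect on or before 2018-08-29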
Replying to
Great work! Very interested. Are there any docs I can read to understand how the data were collected? Streaming or retrospective scraping? I'm particularly interested in whether the dataset includes deleted posts.
PS: ndjson suggests streaming collection, but I wanted to check.