25.5 million #Gab Posts are now available. I have cleaned up the structure. The file format is ndjson. This is the first dump (largest) with additional data coming soon. Working on creating searchable ES indexes.
Location: files.pushshift.io/misc/GAB_posts
#datascience #bigdata #datasets
Great work, mate! Sadly, this dataset is likely to be hugely useful for reflecting on society :S
As an aside, this is the first time I've heard of NDJSON - I'm used to JSONL, and I'm reassured that they seem largely equivalent :)
Do you know of any fast JSON parsers, benchmarks, or other info on how to work with terabytes of NDJSON? Hints appreciated :)
ndjson is just newline-delimited JSON objects, one per line. You can use Python to easily read the data. Ping me if you need any assistance. I have a lot of various scripts lying around.
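For anyone following along, here is a minimal sketch of what reading ndjson in Python can look like, using only the standard library. The helper name `iter_ndjson` and the sample fields `id` and `body` are made up for illustration; they are not taken from the actual dump.

```python
import io
import json


def iter_ndjson(lines):
    """Yield one parsed JSON object per non-empty line (hypothetical helper)."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {line_number}") from exc


# Stand-in for a real file; the field names are assumptions, not the dump's schema.
sample = io.StringIO('{"id": 1, "body": "hello"}\n\n{"id": 2, "body": "world"}\n')
posts = list(iter_ndjson(sample))
```

Since the helper only iterates over lines, an open file handle works the same way as the in-memory sample, and memory use stays at roughly one record at a time even for very large files.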
Sure, I was trying to get processing up to disk speed, but couldn't manage it even with ujson. Maybe I have to try more :)