The problem isn’t MongoDB per se (I’m using Python to collect). It’s that I don’t have a queuing solution in place, so the script drops tweets or fails when they come in too quickly. I looked at RabbitMQ but hadn’t managed to get it working. https://twitter.com/faineg/status/1203111652956217349
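A minimal sketch of the decoupling a queue buys here (the names and file path are hypothetical, not the author’s actual script): the stream handler only enqueues, and a separate worker thread drains tweets to disk, so a slow write no longer causes dropped tweets.

```python
import json
import queue
import threading

# Hypothetical sketch: buffer incoming tweets in an in-process queue so the
# stream listener never blocks on disk/database writes.
tweet_queue = queue.Queue(maxsize=10000)

def writer(path, q):
    """Drain the queue to a jsonl file, one tweet per line."""
    with open(path, "a", encoding="utf-8") as f:
        while True:
            tweet = q.get()
            if tweet is None:  # sentinel: shut down cleanly
                q.task_done()
                break
            f.write(json.dumps(tweet) + "\n")
            q.task_done()

worker = threading.Thread(target=writer, args=("stream.jsonl", tweet_queue), daemon=True)
worker.start()

# In the stream callback, just enqueue -- this returns immediately:
# tweet_queue.put(tweet_dict)
```

The same shape works with RabbitMQ or any other broker in place of `queue.Queue`; the point is that ingestion and persistence run at their own speeds.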
Interesting. How are you handling analysis of large collections (1M+ tweets) after the fact?
For most analyses (like this one), mostly a Jupyter notebook. You can read one line at a time from the jsonl file, so you don't need to load it all into memory if it's too big. But even if you then wanted to stick it into Mongo, the jsonl file makes that pretty straightforward.
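A sketch of that line-at-a-time pattern (file name and keyword are assumptions for illustration): a generator parses one tweet per line, so even a 1M+ tweet file never sits in memory all at once.

```python
import json

def iter_tweets(path):
    """Yield one parsed tweet dict at a time from a jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Example: count matching tweets without loading the whole collection.
# count = sum(1 for t in iter_tweets("stream.jsonl")
#             if "drone" in t.get("text", ""))
```

Bulk-loading into Mongo from the same file is just the same loop feeding `insert_many` in batches.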
Although, the best way is probably just to use twarc (h/t @edsu), which is designed for this type of use case. Save it to jsonl, then process it later: $ twarc filter demDronesDoe > stream.jsonl https://github.com/DocNow/twarc
Thanks :-) twarc isn't the prettiest code ever, but it does have a pretty simple CLI, and it can easily be used as a library. I guess the best thing is that it has logic to recover from dropped connections.