Question for Twitter! There is a web archive with 15,000 links to downloadable spreadsheets (xlsx), each representing one entity. Say each spreadsheet has 20 rows (years) and 50 variables of interest. All spreadsheets are identically formatted. How do I automate getting it all?
-
IDEALLY what I would like is 50 final output files, one for each variable, where columns are years and rows are each of the 15,000 entities. Ideally, the row names would be the entity names... however, neither the downloaded files nor the URLs contain the entity name.
-
Rather, the entity name is a text field on the website. Actually, there's a non-trivial amount of metadata in the URLs I'd also like to extract. My impression is that this is not very complicated, but I don't know how to do it.
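[A minimal sketch of that scraping step, assuming each archive page shows the entity name in its body and links to one xlsx file. The page URL pattern, the tag the name lives in, and both regexes are hypothetical placeholders that would need to match the real archive's markup:]

import re
import urllib.request

# Hypothetical page URL pattern -- the real archive's pattern would go here.
ARCHIVE_PAGE = "https://example.org/archive/{page_id}"

def scrape_page(page_id):
    """Return (entity_name, xlsx_url, url_metadata) for one archive page."""
    url = ARCHIVE_PAGE.format(page_id=page_id)
    html = urllib.request.urlopen(url).read().decode("utf-8")
    # The entity name is a text field on the page; this tag/pattern is a guess.
    name = re.search(r'<h1 class="entity-name">(.*?)</h1>', html).group(1)
    # The download link for the spreadsheet -- also a guess at the markup.
    xlsx_url = re.search(r'href="([^"]+\.xlsx)"', html).group(1)
    # Metadata embedded in the URL path itself, split into components.
    url_meta = xlsx_url.split("/")
    return name, xlsx_url, url_meta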
-
Replying to @lymanstoneky
You could put together a pandas script to do it in about three hours if you're unfamiliar but want to learn. Or pay someone to do the same in about 30 minutes.
-
pandas is a Python module that more or less mimics R's data frames. There's another module, urllib, that has functions for fetching web pages and files. That's pretty much all you need, along with some simple string manipulation.
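[For concreteness, a sketch of the whole pipeline along those lines -- urllib to fetch, pandas to reshape -- building on the hypothetical scrape_page() above. It assumes a list of page ids and that each spreadsheet has a "Year" column plus 50 identically named variable columns; none of those names are facts about the actual archive:]

import urllib.request
import pandas as pd

page_ids = ["0001", "0002"]  # placeholder for the real list of 15,000 page ids

frames = []
for page_id in page_ids:
    name, xlsx_url, _ = scrape_page(page_id)
    local = f"{page_id}.xlsx"
    urllib.request.urlretrieve(xlsx_url, local)  # keep a local copy of each file
    df = pd.read_excel(local)                    # 20 rows (years) x 50 variables
    df["entity"] = name                          # attach the scraped entity name
    frames.append(df)

# Stack everything into one long table: one row per entity-year.
stacked = pd.concat(frames, ignore_index=True)

# One wide output file per variable: rows are entities, columns are years --
# the 50-output-files shape asked for upthread.
variables = [c for c in stacked.columns if c not in ("entity", "Year")]
for var in variables:
    wide = stacked.pivot(index="entity", columns="Year", values=var)
    wide.to_csv(f"{var}.csv")

[Keeping a local copy of each xlsx means re-runs don't have to hit the archive again; the pivot at the end relies on each entity having one row per year, which holds if the spreadsheets really are identically formatted.]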