We had the same issue! The closest we could ever get to an estimate was via this paper: conferences.sigcomm.org/imc/2011/docs/
Thanks Megan! That is helpful. Also, did you see my tweet about the structure of YouTube IDs? I'm not entirely convinced that their IDs are truly random. If there were any structure to the IDs, it might give clues about the level of activity in a given time range.
That's interesting -- my assumption was that they're random. I think that's the assumption the aforementioned paper relies on as well... I think it's entirely possible that the IDs are randomly selected since the ID space is so large, but I have no clue if that's true...
In my previous tweet, I showed that the 11-character IDs are a slightly modified base64 representation of a 64-bit integer. We know Twitter uses the Snowflake algorithm to embed a timestamp plus datacenter and server/node IDs, so I'm wondering whether YouTube's ID scheme does something similar.
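A rough sketch of what that decoding could look like, assuming the "slightly modified" base64 is the URL-safe alphabet (`-` and `_` in place of `+` and `/`) with the padding stripped. The function names are made up for illustration. The `snowflake_fields` helper splits an integer using Twitter's Snowflake layout (41-bit millisecond timestamp, 5-bit datacenter, 5-bit worker, 12-bit sequence); applying it to a YouTube ID is only a probe for structure, since nothing here shows that YouTube uses that layout.

```python
import base64

# Twitter's Snowflake epoch in milliseconds (2010-11-04 UTC).
TWITTER_EPOCH_MS = 1288834974657


def youtube_id_to_int(video_id: str) -> int:
    """Decode an 11-character ID into an unsigned 64-bit integer.

    Assumes URL-safe base64 without padding: 11 characters plus one '='
    decode to exactly 8 bytes.
    """
    if len(video_id) != 11:
        raise ValueError("expected an 11-character ID")
    raw = base64.urlsafe_b64decode(video_id + "=")
    return int.from_bytes(raw, "big")


def int_to_youtube_id(value: int) -> str:
    """Inverse of youtube_id_to_int (same assumptions)."""
    raw = value.to_bytes(8, "big")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def snowflake_fields(value: int) -> dict:
    """Split a 64-bit integer using Twitter's Snowflake bit layout.

    This is Twitter's scheme, used here only as a hypothesis to test
    against; it is not known to describe YouTube IDs.
    """
    return {
        "timestamp_ms": (value >> 22) + TWITTER_EPOCH_MS,
        "datacenter": (value >> 17) & 0x1F,
        "worker": (value >> 12) & 0x1F,
        "sequence": value & 0xFFF,
    }
```

If the IDs really were time-ordered, decoded integers from videos uploaded close together should cluster in their high bits; uniformly scattered values would point back toward random assignment.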
For me it makes sense to aggregate data at the channel level. A large-scale crawl could follow "subscribed lists" and take the largest connected component of the resulting graph. A good place to start is Social Blade data. I crawled some (72M videos).
I have some ideas about what I would do differently if I was starting again! Especially now that I have finished a first research project with this data... Would be keen to share :)
Ditto! I've sampled channels the same way that I sampled YouTube videos, and used that to try and understand what's happening re: both samples
By the way, have you noticed oddities when using YouTube's Data API v3? For instance, asking to sort by view count doesn't always return what you would expect -- it is as if it isn't working off the full dataset, or is doing some optimizations that cause gaps in the data.
Right, I find the search endpoint to be generally unreliable -- search results differ from time to time, even with the same input, because of how their underlying system generates results for a given topic.
Yeah that makes proper data analysis next to impossible unless you are comfortable making a lot of assumptions which might not even be accurate. It is rather frustrating. I can't even depend on it to generate a report on the top 100 most viewed videos.
Also, the search endpoint consumes 100 quota units per search!
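To put that cost in perspective, here is a back-of-the-envelope sketch. The 100 units per search comes from the point above; the 10,000-unit daily allocation and the 50-results-per-page cap are assumptions based on the API's usual defaults, so substitute your own project's numbers.

```python
# Assumed defaults; check your own project's quota page.
DEFAULT_DAILY_QUOTA = 10_000   # quota units per day (assumption)
SEARCH_COST = 100              # units per search call (from the thread)
MAX_RESULTS_PER_PAGE = 50      # results per search page (assumption)


def searches_per_day(daily_quota: int = DEFAULT_DAILY_QUOTA,
                     cost: int = SEARCH_COST) -> int:
    """Number of search calls that fit in one day's quota."""
    return daily_quota // cost


def max_results_per_day(daily_quota: int = DEFAULT_DAILY_QUOTA,
                        cost: int = SEARCH_COST,
                        page_size: int = MAX_RESULTS_PER_PAGE) -> int:
    """Upper bound on search results retrievable in one day."""
    return searches_per_day(daily_quota, cost) * page_size


print(searches_per_day())      # 100
print(max_results_per_day())   # 5000
```

Under those assumptions a single project gets about 100 search calls a day, which is very little for any kind of systematic sampling.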


