Ditto on this sampling info! We've been working on different ways to sample YouTube over the last couple of years and never fully resolved how representative our samples were.
It is extremely hard to get concrete numbers on how many videos are uploaded per hour or per day. In fact, I can't even find a definitive source on how many total videos YouTube had up to a specific date.
We had the same issue! The closest we could ever get to an estimate was via this paper conferences.sigcomm.org/imc/2011/docs/
Thanks Megan! That is helpful. Also, did you see my tweet about the structure of YouTube IDs? I'm not entirely convinced that their IDs are truly random. If there were any structure to the IDs, that might give clues as to the level of activity for a time range.
That's interesting -- my assumption was that they're random. I think that's the assumption the aforementioned paper relies on as well... I think it's entirely possible that the IDs are randomly selected since the ID space is so large, but I have no clue if that's true...
In my previous tweet, I showed that the 11-character IDs are a slightly modified base64 representation of a 64-bit integer. We know Twitter uses the Snowflake algo to embed time data plus server/node and datacenter IDs, so I'm wondering if YouTube's ID scheme uses something similar.
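To make the base64 point concrete, here is a minimal Python sketch of the round trip between an 11-character ID and a 64-bit integer. It assumes the "slightly modified" alphabet is the URL-safe base64 one (`-` and `_` in place of `+` and `/`) and that the bytes are big-endian; both are assumptions, not anything YouTube documents, and the function names are made up for illustration.

```python
import base64


def video_id_to_int(video_id: str) -> int:
    """Decode an 11-character ID into an unsigned 64-bit integer.

    Assumes URL-safe base64 and big-endian byte order (both unconfirmed).
    """
    if len(video_id) != 11:
        raise ValueError("expected an 11-character ID")
    # 11 characters carry 66 bits; one '=' of padding makes a valid
    # 12-character base64 string that decodes to exactly 8 bytes.
    raw = base64.urlsafe_b64decode(video_id + "=")
    return int.from_bytes(raw, "big")


def int_to_video_id(value: int) -> str:
    """Encode an unsigned 64-bit integer back into an 11-character ID."""
    raw = value.to_bytes(8, "big")
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


if __name__ == "__main__":
    n = video_id_to_int("dQw4w9WgXcQ")
    print(n, int_to_video_id(n))
```

If the IDs embedded a timestamp the way Snowflake does, sorting decoded integers by upload date should show a trend in the high bits; if they look uniformly scattered, that would support the randomness assumption.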
For me it makes sense to aggregate data on a channel level. A large-scale crawl could use "subscribed lists" and get the largest connected component of the channel graph. A good place to start is Social Blade data. I crawled some (72M videos).
I have some ideas about what I would do differently if I was starting again! Especially now that I have finished a first research project with this data... Would be keen to share :)
Ditto! I've sampled channels the same way that I sampled YouTube videos, and used that to try to understand what's happening in both samples.
You should grab my dump once I get it up (probably in less than an hour)


