I am shaking with excitement right now. This paper (conferences.sigcomm.org/imc/2011/docs/) analyzed YouTube ids and found the distribution to be basically uniform when they looked at a generic sample of ids. I wrote a script to convert each YouTube string id to its int64 representation
and then converted the int64 to a string of 1s and 0s to examine each bit position. When I looked at a generic sample, I saw very little deviation from 0.5 in the fraction of ids with each bit set (meaning each bit had close to a 50% probability of being a 1 or a 0), which is what we would expect
if the distribution were uniform. However, I then took a different sample in which the publishedtime of every video fell on a specific second (the seconds value would end in 1, or some other fixed value), so the sample is correlated with particular timestamp values. When I ran the
distribution for each bit, I found bits far from 0.5, well outside what a uniform distribution would give, and those bits were always the same ones whenever the sample was restricted to specific timestamps.
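A minimal sketch of the per-bit check described above. It assumes an id is 11 base64url characters that decode (after adding one `=` of padding) to 8 bytes, read here as a big-endian unsigned 64-bit integer. The function names and the random stand-in sample are hypothetical; the actual script isn't shown in this thread.

```python
import base64
import os

def id_to_int64(video_id: str) -> int:
    """Decode an 11-character base64url id into an unsigned 64-bit integer.

    Assumptions: the id is unpadded base64url and the bytes are big-endian.
    """
    raw = base64.urlsafe_b64decode(video_id + "=")  # 11 chars + "=" -> 8 bytes
    return int.from_bytes(raw, "big")

def bit_frequencies(video_ids):
    """For each of the 64 bit positions (most significant first),
    return the fraction of ids that have that bit set."""
    counts = [0] * 64
    total = 0
    for vid in video_ids:
        bits = format(id_to_int64(vid), "064b")  # string of 1s and 0s
        for i, b in enumerate(bits):
            counts[i] += b == "1"
        total += 1
    return [c / total for c in counts]

def random_id() -> str:
    """Hypothetical stand-in for a real sample: 8 random bytes encoded like an id."""
    return base64.urlsafe_b64encode(os.urandom(8)).decode().rstrip("=")

if __name__ == "__main__":
    sample = [random_id() for _ in range(10_000)]
    freqs = bit_frequencies(sample)
    # For a uniform sample every position should sit close to 0.5.
    print(max(abs(f - 0.5) for f in freqs))
```

Running `bit_frequencies` once on a generic sample and once on a sample filtered by publish second, then comparing the two lists position by position, is one way to surface the bits that move away from 0.5.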
What this means is that YouTube ids appear to have structure that correlates with the published time. It
appears they took the binary data for the timestamp and placed its bits in specific positions, obfuscating it enough that the ids look random when you analyze a generic sample that isn't correlated with anything specific (like the published time). I need to get some
researchers to walk through what I'm doing to see if this is indeed what I suspected at first -- that YouTube ids are like Twitter's Snowflake algo but more obfuscated.
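To make the Snowflake comparison concrete, here is a small sketch. The `snowflake` function follows Twitter's published layout (41 timestamp bits, 10 machine bits, 12 sequence bits). Everything else is an assumption for illustration only: `scatter_bits` uses an arbitrary fixed permutation as a stand-in for "more obfuscated", and nothing here is known to be YouTube's actual scheme.

```python
import random

TWITTER_EPOCH_MS = 1288834974657  # Snowflake's custom epoch, in milliseconds
MASK64 = (1 << 64) - 1

def snowflake(timestamp_ms: int, machine_id: int, sequence: int) -> int:
    """Pack a Snowflake-style id: 41 timestamp bits, 10 machine bits, 12 sequence bits."""
    return (((timestamp_ms - TWITTER_EPOCH_MS) << 22)
            | ((machine_id & 0x3FF) << 12)
            | (sequence & 0xFFF))

# Hypothetical obfuscation: a fixed permutation of the 64 bit positions.
_PERMUTATION = list(range(64))
random.Random(42).shuffle(_PERMUTATION)

def scatter_bits(value: int) -> int:
    """Move bit i of `value` to position _PERMUTATION[i]."""
    out = 0
    for i in range(64):
        if (value >> i) & 1:
            out |= 1 << _PERMUTATION[i]
    return out

def constant_bit_mask(ids) -> int:
    """Mask of bit positions whose value is identical across all ids."""
    all_and, all_or = MASK64, 0
    for x in ids:
        all_and &= x
        all_or |= x
    return ~(all_and ^ all_or) & MASK64

def demo_ids(n=1000, timestamp_ms=1700000000000, seed=0):
    """Scattered ids that share one timestamp but differ in machine and sequence."""
    rng = random.Random(seed)
    return [scatter_bits(snowflake(timestamp_ms, rng.randrange(1024), rng.randrange(4096)))
            for _ in range(n)]

if __name__ == "__main__":
    mask = constant_bit_mask(demo_ids())
    print(format(mask, "064b"))  # 1s mark positions pinned by the shared timestamp
```

In this toy setup the scattered ids look unstructured across a mixed sample, yet a sample sharing one timestamp always freezes the same bit positions, which is the pattern described above.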
WOW -- this would be a HUGE finding because we can reduce the id space substantially and associate
other samples that are correlated with other parts of the video metadata and find a method to ping ids with a much greater chance of hitting ids that actually exist.
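A back-of-the-envelope sketch of what "reducing the id space" means: every bit position that can be pinned from known metadata halves the number of candidate ids. The pinned-bit counts below are illustrative values only, not measurements, and `remaining_candidates` is a hypothetical helper.

```python
def remaining_candidates(total_bits: int, pinned_bits: int) -> int:
    """Number of ids still possible once `pinned_bits` bit positions are known."""
    if not 0 <= pinned_bits <= total_bits:
        raise ValueError("pinned_bits must be between 0 and total_bits")
    return 1 << (total_bits - pinned_bits)

if __name__ == "__main__":
    for pinned in (0, 16, 32):  # illustrative values only, not measured
        print(pinned, remaining_candidates(64, pinned))
```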
Hi Jason,
It's awesome that you share your "inquiry" steps out there. Once you have some kind of report, I'd be really happy to give you feedback or simply replicate it (I've done the same analysis for TikTok IDs and could "only" reduce the space to 20k ids/second)

