I am shaking with excitement right now. This paper (conferences.sigcomm.org/imc/2011/docs/) analyzed YouTube ids and found the distribution to be basically normal in a generic sample of ids. I wrote a script to convert the YouTube string id to its int64 representation
and then converted the int64 to a string of 1s and 0s to examine each bit position. When I looked at a generic sample, I saw very little deviation from 0.5 for each bit (meaning each bit had close to a 50% probability of being a 1 or a 0), which we would expect
if the distribution were normal. However, I then took a different sample in which every video's published time had a specific seconds value (for example, seconds ending in 1), so that the sample's timestamps were correlated. When I ran the
distribution for each bit, I found bits that deviated far from 0.5, and the same bits stood out every time I used a sample with specific timestamps.
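A minimal sketch of the kind of script described above, not the original one. It assumes the 11-character id is URL-safe base64 for 8 bytes, reads those bytes as an unsigned big-endian integer (the byte order and signedness are assumptions), and reports the fraction of ids with each bit set. The function names are hypothetical.

```python
import base64

def youtube_id_to_int(video_id: str) -> int:
    """Decode an 11-character id into a 64-bit integer.

    Assumption: the id is URL-safe base64 (A-Z, a-z, 0-9, '-', '_') for
    8 bytes, interpreted here as unsigned big-endian.
    """
    if len(video_id) != 11:
        raise ValueError("expected an 11-character id")
    # 11 characters plus one '=' pad character decode to exactly 8 bytes.
    raw = base64.urlsafe_b64decode(video_id + "=")
    return int.from_bytes(raw, "big")

def to_bits(value: int) -> str:
    """Render the integer as a string of 64 ones and zeros."""
    return format(value, "064b")

def bit_frequencies(video_ids):
    """Return, for each of the 64 bit positions, the fraction of ids with a 1."""
    counts = [0] * 64
    total = 0
    for video_id in video_ids:
        for position, bit in enumerate(to_bits(youtube_id_to_int(video_id))):
            if bit == "1":
                counts[position] += 1
        total += 1
    if total == 0:
        raise ValueError("need at least one id")
    return [count / total for count in counts]
```

Running `bit_frequencies` once on a generic sample and once on a sample filtered by published time, then comparing the two lists position by position, would show which bits move away from 0.5.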
This suggests that YouTube ids have structure that correlates with the published time. It
appears they took the binary data for the timestamp and placed its bits in specific positions, obfuscating it enough that the ids appear random in a generic sample that isn't correlated with anything specific (like the published time). I need to get some
researchers to walk through what I'm doing to see if this is indeed what I suspected at first -- that YouTube ids are like Twitter's Snowflake algo but more obfuscated.
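For comparison, here is a short sketch of the unobfuscated Snowflake layout as Twitter published it: a 41-bit millisecond timestamp in the top bits, then 10 bits of machine id and 12 bits of sequence number. Nothing here describes how YouTube builds its ids; the function names are hypothetical and only illustrate what "timestamp bits inside the id" looks like in the plain case.

```python
TWITTER_EPOCH_MS = 1288834974657  # Snowflake custom epoch, in Unix milliseconds

def make_snowflake(timestamp_ms: int, machine_id: int, sequence: int) -> int:
    """Pack a timestamp, machine id and sequence number into one 64-bit id."""
    return (
        ((timestamp_ms - TWITTER_EPOCH_MS) << 22)
        | ((machine_id & 0x3FF) << 12)   # 10 bits
        | (sequence & 0xFFF)             # 12 bits
    )

def snowflake_parts(snowflake: int):
    """Unpack an id back into (timestamp_ms, machine_id, sequence)."""
    timestamp_ms = (snowflake >> 22) + TWITTER_EPOCH_MS
    machine_id = (snowflake >> 12) & 0x3FF
    sequence = snowflake & 0xFFF
    return timestamp_ms, machine_id, sequence
```

In this layout the timestamp occupies one contiguous run of high bits, so ids sort by creation time. The hypothesis in the thread is that the timestamp bits are instead scattered across the id.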
WOW -- this would be a HUGE finding because we can reduce the id space substantially and associate
other samples that are correlated with other parts of the video metadata, then find a method to ping ids with a much greater chance of hitting ids that actually exist.
