Tech companies working with AI, from big names like Google and Meta to upstarts like Stability AI, are outsourcing data collection to academic and nonprofit research groups, which shields the companies from potential accountability and legal liability.
Non-commercial or academic usage makes a stronger legal case for a "fair use" exemption to copyright law, while still allowing corporations to commercialize that research however they like.
It also shields companies from complaints from artists, photographers, video creators, or just ordinary people who find their work in these datasets without consent, shifting the responsibility of opt-outs and GDPR compliance onto researchers. laion.ai/gdpr/
But that's just opting out of the dataset: once a model has been trained on your data, the model can't easily unlearn it. (Not yet, anyway.) github.com/tamlhp/awesome
Digging into the datasets for these exciting new models is always eye-opening and I couldn't help but write about it. I mean, come on: where else would you find Meta using Microsoft using YouTube?
Quote Tweet
So in addition to a massive chunk of Shutterstock's video collection, Meta is also using YouTube videos collected by Microsoft to make its text-to-video AI! What a world.
Great question: Why did Meta use millions of videos from YouTube and Shutterstock for its text-to-video AI instead of videos from Instagram and Facebook? (Easier access for discovery and scraping? Concerns over user privacy?)
Quote Tweet
I'd really love to know the legal reasoning behind Meta not using its enormous pool of videos that people posted on its properties (Facebook and Instagram) and instead using much poorer quality (short, watermarked) videos nicked from Shutterstock. Any idea, @waxpancake ?
This is a good guess: YouTube and Shutterstock have better metadata than FB/Instagram. (YouTube is also speculated to be the source for the 680,000 hours of captioned audio used to train OpenAI’s new Whisper speech recognition model.)
Quote Tweet
Replying to @waxpancake
I would guess it’s simply because shutterstock and YT videos have proper titles and descriptions/tags/etc.
tell me you didn’t read the article without telling me you didn’t read the article