Watch carefully as GitHub PR tries to (re)define copyright and set a precedent that the licensing terms of open source code don't apply in this case...
Left: Text from June 29th.
Right: Edited text on July 1st.
See FAQ section at copilot.github.com
There was a mobile "face app" that had questionable terms of service, and the company used the data to train their models. In the ruling, they were ordered to delete the models.
I can't remember the company or app name, will post below when I find it...
Quote Tweet
Replying to @alexjc and @github
Do you have any examples of courts doing so? When I was looking for precedent a few years ago I couldn't find any.
Here it is. techcrunch.com/2021/01/12/ftc
GitHub is in a similar position because it's using data from users "without properly informing them what it was doing."
In short, "users who did not give express consent to such a use" applies here too since License terms would be broken.
It certainly is a different case. But just to refute the original premise: it's not generally considered fair use in the community.
It's a controversial topic and what we're seeing now is a coordinated PR campaign until GitHub can help set the standards.
Quote Tweet
Replying to @alexjc
I feel actively harvesting from individuals would get a different ruling to using data that has been published in the open (regardless of license).
But IANAL. 
This is a digital version of the Tragedy Of The Commons.
It's very different in practice as there's no scarcity digitally, but the impact is arguably bigger...
FWIW, even the Creative Commons website has been nudging people for years towards using licenses that can easily be exploited commercially.
You're not approved for "Free Culture" if you don't let multi-nationals profit from your work!
I think there's a small window of opportunity to ensure that new legislation doesn't benefit multi-nationals exclusively.
If this matter is left to sponsored Think Tanks, then regular people will suffer...
Quote Tweet
An obvious move to evade FTC, or a court, coming to the same conclusion as here? I am not a lawyer but there will be a lot of movement in data ethics and establishing legal foundations for AI in the near future. techcrunch.com/2021/01/12/ftc
They understand fully the importance of this.
They have a legal team working on it. They said they want to be a part of defining future standards. If that fails, they'll update their Terms Of Service.
Quote Tweet
Goes to show GitHub clearly does not understand the severity of the issue. twitter.com/alexjc/status/…
The law works differently than you think it does! If you have the money and the intention, *everything* can be challenged.
You ask your expensive lawyer: "I want to win this case and establish this precedent" and they'll find many options.
Quote Tweet
Replying to @alexjc
If they did, how'd this pass a regular legal audit?
Let's consider the fact there is no Fair Use directive in the EU, and in the US there are four factors that can be debated and argued until precedent is set in such cases.
Quote Tweet
Replying to @alexjc
While I agree with your general argument, fair use is a legal concept. The community's view may not be relevant.
When hosting code on GitHub, it would fall under the Terms Of Service of a traditional business relationship. We possibly have better protection & recourse (e.g. with EU Regulators or FTC, as above) than if repositories were hosted elsewhere... twitter.com/Donzanoid/stat
They say it's trained on public data, and I highly doubt they trained on private repositories.
GPT-3 and similar size models have been proven to learn samples verbatim, so it'd be a huge breach of contract if information leaked!
Quote Tweet
Replying to @alexjc
is there any evidence that github trained on private repos? that would certainly make it more similar to the Ever case.
This is the core of the issue.
GitHub's Terms and Conditions do not allow it to train on your code or exploit it commercially. See section D here: docs.github.com/en/github/site
Quote Tweet
Replying to @kcimc and @alexjc
are you saying that there are no similar enough precedents, or that the T&C do not explicitly allow for this, or that code licenses do not explicitly allow for this?
GitHub can "parse", but if you claim Deep Learning is just parsing, ML researchers will tell you it's extrapolation, and that's difficult.
GitHub can "store", but if you claim Deep Learning is just storing, someone will find a paper from 1986 to prove you wrong.
Since they trained on content hosted on GitHub that already falls under their T&C, when they call it "Fair Use" it's a way to circumvent the explicit license — which could weaken their argument legally and makes it very different from other cases.
Quote Tweet
Replying to @alexjc
i mean “public” in the way that github repos are public: they are accessible to the public, and most are licensed for liberal use by the public. i read github as saying that they see this situation as comparable to the data behind other popular datasets, and i would agree.
Facebook regularly trains AI models on photos from Instagram, but I presume they have the right to do anything because their T&C now probably cover everything their lawyers could think of ;-)
Thank you Dan! This is an important thing to do: please make a statement publicly about this or retweet someone else's.
Quote Tweet
Replying to @alexjc
As a member of the machine learning community, I also reject the claim that this application is fair use. The code may be publicly available, but it is carefully licensed. This is large-scale license laundering.
It's easy for machines to track licenses, and companies have a duty of care and are legally bound to respect such contracts, so they should do so!
Going to wrap up this thread...
TL;DR I'd also argue that GitHub is in a weak position on Fair Use, and an even weaker one because their T&C explicitly do not allow "selling" our code as part of a derived work like Copilot.
Reopening this thread with more insights on Copyright and GitHub Copilot. 🔓🖋️
There were so many great / interesting replies I want to track them in one place...
ATTRIBUTION
GitHub is able to track whether generated snippets match the database on a character-level, but it's going to require better matching for users to do correct attribution of code snippets.
Quote Tweet
Good thread summarizing the shady side of @github's copilot.
Open source doesn't (always) mean anybody can do anything with it. E.g. Happy Coding requires attribution- same as Stack Overflow.
Can GitHub use that source as training data? Are users of the tool copying code? twitter.com/alexjc/status/…
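To make the attribution point concrete, here is a minimal sketch of the difference between character-level matching and a looser match. It is not how GitHub does it; the function names, the toy corpus layout (source name mapped to a snippet-sized text) and the 0.8 threshold are all assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def normalize(code: str) -> str:
    """Collapse all whitespace runs so reformatting does not hide a match."""
    return " ".join(code.split())


def exact_match(snippet: str, corpus: Dict[str, str]) -> List[str]:
    """Character-level check: sources whose text contains the snippet verbatim."""
    return [name for name, text in corpus.items() if snippet in text]


def fuzzy_match(snippet: str, corpus: Dict[str, str], threshold: float = 0.8) -> List[str]:
    """Looser check: sources whose normalized text is similar to the snippet.

    Assumes each corpus entry is roughly snippet-sized; comparing a short
    snippet against a whole file would give a low ratio even on a real match.
    """
    target = normalize(snippet)
    hits = []
    for name, text in corpus.items():
        ratio = SequenceMatcher(None, target, normalize(text)).ratio()
        if ratio >= threshold:
            hits.append(name)
    return hits
```

A generated snippet that only differs in spacing slips past `exact_match` but is still caught by `fuzzy_match`, which is the kind of matching a user would need before they could credit the original author.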
LIABILITY
If the generated code has legal implications (e.g. incompatible license), it's the end-users that will be responsible.
Quote Tweet
Irrespective of whether it's fair use for MS, it's almost certainly not fair use for Copilot users. Copyright law does not care that code that has been copied verbatim or near-verbatim came via autocomplete.
Copilot users will get sued, not Microsoft
twitter.com/alexjc/status/
This came up many times!
One solution would be to redesign the product, likely training multiple models based on what licenses are compatible with the end-user's requirements.
Quote Tweet
Not a lawyer, but if I were GitHub I would be super cautious and only train on code that was explicitly under some sort of permissive license. Hard to see how this doesn’t cause huge complications if GPL, etc. code is included in training data. twitter.com/alexjc/status/…
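As a rough sketch of the "multiple models per license set" idea: the training set for each model could be selected by license before training. The `Repo` type, the function name and the two license sets below are hypothetical, and which SPDX identifiers are actually compatible with a given end-user is an assumption here, not a legal determination.

```python
from typing import List, NamedTuple, Optional, Set


class Repo(NamedTuple):
    name: str
    license: Optional[str]  # SPDX identifier, or None when no license is given


# Illustrative license sets only (assumed, not legally vetted).
PERMISSIVE_ONLY: Set[str] = {"MIT", "BSD-3-Clause", "Apache-2.0"}
COPYLEFT_OK: Set[str] = PERMISSIVE_ONLY | {"GPL-3.0-only"}


def select_training_repos(repos: List[Repo], allowed: Set[str]) -> List[str]:
    """Return the names of repos whose license is in the allowed set.

    Repos without a license are always excluded, since "publicly available"
    does not grant any rights by itself.
    """
    return [
        repo.name
        for repo in repos
        if repo.license is not None and repo.license in allowed
    ]
```

One model would then be trained per allowed set, and the end-user picks the model that matches the license requirements of their own project.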
UNLICENSED CODE
Assuming all public repositories were used for training, even those without a license: this seems to be yet another case for their legal team to handle! It's different than claiming the right to train on GPL or non-commercial licenses.
Quote Tweet
It’s also important to realize that “publicly available” is not equivalent to “open source”.
There’s ambiguity in not specifying a license, but copyright is often upheld. twitter.com/alexjc/status/…
NEW LICENSES
This also came up many times! Should we design licenses that explicitly specify whether ML models can or cannot be trained on the code?
(Right now GitHub's claim of Fair Use is side-stepping the licenses anyway.)
Quote Tweet
I wonder if the Copilot thing will lead to the creation of licences aimed against for-profit ML applications. twitter.com/alexjc/status/…
VERBATIM CONTENT
The fast inverse square root function is memorized as-is, with comments and license. If Copilot can be used as a search engine, maybe it should be treated as such legally?
Quote Tweet
I don't want to say anything but that's not the right license Mr Copilot.
DERIVATIVE WORKS
For databases, the index is considered a separate copyright from the content: opendatacommons.org/faq/licenses/
However, models like GPT-3 combine both content & index so the whole thing is a derivative work of the content.
twitter.com/_m_libby_/stat
ACCOUNTABILITY
GitHub / Microsoft is accountable in many ways, it just needs someone to actually *hold* them accountable — otherwise (by default) the company is expected to maximize shareholder value.
twitter.com/MaxALittle/sta
MIGRATING PLATFORMS
Code hosted on GitHub falls under their Terms & Conditions, which (as of now) do *not* allow selling your code or training on it.
In short, the legal options may actually be better by being on GitHub for now...
Quote Tweet
I couldn't possibly imagine a huge exodus from @github to @gitlab akin to the recent WhatsApp -> Signal migration. No one would do that, right? twitter.com/alexjc/status/…
LEGALESE
These cases are generally fascinating because there are so many different overlapping aspects! Here, for Ever (face database), it was apparently not about copyright but informed consent for private data.
Quote Tweet
There's a great short story in "AI ordered to be deleted because it knows copyrighted things". twitter.com/alexjc/status/…