The Epic GitHub Copilot Lawsuit Against Microsoft for Software Piracy, Explained
It's STILL the A.I. wild-wild west. How could such a cute mascot be a pirate in disguise, you ask?
Hey Guys,
Congrats to you, readers: nearly 8,000 of you now read A.I. Supremacy (link to web view). For all the immense effort, I'm still trying to find a business model that makes sense. Perhaps your vote can help steer me in the right direction:
This poll will give me valuable feedback on how to proceed, since I want to be as reader-centric as possible.
Check out my archives while they are still free; I've written a lot about A.I. topics so far in 2022. If I take on sponsored ads, they will enable me to offer more free posts.
As I approach my one-year anniversary on Substack, I'm open to offering sponsored ads on this newsletter. Anyway, enough of that administrative stuff. Let's dive into today's topic.
AI-driven coding tool might generate other people's code – who knew? Well, Redmond, for one
What’s copyright and piracy in the era of Generative A.I. always? Let’s just train this tool on everyone else’s data.
Yet Microsoft went ahead with it anyway after acquiring GitHub and using OpenAI’s abilities. It’s a pretty serious matter in terms of the kind of copyright lawsuits Generative A.I. is going to face. It’s software piracy at least on some level!
GitHub Copilot – a programming auto-suggestion tool trained from public source code on the internet – has been caught generating what appears to be copyrighted code, prompting an attorney to look into a possible copyright infringement claim.
While I believe that code generation and auto-complete enable software developers to be more productive, there is a cost. At a time when A.I. ethics, safety, and open-source accessibility are supposedly being prioritized, the business reality doesn't actually reflect it.
This lawsuit represents a growing concern from programmers, artists, and other people that AI systems may be using their code, artwork, and other data without permission.
On October 17th, 2022 Matthew Butterick, a lawyer, designer, and developer, announced he is working with Joseph Saveri Law Firm to investigate the possibility of filing a copyright claim against GitHub. There are two potential lines of attack here: is GitHub improperly training Copilot on open source code, and is the tool improperly emitting other people's copyrighted work – pulled from the training data – to suggest code snippets to users?
A lot of folks have been warning about this issue. In June 2022, Matt wrote about the legal problems with GitHub Copilot, in particular its mishandling of open-source licenses.
Microsoft needs to get its act together: while it profiteers on Generative A.I., we obviously need more A.I. ethics and rule of law here.
This is, after all, one of Microsoft's AI-as-a-Service subscriptions. GitHub Copilot is a cloud-based intelligent tool that analyzes existing code to suggest lines of code and entire functions in real time, directly within the editor. The extension is available in integrated development environments such as Visual Studio, Visual Studio Code, Neovim, and JetBrains IDEs. GitHub Copilot is available to all developers for $10/month or $100/year.
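To make the workflow concrete, here is a hypothetical illustration of the kind of completion such a tool offers: a developer types only a comment and a function signature, and the assistant fills in a plausible body. This is my own sketch, not actual Copilot output, and the function name is invented for the example.

```python
# A developer types the comment and signature; the rest of the body is
# what a Copilot-style assistant might auto-suggest.
# (Hypothetical illustration, not real Copilot output.)

def is_palindrome(text: str) -> bool:
    """Return True if text reads the same forwards and backwards,
    ignoring case and non-alphanumeric characters."""
    cleaned = [c.lower() for c in text if c.isalnum()]
    return cleaned == cleaned[::-1]
```

The legal question at the heart of the lawsuit is whether a suggestion like this could, in some cases, be a near-verbatim copy of licensed code from the training set.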
There is a host of rising competitors to GitHub Copilot, but it perhaps has the most swag so far. I can see Matt's point, though; Butterick has been critical of Copilot since its launch. In June he published a blog post arguing that "any code generated by Copilot may contain lurking license or IP violations," and thus should be avoided.
Everyone was afraid something like this would happen when Microsoft acquired GitHub for $7.5 billion back in 2018. Let's try to unpack Matt's position here. You see, according to OpenAI, Codex was trained on "tens of millions of public repositories," including code on GitHub. Microsoft has emphasized Copilot's ability to suggest larger blocks of code, like the entire body of a function.
It’s all very worrisome if you care about A.I. ethics at all. Microsoft itself has vaguely described the training material as “billions of lines of public code”. But Copilot researcher Eddie Aftandilian confirmed in a recent podcast (@ 36:40) that Copilot is “train[ed] on public repos on GitHub”.
Microsoft and OpenAI must be relying on a fair-use argument. In fact we know this is so, because former GitHub CEO Nat Friedman claimed during the Copilot technical preview that “training [machine-learning] systems on public data is fair use”.
Well—is it? The answer isn’t a matter of opinion; it’s a matter of law. Naturally, Microsoft, OpenAI, and other researchers have been promoting the fair-use argument. Nat Friedman further asserted that there is “jurisprudence” on fair use that is “broadly relied upon by the machine[-]learning community”. But Software Freedom Conservancy disagreed, and pressed Microsoft for evidence to support its position. According to SFC director Bradley Kuhn—
[W]e inquired privately with Friedman and other Microsoft and GitHub representatives in June 2021, asking for solid legal references for GitHub’s public legal positions … They provided none.
Why couldn’t Microsoft produce any legal authority for its position? Because SFC is correct: there isn’t any. Or so Claims Matt.
There’s clearly a rift in the community. That same month, Denver Gingerich and Bradley Kuhn of the Software Freedom Conservancy (SFC) said their organization would stop using GitHub, largely as a result of Microsoft and GitHub releasing Copilot without addressing concerns about how the machine-learning model dealt with different open source licensing requirements.
Some people aren’t necessarily happy with the heavy handed way BigTech is going about this. Training AI to auto-suggest code on what amounts to the work of others.
Copilot is powered by Codex, an AI system that was created by OpenAI and licensed to Microsoft. According to OpenAI, Codex was trained on “millions of public repositories” and is “an instance of transformative fair use.”
No Legal Precedent
The problem is that all of this Generative A.I. is entering grey areas where the profit motive means A.I. ethics, regulation, and the rule of law are totally absent.
For instance, even if a court ultimately rules that certain kinds of AI training are fair use, which seems possible, it may also rule that other kinds are not. As of today, we have no idea where Copilot or Codex sits on that spectrum. Neither does Microsoft nor OpenAI.
Since its launch, the developer community has heavily criticized Microsoft's GitHub Copilot over potential copyright violations. Microsoft, of course, has turned a blind eye.
It’s a trial for Generative A.I. and not just Microsoft as there are many related cases and instances of how this works. Microsoft, GitHub, and OpenAI are being sued for allegedly violating copyright law by reproducing open-source code using AI.
Microsoft, its subsidiary GitHub, and its business partner OpenAI are all basically complicit. No attribution, nothing.
Copilot's capacity to copy code verbatim, or nearly so, surfaced last week when Tim Davis, a professor of computer science and engineering at Texas A&M University, found that Copilot, when prompted, would reproduce his copyrighted sparse matrix transposition code.
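For readers unfamiliar with what was allegedly reproduced: sparse matrix transposition is a classic algorithm for flipping a matrix stored in compressed sparse row (CSR) form without materializing the full dense matrix. Davis's actual code is C, from his own libraries; the sketch below is my own minimal Python version of the standard counting-sort approach, included only to show what this kind of routine looks like. It is not the copyrighted code in question.

```python
def csr_transpose(n_rows, n_cols, indptr, indices, data):
    """Transpose a sparse matrix stored in CSR form.

    A minimal sketch of the classic counting-sort technique used by
    sparse libraries; the author's own code, not Davis's.
    """
    nnz = indptr[n_rows]
    t_indptr = [0] * (n_cols + 1)
    t_indices = [0] * nnz
    t_data = [0] * nnz

    # Count the entries in each column of the original matrix.
    for j in indices[:nnz]:
        t_indptr[j + 1] += 1
    # Prefix-sum the counts to get the transpose's row pointers.
    for j in range(n_cols):
        t_indptr[j + 1] += t_indptr[j]

    # Scatter each entry into its row in the transpose.
    next_slot = list(t_indptr[:n_cols])
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            dest = next_slot[j]
            t_indices[dest] = i
            t_data[dest] = data[k]
            next_slot[j] += 1
    return t_indptr, t_indices, t_data
```

The point of the dispute is not that this algorithm is secret, but that Copilot allegedly emitted Davis's specific copyrighted expression of it, stripped of its license and attribution.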
What’s the Big Deal?