“I love to learn new things and build things,” the algorithm wrote, when asked to generate an About Me page. “I have a <a href="https://github.com/davidcelis">Github</a> account.”
While the About Me page was supposedly generated for a fake person, that link goes to the GitHub profile of David Celis, who The Verge can confirm is not a figment of Copilot’s imagination. Celis is a coder and GitHub user with popular repositories, and even formerly worked at the company.
“I’m not surprised that my public repositories are a part of the training data for Copilot,” Celis told The Verge, adding that he was amused by the algorithm reciting his name. But while he doesn’t mind his name being spit out by an algorithm that parrots its training data, Celis is concerned at the copyright implications of GitHub scooping up any code it can find to better its AI.
When GitHub announced Copilot on June 29, the company said that the algorithm had been trained on publicly available code posted to GitHub. Nat Friedman, GitHub’s CEO, has written on forums like Hacker News and Twitter that the company is legally in the clear. “Training machine learning models on publicly available data is considered fair use across the machine learning community,” the Copilot page says.
But the legal question isn’t as settled as Friedman makes it sound — and the confusion reaches far beyond just GitHub. Artificial intelligence algorithms function only because of the massive amounts of data they analyze, and much of that data comes from the open internet. An easy example would be ImageNet, perhaps the most influential AI training dataset, which is entirely made up of publicly available images that ImageNet’s creators do not own. If a court were to say that using this easily accessible data isn’t legal, it could make training AI systems vastly more expensive and less transparent.
Despite GitHub’s assertion, there is no direct legal precedent in the US that upholds publicly available training data as fair use, according to Mark Lemley and Bryan Casey of Stanford Law School, who published a paper last year about AI datasets and fair use in the Texas Law Review.
That doesn’t mean they are against it: Lemley and Casey write that publicly available data should be considered fair use, for the betterment of algorithms and to conform to the norms of the machine learning community.
And there are past cases to support that opinion, they say. They consider the Google Books case, in which Google downloaded and indexed more than 20 million books to create a literary search database, to be similar to training an algorithm. The Supreme Court upheld Google’s fair use claim, on the grounds that the new tool was transformative of the original work and broadly beneficial to readers and authors.
“There is not controversy around the ability to put all that copyrighted material into a database for a machine to read it,” Casey says about the Google Books case. “What a machine then outputs is still blurry and going to be figured out.”
This means the details change when the algorithm then generates media of its own. Lemley and Casey argue in their paper that if an algorithm begins to generate songs in the style of Ariana Grande, or directly rip off a coder’s novel solution to a problem, the fair use designation gets much murkier.
Since this hasn’t been directly tested in court, no judge has yet been forced to decide how extractive the technology really is. If an AI algorithm turns copyrighted work into a profitable technology, it wouldn’t be out of the realm of possibility for a judge to decide that its creators should pay, or otherwise credit the authors, for what they take.
On the other hand, if a judge were to decide that GitHub’s style of training on publicly available code is fair use, it would quash any need for GitHub and OpenAI to cite the licenses of the coders who wrote its training data. For instance, Celis, whose GitHub profile Copilot linked to, says he uses the Creative Commons Attribution 3.0 Unported License, which requires attribution for derivative works.
“And I fall in the camp that believes Copilot’s generated code is absolutely derivative work,” he told The Verge.
Until this is decided in a court, however, there’s no clear ruling on whether this practice is legal.
“My hope is that people would be happy to have their code used for training,” Lemley says. “Not for it to show up verbatim in someone else’s work necessarily, but we’re all better off if we have better-trained AIs.”