But agreed that we're waiting for a court case to confirm that. Although really, the main questions in any court case won't be about the principle of fair use itself, or about whether training is transformative enough (it obviously is), but about the specifics:
1) Was the copyrighted material acquired legally? (Not applicable here.)
2) Does the LLM always produce a unique expression (i.e. not regurgitating books or libraries verbatim)?
And in this particular case, they confirmed that the new implementation is 98.7% unique.
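As an aside, the source doesn't say how that "98.7% unique" figure was measured. One common way such a number *could* be computed is as the share of word n-grams in the new text that never appear verbatim in the reference text. A minimal sketch (purely illustrative; the function names and the n-gram approach are my assumptions, not how any court or vendor actually measured it):

```python
# Hypothetical sketch: "percent unique" as the fraction of word n-grams
# in a generated text that do not appear verbatim in a reference corpus.
# This is an illustration only, not the methodology behind the 98.7% figure.

def ngrams(words, n):
    """Return the set of consecutive word n-grams as tuples."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def percent_unique(generated: str, reference: str, n: int = 5) -> float:
    """Percentage of n-grams in `generated` that are absent from `reference`."""
    gen = ngrams(generated.split(), n)
    ref = ngrams(reference.split(), n)
    if not gen:
        return 100.0  # too short to contain any n-gram; trivially "unique"
    return 100.0 * len(gen - ref) / len(gen)

reference = "the quick brown fox jumps over the lazy dog every day"
generated = "the quick brown fox jumps over a sleeping cat every day"
print(round(percent_unique(generated, reference, n=3), 1))  # → 55.6
```

Note that a metric like this is sensitive to the choice of `n`: small n-grams overlap by chance, while a large `n` lets substantial verbatim runs hide inside a high "percent unique" score.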
Training an LLM inherently requires making a copy of the work. Even the initial act of downloading it from the internet and loading it into memory for training is a copy governed by the work's license and by copyright law.
Some might hold that we've granted persons certain exemptions, on account of them being persons. We do not have to grant machines the same.
> In copyright terms, it's such an extreme transformative use that copyright no longer applies.
Has the model really performed an extreme transformation if it is able to produce the training data near-verbatim? Sure, it can also produce extremely transformed versions, but is that really relevant if it holds within it enough information for a (near-)verbatim reproduction?
BTW, in 2023 I watched ChatGPT spit out hundreds of lines of F# verbatim from my own GitHub repos. A lot of people had the same experience with GitHub Copilot. "98.7% unique" still leaves a lot of room for infringement.