If you’re not listening to the Latent Space podcast, you’re missing some of the best thinking on generative AI happening right now. The show notes for a recent episode begin,
Stop me if you’ve heard this before: “GPT3 was trained on the entire Internet”.
Blatantly, demonstrably untrue: the GPT3 dataset is a little over 600GB, primarily Wikipedia, Books corpuses, WebText and 2016-2019 CommonCrawl. The Macbook Air I am typing this on has more free disk space than that. In contrast, the “entire internet” is estimated to be 64 zettabytes, or 64 trillion GB. So it’s more accurate to say that GPT3 is trained on 0.0000000001% of the Internet.
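The arithmetic behind that last figure is easy to sanity-check. Here is a minimal Python sketch that uses only the two rough estimates from the quote (about 600 GB for the dataset and 64 trillion GB for the Internet); the variable names are just illustrative. Under those assumptions the ratio works out to roughly 0.000000001%, about ten times the quoted figure, but the point stands either way: it is a vanishingly small fraction.

```python
# Both figures are the rough estimates given in the quoted show notes.
GPT3_DATASET_GB = 600          # "a little over 600GB"
INTERNET_GB = 64 * 10**12      # 64 zettabytes, i.e. 64 trillion GB

# Share of the "entire internet" represented by the training set.
fraction = GPT3_DATASET_GB / INTERNET_GB
percent = fraction * 100

print(f"fraction of the Internet: {fraction:.1e}")
print(f"as a percentage:          {percent:.1e}%")
```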
I’ve been thinking about AI training data a lot recently, partly because of the fascinating intersections and overlaps between (1) curating and annotating training data for AI models and (2) designing instructional materials for human learners. But the topic of AI training data also intersects with another topic that’s near and dear to my heart: copyright. Recently there’s been a fairly heated conversation about whether AI should be allowed to train on copyrighted materials. People are even suing companies over their uses of copyrighted work as training data. It’s gotten me thinking – most of my professional training has been based on copyrighted materials…
During graduate school I read a lot. I’m sure others read more, but I read my fair share. And I continue to be a curious and energetic reader today. And the overwhelming majority of things I read are copyrighted. I don’t have a photographic memory, so I wouldn’t claim that my studying copyrighted materials results in a permanent copy of those materials being created in my brain. But my reading does create a local copy of those materials in my browser cache whenever I pull up an article. Of course, the browser cache is automatically cleared periodically, deleting the local copies of those copyrighted materials, so I don’t have a long-term copy of the copyrighted materials I read.
But reading those materials does create a long-term change in my understanding. That learning changes the things I say when I’m teaching class, or talking to colleagues, or writing essays or research articles. And that’s sort of the point, isn’t it? After I’ve read something relevant to a specific topic, the things I say on that topic should be slightly different in the future specifically because they’ll be influenced by what I learned through my study of those copyrighted materials.
Is that legal? Specifically, should I be able to teach and discuss and share and write and publish my “summaries” of (aka my understanding of) and my extensions of (new ideas I build on top of) the copyrighted things I’ve read? Emphatically, yes!! Ideas aren’t eligible for copyright protection – only a specific expression of an idea is eligible. So it’s not wrong for me to learn from copyrighted materials and then express the ideas contained in those materials using different language, or to evolve the ideas in those copyrighted works into new ideas.
However, in the hundreds or thousands of times I’ve talked about, say, instructional design in class, I dare say I’ve quoted copyrighted materials I had previously read – unknowingly; accidentally. If you explain the same idea enough times you’re eventually bound to explain it using the same language someone else has already used. Is that illegal? When I’m teaching, or reading, or writing, do I need to be constantly auditing my words to make sure I haven’t accidentally expressed an idea in exactly the same way someone else already has? (I hope not, because I don’t do that.)
A lot of people even wear their devotion to their idols like a very public badge of honor. To use a popular example from music, what about when someone like Michael Bublé cites Frank Sinatra as an influence, saying ‘I listened to him all the time growing up, thousands and thousands of hours’, and then sounds a lot like him when he sings? Or, to bring things a little closer to your academic home, you may know someone who has read a lot of Freire or Kristeva or Bakhtin or Habermas and deeply assimilated their ideas, and who sounds a lot like them when they talk or write. The works of those authors are still copyrighted. Is it ok for people to go around repeating the ideas they learned from studying those copyrighted works, sounding almost identical to their authors? (Yes it is!)
The question becomes, then, how is my learning, study, and professional training based on copyrighted materials different from a generative AI tool training on copyrighted materials? Here are some of the ways the training of humans and generative AI with copyrighted works are the same:
- The generative AI tool temporarily has access to copyrighted materials when it trains, just like I do when I read.
- Once it’s done “reading” the copyrighted materials, it no longer has access to them, just like me after I close the browser.
- In the process of reading those copyrighted materials, the generative AI updates its model weights, kind of like the way my understanding of the world changes when I study.
- After the model trains on some copyrighted materials, the things it “says” in the future will be different from the things it would have said prior to that additional training (or “finetuning” as it’s known in the generative AI lingo).
- If a generative AI is asked about any given topic a large enough number of times, at some point it’s likely to express an idea in a way very similar to the way that idea was expressed in one of the copyrighted works it trained on.
What are some of the ways they are different? Are those differences significant enough that it should be illegal for generative AI to train on copyrighted material?