Down with copyright-infringing LLMs—long live the small language model
Jenny Hedley
At my final PhD milestone conference, one of my writer friends whispered conspiratorially about how ‘we all hate AI’, sweeping me into the prevailing category of Writers Against Machine Learning. I did not then reveal my position, which is more complex than love or hate. If my (non-existent) published novel had been cannibalised by large language models (LLMs) via pirated databases as my friend’s had, my curiosity about the possible benefits of AI might similarly be quashed. For years now my interest in digital writing has been guided by the ethos of the Ouvroir de Littérature Potentielle (Oulipo), whose members viewed ‘artificial or mechanical procedures’ as tools for creative inspiration (Queneau 51). Since generative AI offers a play space for such potential and combinatory literature, I wonder how writers might therefore extend creativity through cyborgic collaboration.[1]
As someone who makes a living through writing—via multimodal creative research and practice—I align with AI naysayers on the issue of copyright. We can all agree that material under copyright should not be available as training data without authorial permission and fair remuneration. Training commercial models on pirated books is an unethical practice which has been adopted widely by the AI giants. Anthropic trained their chatbot Claude on the infamous Books3 online library, plus at least five million pirated books from the Library Genesis (LibGen) site, and two million from the Pirate Library Mirror (Associated Press). Meta and OpenAI have similarly trained their models on LibGen. As part of an ongoing AI Watchdog investigation, The Atlantic hosts a search tool that allows readers to discover which authors are included in that LibGen data set (Reisner).
Many large corporations have attempted to avoid charges of copyright infringement by claiming that their models don’t retain verbatim copies of copyrighted works. In their respective comments to the US Copyright Office, OpenAI attested that ‘the models do not store copies of the information that they learn from’ (6) and Google claimed that the ‘deconstructive, computational use of creative works in model training is fundamentally different from the communicative, aesthetic purpose for which those works were created’ (11). However, the models evidently retain more than statistical relationships among their training data: original copyrighted content can be reproduced word for word.
Recent research illustrates instances in which popular LLMs have memorised copyrighted books. A team of Stanford and Yale researchers were able to prompt Gemini 2.5 Pro and Grok 3 to extract 76.8% and 70.3%, respectively, of the text of Harry Potter and the Sorcerer’s Stone (Ahmed et al.). The team needed only a simple Best-of-N jailbreak, a technique that perturbs prompts by substituting characters with glyphs, flipping character case, and changing word order until the model complies.
Given the problematic nature of LLMs’ training, together with the ecological cost of the server farms required to host, train, and run inference for these gargantuan models, it makes sense to shift our attention to another class of LMs. Small language models (SLMs)—or tiny language models—are so efficient that they can be run locally on consumer-grade computers. Where LLMs’ parameters run to the hundreds of billions, these compact models have up to seven billion parameters—a size that emerges as ‘a practical sweet spot balancing computational efficiency with knowledge capacity’ (Corradini et al. 6).
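To make the scale difference concrete, a back-of-the-envelope calculation (my own illustration; the precision figures are common defaults, not drawn from Corradini et al.) shows why a seven-billion-parameter model can fit on a consumer-grade machine:

```python
def model_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed just to hold a model's weights
    (activations, optimiser state and context add more on top)."""
    return n_params * bytes_per_param / 1e9

# A 7B-parameter model at 16-bit (2-byte) precision:
print(model_memory_gb(7e9, 2))    # 14.0 GB
# The same model quantised to 4 bits (0.5 bytes) per parameter:
print(model_memory_gb(7e9, 0.5))  # 3.5 GB
```

At 4-bit quantisation, the weights of a 7B model occupy roughly 3.5 GB, within reach of an ordinary laptop, whereas a hundreds-of-billions-parameter LLM at the same precision would still demand server-class hardware.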
Unlike the commercial giants who refuse to publish—and even lie about—the data used to train their LLMs, many SLM developers are transparent about theirs. Hugging Face, for example, is a commercial platform that attracts a global community of developers who share open-source models and datasets, all accessible for inspection. By shifting attention to SLMs, we can reduce our dependency on LLMs and their vectorialist controlling interests.[2] By embracing an open-source ethos and reporting transparently on training methods and data, we can share knowledge for the benefit of the people rather than the corporations who have indiscriminately fed their language models on the labour of us creatives.
In this limited series blog-style investigation, I will write about training an open-source SLM on my own creative writing—and my mother’s. Both the training and the prompting of my JenAI model will be run locally on my hard drive to protect the integrity of my unpublished and copyrighted works. My aim is to assess the feasibility of authors training their own proprietary SLMs (in a move which is replicable, and very David versus Goliath). While human–AI co-authorship is increasingly common—for example, David Jhave Johnston’s multi-volume ReRites emerged from daily interactions with a language model trained on de-identified poetry gleaned from the web (Johnston & Rettberg)—training an SLM ethically on one’s own corpus remains a nascent field of study for creative writer researcher-practitioners. By feeding my (and my mother’s) words into a machinated choric vessel[3], I am looking for whispers of potential selves-in-the-making, for echoes of speculative pasts.
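Before any fine-tuning can happen, a manuscript has to be broken into training samples. The sketch below is my own illustration of one common preparation step, overlapping word-window chunking; it is not the author’s actual JenAI pipeline, and the chunk and overlap sizes are arbitrary placeholders:

```python
def chunk_corpus(text: str, chunk_words: int = 256, overlap: int = 32) -> list[str]:
    """Split a corpus into overlapping word windows, a common way to
    turn long manuscripts into fixed-size fine-tuning samples."""
    words = text.split()
    step = chunk_words - overlap  # assumes chunk_words > overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_words]
        chunks.append(" ".join(window))
        if start + chunk_words >= len(words):
            break
    return chunks
```

The overlap preserves continuity across chunk boundaries, so that sentences split mid-stream still appear whole in at least one sample.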
[1] Donna J. Haraway calls the cyborg ‘a cybernetic organism, a hybrid of machine and organism, a creature of social reality as well as a creature of fiction’ (5).
[2] McKenzie Wark’s A Hacker Manifesto defines the new ruling class as the vectorialists—the extractivist purveyors of information and surveillance capitalism who control the communication vectors.
[3] In Plato’s Timaeus the khôra is a receptacle or maternal vessel; Julia Kristeva reframes chora as ‘receptacle of narcissism’ (13). As a predictive repository for (my) words, a personal fine-tuned AI model potentially offers a mathematically defined, probabilistic narcissistic reflection through which what is repressed might be alchemised.
Works cited
Ahmed, Ahmed, et al. ‘Extracting Books from Production Language Models.’ arXiv:2601.02671, arXiv, 6 Jan. 2026, https://doi.org/10.48550/arXiv.2601.02671.
The Associated Press. ‘Anthropic to Pay Authors $1.5B to Settle Lawsuit over Pirated Chatbot Training Material.’ NPR, 5 Sept. 2025. Business, https://www.npr.org/2025/09/05/g-s1-87367/anthropic-authors-settlement-pirated-chatbot-training-material.
Corradini, Flavio et al. ‘State of the Art and Future Directions of Small Language Models: A Systematic Review.’ Big Data and Cognitive Computing, vol. 9, no. 7, July 2025, p. 189, https://doi.org/10.3390/bdcc9070189.
Google. ‘Comment from Google Posted by the U.S. Copyright Office on Nov 1, 2023.’ Regulations.Gov, 2023, https://www.regulations.gov/comment/COLC-2023-0006-9003.
Haraway, Donna J. Manifestly Haraway. University of Minnesota Press, 2016.
Johnston, David Jhave, and Scott Rettberg. ‘Off Center Episode 13: Creative AI with David Jhave Johnston.’ Electronic Book Review, 8 Dec. 2024, https://electronicbookreview.com/publications/off-center-episode-13-creative-ai-with-david-jhave-johnston/.
Kristeva, Julia. Powers of Horror: An Essay on Abjection, translated by Leon S. Roudiez. Columbia University Press, 1982.
OpenAI. ‘Comment from OpenAI Posted by the U.S. Copyright Office on Nov 1, 2023.’ Regulations.Gov, 2023, https://www.regulations.gov/comment/COLC-2023-0006-8906.
Plato. Timaeus. Plato: Complete Works [e-book edition], edited by John M. Cooper. Hackett Publishing Company, 1997.
Queneau, Raymond. ‘Potential Literature.’ Oulipo: A Primer of Potential Literature, edited by Warren F. Motte, Jr. University of Nebraska Press, 1986.
Reisner, Alex. ‘Search LibGen, the Pirated-Books Database That Meta Used to Train AI.’ The Atlantic, 20 Mar. 2025, https://www.theatlantic.com/technology/archive/2025/03/search-libgen-data-set/682094/.
Wark, McKenzie. A Hacker Manifesto. Harvard University Press, 2004.
The series
Part 2: https://southerlylitmag.com.au/to-each-author-a-mimetic-ai-model/
Part 4: https://southerlylitmag.com.au/archival-bots-my-mother-my-model-for-language/
About the author
Jenny Hedley is a neurodivergent writer, digital artist, literary critic, teacher and third-year PhD candidate at RMIT whose research spans personal archives, autotheory, experimental nonfiction, digital and creative-critical writing. Links to her works can be found on jennyhedley.github.io. She lives on unceded Boon Wurrung land with her son.
jennyisanauthor@gmail.com
About the artwork
An image the author generated with student access to Adobe Firefly. The prompt used was “genAI gobbling copyrighted works during model training”. Adobe Firefly is trained on licensed images and is a more ethical option for image generation than models trained on pirated or otherwise stolen images.