What Is AI Training Data — and Was Your Art in It?
The images, music, and writing that trained the world's most popular AI systems came from somewhere. Here is how to find out if your work was included — and what you can do about it.

Every AI image generator, music model, and writing tool was trained on data. That data came from the internet — billions of images, audio files, and text documents scraped from websites, portfolios, social media platforms, and online archives.
If you have posted creative work online at any point in the last decade, it is likely that some of it was included in a training dataset. This is not speculation; it is a documented consequence of how these systems were built.
Here is what you can find out, what it means legally, and what options exist.
The Major Datasets
Most publicly documented AI image models were trained largely on internet-scraped images. LAION-5B is the largest openly documented dataset: roughly 5.85 billion image-text pairs (strictly, image URLs paired with their alt-text captions) assembled by filtering Common Crawl's web archive.
Stable Diffusion and several other open-source models were trained on subsets of LAION. Midjourney and DALL-E have not fully disclosed their training data, but researchers and journalists have identified overlapping sources.
For music, datasets like Free Music Archive, MagnaTagATune, and various scraped audio collections have been used. Voice cloning models have been trained on YouTube recordings, podcast audio, and other publicly available speech.
For text, Common Crawl — a regularly updated archive of the web — is a foundational dataset for most large language models.
How to Check If Your Work Was Included
For images: Have I Been Trained (haveibeentrained.com) allows you to search the LAION-5B dataset by image or artist name. Upload an image or enter your name and it will return matching or similar images from the dataset. This is the most direct tool currently available for checking image training data.
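If you prefer to query the LAION index programmatically, the open-source clip-retrieval client offers the same kind of CLIP-based similarity search that tools like Have I Been Trained are built around. A minimal sketch, assuming the project's public endpoint and index name are still being served (both have been intermittently offline since LAION withdrew LAION-5B for safety review); the query text and image filename below are placeholders:

```python
# pip install clip-retrieval
from clip_retrieval.clip_client import ClipClient

# Public LAION-5B index as documented by the clip-retrieval project.
client = ClipClient(
    url="https://knn.laion.ai/knn-service",
    indice_name="laion5B-L-14",
    num_images=20,
)

# Search by text, e.g. your name as it might appear in scraped captions...
for hit in client.query(text="Jane Doe illustration"):
    print(hit["similarity"], hit["url"], hit.get("caption"))

# ...or by an image file, to surface visually similar dataset entries.
for hit in client.query(image="my_artwork.jpg"):
    print(hit["similarity"], hit["url"])
```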
Note that LAION-5B is not the only training dataset. Finding (or not finding) your work there doesn't tell you whether it appeared in proprietary datasets used by Midjourney, Adobe Firefly, or other commercial models.
For music and audio: No equivalent of Have I Been Trained exists yet for audio. You can search for your releases on known music training dataset registries, but coverage is incomplete. The emerging standard is to opt out proactively rather than search reactively.
For writing: Common Crawl's archives are searchable, but at that scale locating specific works is difficult. If your writing has been widely published online, assume it was included in training data for major language models.
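That said, there is one practical check: Common Crawl publishes a per-crawl index (the CDX API) that you can query for your own URLs. A minimal sketch, using a hypothetical domain and an example crawl ID (current crawl IDs are listed at index.commoncrawl.org):

```python
import json
import requests

CRAWL = "CC-MAIN-2024-10"  # example crawl ID; substitute a current one
PATTERN = "yourdomain.com/essays/*"  # hypothetical site and path

resp = requests.get(
    f"https://index.commoncrawl.org/{CRAWL}-index",
    params={"url": PATTERN, "output": "json"},
    timeout=30,
)
if resp.ok:
    for line in resp.text.splitlines():
        record = json.loads(line)  # one JSON object per captured page
        print(record["timestamp"], record["url"], record["status"])
else:
    print("No captures found in this crawl.")  # the API returns 404 for no matches
```

A hit means your pages were captured; it does not tell you whether they survived the filtering applied to derived corpora such as C4. But absence across several recent crawls is a reasonable signal in the other direction.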
What Being in Training Data Actually Means
This is where the legal question gets complicated.
Being in a training dataset does not automatically mean your rights were violated. The question is whether the act of training — copying your work to process it — constitutes copyright infringement, or whether it qualifies as fair use.
AI companies have argued the latter. Rights holders, in several ongoing lawsuits, argue the former. Courts have not settled the question.
What is clearer: if a model's output reproduces substantial elements of a specific copyrighted work, that output can infringe regardless of how the training itself is classified. And if a model is used to generate content that competes directly with the market for the original work (a key factor in fair use analysis), the fair use argument weakens.
Opting Out
For future training, several opt-out mechanisms exist:
Spawning's opt-out tool at spawning.ai covers several major datasets and AI platforms that have agreed to honor opt-out requests. It's the most widely supported mechanism currently available.
robots.txt rules. AI crawlers identify themselves with published user-agent strings, such as OpenAI's GPTBot and Google's Google-Extended token, and robots.txt entries can instruct them not to use your site. Compliance is voluntary and varies by company, but major players including OpenAI have said they honor these rules.
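As a concrete illustration, here is what that looks like in a site's robots.txt. The user-agent strings below are the ones these companies have published; the list changes, so verify it against current documentation:

```
# Ask AI-training crawlers to stay out (compliance is voluntary).

User-agent: GPTBot            # OpenAI's training crawler
Disallow: /

User-agent: Google-Extended   # Google's AI-training control token
Disallow: /

User-agent: CCBot             # Common Crawl's crawler
Disallow: /
```

Note that blocking CCBot is a blunt instrument: it keeps your pages out of future Common Crawl snapshots, which feed many training corpora, but also out of everything else built on Common Crawl.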
Platform-specific opt-outs. Adobe, Shutterstock, and other platforms that use contributor content for AI training have added opt-out mechanisms to their contributor settings. If your work is distributed through platforms, check their AI usage policies and opt out where available.
Watermarking. Embedding technical signals in your work can, in principle, flag it for exclusion from training pipelines that check for such signals. This is emerging rather than established, but worth implementing as a layer.
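One lightweight example of such a signal is the "noai" robots meta tag popularized by DeviantArt. It is a convention rather than a standard, and scrapers are free to ignore it, though some dataset-building tools (img2dataset, used for LAION-style datasets, among them) respect it by default:

```html
<!-- Emerging, non-binding convention: ask scrapers not to use this page for AI -->
<meta name="robots" content="noai, noimageai">
```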
The Honest Limitation
Opt-out mechanisms apply to future training. They don't remove your work from models that have already been built. The images, audio, and text that trained existing models are already incorporated into their weights — opting out now doesn't change that.
This is why forward-looking protection matters more than retroactive cleanup. Documenting your creative process, establishing verifiable provenance, and certifying your work as human-made creates a record that matters regardless of what happened to your existing work in historical training data.
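Content Credentials, the C2PA standard, are one concrete way to do this today. The sketch below uses the open-source c2patool CLI to attach a signed provenance manifest that also carries the spec's training-mining ("do not train") assertion. Field names follow the published C2PA spec at the time of writing and should be verified against the current version; without a configured certificate, c2patool signs with a built-in test credential:

```sh
# manifest.json: a provenance claim plus a C2PA training-mining assertion
cat > manifest.json <<'EOF'
{
  "claim_generator": "my-studio/1.0",
  "assertions": [
    {
      "label": "c2pa.training-mining",
      "data": {
        "entries": {
          "c2pa.ai_generative_training": { "use": "notAllowed" },
          "c2pa.ai_training": { "use": "notAllowed" },
          "c2pa.data_mining": { "use": "notAllowed" }
        }
      }
    }
  ]
}
EOF

# Embed and sign the manifest into a copy of the work
c2patool artwork.jpg -m manifest.json -o artwork_signed.jpg
```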
You can't un-train the models. You can establish that your future work — and your documented past work — carries a verified human signature that no model output can legitimately claim.