
What Is AI Training Data — and Was Your Art in It?

The images, music, and writing that trained the world's most popular AI systems came from somewhere. Here is how to find out if your work was included — and what you can do about it.

Humartz Editorial

Every AI image generator, music model, and writing tool was trained on data. That data came from the internet — billions of images, audio files, and text documents scraped from websites, portfolios, social media platforms, and online archives.

If you have posted creative work online at any point in the last decade, it is likely that some of it was included in a training dataset. This is not speculation; it is simply how these systems were built.

Here is what you can find out, what it means legally, and what options exist.

The Major Datasets

Most publicly known AI image models were trained on datasets that include a large portion of internet-scraped images. LAION-5B is the largest publicly documented one — a dataset of approximately 5 billion image-text pairs assembled by scraping the web.

Stable Diffusion and several other open-source models were trained on subsets of LAION. Midjourney and DALL-E have not fully disclosed their training data, but researchers and journalists have identified overlapping sources.

For music, datasets like Free Music Archive, MagnaTagATune, and various scraped audio collections have been used. Voice cloning models have been trained on YouTube recordings, podcast audio, and other publicly available speech.

For text, Common Crawl — a regularly updated archive of the web — is a foundational dataset for most large language models.

How to Check If Your Work Was Included

For images: Have I Been Trained (haveibeentrained.com) allows you to search the LAION-5B dataset by image or artist name. Upload an image or enter your name and it will return matching or similar images from the dataset. This is the most direct tool currently available for checking image training data.

Note that LAION-5B is not the only training dataset. Finding (or not finding) your work there doesn't tell you whether it appeared in proprietary datasets used by Midjourney, Adobe Firefly, or other commercial models.
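Under the hood, the simplest version of this check is a URL match: LAION-style metadata is distributed as rows of image URL plus caption, so you can scan any slice of it for links pointing at your own domain. The sketch below runs that check on a tiny hypothetical sample (the URLs and captions are made up; real metadata slices are parquet files with billions of rows, and tools like Have I Been Trained add image-similarity search on top of this).

```python
# Sketch: find rows in LAION-style (URL, caption) metadata whose image URL
# points at your own domain. Sample rows below are hypothetical.
from urllib.parse import urlparse

def rows_from_domain(rows, domain):
    """Return (url, caption) pairs whose URL host is `domain` or a subdomain of it."""
    matches = []
    for url, caption in rows:
        host = urlparse(url).netloc.lower()
        if host == domain or host.endswith("." + domain):
            matches.append((url, caption))
    return matches

sample = [  # hypothetical metadata rows
    ("https://images.example-portfolio.com/works/sunset.jpg", "oil painting of a sunset"),
    ("https://cdn.othersite.net/photo123.jpg", "stock photo of a beach"),
]

hits = rows_from_domain(sample, "example-portfolio.com")
for url, caption in hits:
    print(url, "->", caption)
```

A plain URL match only catches images scraped directly from your site; it misses reposts and mirrors, which is why similarity-based search still matters.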

For music and audio: No equivalent of Have I Been Trained exists yet for audio. You can search for your releases on known music training dataset registries, but coverage is incomplete. The emerging standard is to opt out proactively rather than search reactively.

For writing: Common Crawl archives are searchable, but the scale makes searching for specific works difficult. If your writing has been widely published online, assume it was included in training data for major language models.
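One practical way in is Common Crawl's public CDX index, which you can query per domain to see which of your pages were captured in a given crawl. The sketch below builds a real-format query URL against index.commoncrawl.org and parses the API's one-JSON-object-per-line response; the crawl label and the sample response body are illustrative, not live data.

```python
# Sketch: query Common Crawl's CDX index for captures of your own domain.
# The endpoint format is real (index.commoncrawl.org); the crawl label and
# the sample response below are illustrative.
import json
from urllib.parse import urlencode

def cdx_query_url(domain, crawl="CC-MAIN-2024-10"):
    """Build a CDX index query URL for every capture under `domain`."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"https://index.commoncrawl.org/{crawl}-index?{params}"

def parse_cdx_response(text):
    """The CDX API returns one JSON object per line; keep URL and timestamp."""
    captures = []
    for line in text.strip().splitlines():
        record = json.loads(line)
        captures.append((record["url"], record["timestamp"]))
    return captures

print(cdx_query_url("example-portfolio.com"))

# Illustrative response body (two captures):
sample_response = (
    '{"url": "https://example-portfolio.com/essays/one", "timestamp": "20240215093011"}\n'
    '{"url": "https://example-portfolio.com/essays/two", "timestamp": "20240216110502"}\n'
)
print(parse_cdx_response(sample_response))
```

A capture in Common Crawl does not prove a specific model trained on that page, but since Common Crawl feeds most large text corpora, it is a strong signal.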

What Being in Training Data Actually Means

This is where the legal question gets complicated.

Being in a training dataset does not automatically mean your rights were violated. The question is whether the act of training — copying your work to process it — constitutes copyright infringement, or whether it qualifies as fair use.

AI companies have argued the latter. Rights holders, in several ongoing lawsuits, argue the former. Courts have not fully settled the question.

What is clearer: if a model's output reproduces substantial elements of a specific copyrighted work, that output infringes regardless of how the training was classified. And if an AI model is being used to generate content that competes directly with the market for the original work — a key factor in fair use analysis — the fair use argument weakens.

Opting Out

For future training, several opt-out mechanisms exist:

Spawning's opt-out tool at spawning.ai covers several major datasets and AI platforms that have agreed to honor opt-out requests. It's the most widely supported mechanism currently available.

robots.txt directives, specifically user-agent rules for AI crawlers such as OpenAI's GPTBot and Common Crawl's CCBot, can instruct AI scrapers not to crawl your site. Compliance is voluntary and varies by company, but major players including OpenAI have said they honor these rules.
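A minimal robots.txt along these lines looks like the following (GPTBot and CCBot are the crawler tokens OpenAI and Common Crawl have documented; other companies' crawlers use their own tokens, so check each one's documentation):

```
# Block OpenAI's training crawler site-wide
User-agent: GPTBot
Disallow: /

# Block Common Crawl's crawler, which feeds many text training corpora
User-agent: CCBot
Disallow: /

# Ordinary search crawlers are unaffected by the rules above
User-agent: *
Allow: /
```

Place the file at the root of your domain (yoursite.com/robots.txt); rules are applied per user-agent, so blocking AI crawlers does not affect your search visibility.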

Platform-specific opt-outs. Adobe, Shutterstock, and other platforms that use contributor content for AI training have added opt-out mechanisms to their contributor settings. If your work is distributed through platforms, check their AI usage policies and opt out where available.

Watermarking. Embedding technical signals in your work can, in principle, flag it for exclusion from training pipelines that check for such signals. This is emerging rather than established, but worth implementing as a layer.

The Honest Limitation

Opt-out mechanisms apply to future training. They don't remove your work from models that have already been built. The images, audio, and text that trained existing models are already incorporated into their weights — opting out now doesn't change that.

This is why forward-looking protection matters more than retroactive cleanup. Documenting your creative process, establishing verifiable provenance, and certifying your work as human-made creates a record that matters regardless of what happened to your existing work in historical training data.

You can't un-train the models. You can establish that your future work — and your documented past work — carries a verified human signature that no model output can legitimately claim.


Protect your creative legacy

Don't let your work disappear into the noise. Get a verified human badge that holds up legally and commercially.