License the work.
Keep the models honest.

AITrainingMart connects the people who made the internet with the companies training on it - under real licenses, with real payouts, and a real audit trail. The Amazon for AI training content.

$6.3B
AI training data market by 2028
CAGR 22%
70+
Active AI copyright lawsuits
US + EU, 2024-26
$1.5B
Anthropic class-action settlement
Largest to date

AI is being built on a legal fault line.

The largest models were trained by scraping first and asking never. The bill is now coming due - in courtrooms, in creator revenue, and in the slow enclosure of the open web.

For AI labs

Existential legal exposure

70+ active suits. Discovery obligations. Training pipelines subpoenaed. One bad ruling can force a full model retrain.

For creators

Work scraped, nothing earned

Decades of writing, photos, code and music ingested without consent, credit, or a dollar in return.

For the web

Robots.txt is breaking the internet

Open publishing collapses as every site walls off. Everyone loses the commons that made AI possible.

One licensed marketplace.
Both sides whole.

Creators list what they own. AI companies buy what they need. Every transaction is a signed license, a verifiable receipt, and a royalty stream that keeps paying as the model keeps earning.

Creators (list & license): Photos · Articles · Code · Audio
AITM (rights engine): Watermark audit · Provenance chain · Smart contracts · Royalty ledger
AI companies (train & deploy): Pre-train corpora · SFT datasets · RLHF prefs · Eval sets
01
Signed licenses
Every purchase creates a cryptographically signed usage contract.
02
Provenance built-in
C2PA + on-chain manifest. Defensible audit trail in any jurisdiction.
03
Royalties that compound
Creators keep earning when their data trains downstream models.
04
One integration
Dataset delivery to S3, GCS, R2, or your training cluster via API.
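
To make "cryptographically signed usage contract" concrete, here is a minimal sketch of signing and verifying a license record. It is an illustration under stated assumptions, not AITM's actual scheme: the field names, the demo key, and the choice of HMAC-SHA256 are all hypothetical. A production system would more likely use an asymmetric signature so that buyers can verify without holding the secret.

```python
import hashlib
import hmac
import json

# Hypothetical demo key. HMAC is a symmetric stand-in here; a real system
# would likely use an asymmetric signature scheme instead.
SECRET_KEY = b"demo-key-not-for-production"


def canonical_bytes(license_terms: dict) -> bytes:
    """Serialize terms deterministically so equal terms give equal bytes."""
    return json.dumps(
        license_terms, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")


def sign_license(license_terms: dict, key: bytes = SECRET_KEY) -> str:
    """Return a hex HMAC-SHA256 tag over the canonical license terms."""
    return hmac.new(key, canonical_bytes(license_terms), hashlib.sha256).hexdigest()


def verify_license(license_terms: dict, signature: str, key: bytes = SECRET_KEY) -> bool:
    """Compare tags in constant time; any change to the terms fails."""
    return hmac.compare_digest(sign_license(license_terms, key), signature)


# Hypothetical license record; the field names are illustrative only.
terms = {
    "dataset_id": "ds-001",
    "licensee": "example-lab",
    "use": "fine-tune",
    "exclusive": False,
}
signature = sign_license(terms)
tampered = {**terms, "use": "pre-train"}

print(verify_license(terms, signature))     # True
print(verify_license(tampered, signature))  # False
```

Sorting keys before hashing matters: the same terms must always serialize to the same bytes, or a valid license would fail verification.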

Four products.
One end-to-end rights stack.

From the moment a creator uploads a file to the moment a model deploys in production - every step has an owner, a receipt, and a royalty.

01 · FLAGSHIP
★ All-in-one platform

Core Marketplace

The exchange. List a dataset, discover a corpus, sign a license, settle a payout - all in one flow. Granular by modality, rights window, and exclusivity.

At a glance
Modalities: Text · Image · Audio · Code · Video
Pricing: Fixed, auction, or royalty
Settlement: ACH · Wire · USDC
02 · MODULE

PreprocessX

Turn raw archives into training-grade corpora. Dedup, PII strip, language ID, toxicity filter, doc-quality scoring.

Throughput: 4.2B tok/hr
Filters: 42 pre-built
Output: JSONL · Parquet
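
As a rough picture of what two of those filters might do, the sketch below strips email addresses (one narrow kind of PII) and drops exact duplicates before emitting JSONL. It is a toy under stated assumptions, not PreprocessX itself: the function names, the `[EMAIL]` placeholder, and the record fields are hypothetical, and real pipelines typically add near-duplicate detection and much broader PII coverage.

```python
import hashlib
import json
import re

# Simple email pattern; real PII stripping covers far more than emails.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def strip_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(text.lower().split())


def preprocess(docs):
    """Yield cleaned records, skipping exact duplicates after normalization."""
    seen = set()
    for doc in docs:
        cleaned = strip_emails(doc)
        digest = hashlib.sha256(normalize(cleaned).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield {"text": cleaned, "sha256": digest}


raw = [
    "Contact me at jane@example.com for the photos.",
    "contact me at  JANE@example.com for the photos.",
    "A second, unrelated article.",
]
records = list(preprocess(raw))
jsonl = "\n".join(json.dumps(r) for r in records)
print(len(records))  # 2
print(jsonl)
```

The second input differs from the first only in case and spacing, so it hashes to the same digest and is dropped.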
03 · MODULE

HAIperTuneX

Fine-tune on licensed data without standing up the infra. Point at a model card, pick a corpus, hit run.

Engine: customLLM runtime
Modes: Instruct · preference · eval
Audit: Per-sample lineage
04 · API

Data Pipeline API

The programmatic spine. Stream licensed corpora directly into your training cluster. Rights-scoped, usage-logged.

Latency: p50 38ms
Protocols: REST · gRPC
SLA: 99.95%

Built for both sides of the table.

For creators

Get paid when your work teaches a machine.

01
One-click listing
Upload a folder, an RSS feed, a GitHub repo. We handle the format.
02
You set the terms
Exclusive or non-exclusive. Pre-train or fine-tune only. Flat fee or royalty.
03
Compounding income
Earn every time a model derived from your work ships an update.
For AI companies

Train without the courtroom overhead.

01
Defensible corpora
Every token has a signed license. Discovery becomes a query.
02
Sourced on-spec
Request-for-data auctions: describe the gap, get curated bids.
03
Coverage you can price
Indemnity wrapper available - turn legal risk into a line item.
For developers

Build on the pipeline.

01
Programmatic access
REST + gRPC endpoints. Stream licensed corpora straight into your training job or notebook.
02
JSON / JSONL / Parquet
Structured output, schema-pinned. Drop into HuggingFace datasets with one line.
03
Pay-per-token pricing
Metered by the million tokens. No seats, no commits - scale from notebook to cluster.
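
A minimal sketch of how per-million-token metering could work over a JSONL payload. Everything specific here is an assumption for illustration: the price is hypothetical (none is published above), the record fields are invented, and whitespace splitting stands in for whatever tokenizer a real meter would use.

```python
import json

# Hypothetical rate in USD per million tokens.
PRICE_PER_MILLION_TOKENS_USD = 2.00


def count_tokens(text: str) -> int:
    """Whitespace token count, a stand-in for a real tokenizer."""
    return len(text.split())


def metered_cost(total_tokens: int,
                 price_per_million: float = PRICE_PER_MILLION_TOKENS_USD) -> float:
    """Cost scales linearly with tokens consumed; no seats or minimums."""
    return total_tokens / 1_000_000 * price_per_million


# Hypothetical JSONL payload; field names are illustrative only.
jsonl_payload = "\n".join([
    json.dumps({"text": "licensed sample one", "license_id": "lic-1"}),
    json.dumps({"text": "a second licensed sample", "license_id": "lic-2"}),
])

records = [json.loads(line) for line in jsonl_payload.splitlines() if line.strip()]
total = sum(count_tokens(r["text"]) for r in records)

print(total)                    # 7
print(metered_cost(1_500_000))  # 3.0
```

Because the cost is a pure function of tokens consumed, the same call works for a notebook-sized sample and a cluster-sized run.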

A category forming in real time.

$4.2B
AI TRAINING DATA TAM, 2026

The data that trains frontier models was worth roughly nothing a decade ago. By 2033 it will be a $16B market growing at a 22% CAGR, and that's only the clean, licensed slice. AITM is the exchange infrastructure that makes that slice legible, tradable, and auditable.

Licensed training-data TAM · USD B (▲ CAGR 22.0%)
2026: $4.2B
2027: $5.1B
2028: $6.3B
2029: $7.6B
2030: $9.3B
2031: $11.4B
2032: $13.9B
2033: $16B

2.4M · Creators earning nothing on scraped works today
64% · Of top-100 sites now block AI crawlers
$1.1B · Published licensing deals in 2025 alone
3 of 4 · Frontier labs under active discovery

The next era of AI will be built on consented data.
Get on the list.

Be first to access AITrainingMart - for creators and AI companies. Private beta opens Q3 2026.
