Protect Your PDFs from AI Data Scraping
How PDF Optimizer Supports TDMRep Metadata
If your organization publishes PDFs and you are concerned about AI models training on your content without permission, you now have a way to state your position at the document level, automatically, as part of your optimization workflow.
PDF Optimizer now supports TDMRep, the W3C's Text and Data Mining Reservation Protocol. This means that when you optimize a PDF for size, performance, or archival compliance, you can simultaneously embed a machine-readable statement of your AI data rights into the document's metadata. One processing pass. Both outcomes.
The Problem: AI Models Scrape PDFs
AI language and multimodal models are trained on large corpora of text and document data scraped from the web. PDFs are a significant source of that training data: research papers, legal documents, financial reports, technical manuals, policy documents, and published books are all regularly ingested by automated data collection pipelines.
In many cases, this happens without the knowledge of the content owner and without any mechanism for the owner to signal whether they consent to that use. Standard access controls and robots.txt directives were not designed with AI training data collection in mind, and they are inconsistently observed by data collection systems.
For publishers, enterprises, and institutions that publish PDFs publicly or semi-publicly, the risk is real: proprietary research, confidential methodologies, or commercially valuable content may be entering AI training pipelines without authorization.
What Is TDMRep?
TDMRep, which stands for Text and Data Mining Reservation Protocol, is a specification published by a W3C Community Group that gives content owners a machine-readable way to express their text and data mining permissions. In a PDF, it operates at the metadata level, embedding the rights statement directly in the document so that it travels with the PDF wherever it goes.
The protocol consists of two elements. The first is a boolean reservation flag: setting it to true signals that text and data mining rights are reserved, so the content owner has not granted permission for uses such as AI training. The second is an optional policy URL pointing to a page where parties who wish to request TDM access can find licensing information, contact details, or policy terms.
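As a rough illustration of these two elements, the sketch below models them in Python and renders them in the HTTP header form that TDMRep also defines, where the flag is written as 1 or 0. The class name, field names, and validation rule are illustrative assumptions, not part of PDF Optimizer or the specification.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class TdmDeclaration:
    """Hypothetical container for the two TDMRep elements."""
    reserved: bool                      # the reservation flag
    policy_url: Optional[str] = None    # optional pointer to licensing terms

    def __post_init__(self):
        # Assumed rule for this sketch: a policy must be an absolute web URL.
        if self.policy_url is not None:
            parts = urlparse(self.policy_url)
            if parts.scheme not in ("http", "https") or not parts.netloc:
                raise ValueError("policy_url must be an absolute http(s) URL")

    def as_http_headers(self) -> dict:
        """Render as TDMRep HTTP response headers (flag written as 1 or 0)."""
        headers = {"tdm-reservation": "1" if self.reserved else "0"}
        # A policy is only emitted alongside a reservation in this sketch.
        if self.reserved and self.policy_url:
            headers["tdm-policy"] = self.policy_url
        return headers


declaration = TdmDeclaration(True, "https://example.com/tdm-policy")
print(declaration.as_http_headers())
```

The policy URL is dropped when rights are not reserved, on the assumption that a licensing pointer only matters once a reservation has been made.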
TDMRep was designed with Article 4 of the EU's Copyright in the Digital Single Market (CDSM) Directive in mind. That article lets rightsholders reserve text and data mining rights through machine-readable means, which gives a TDMRep reservation legal relevance in European jurisdictions. The protocol is supported by major European academic publishers and is gaining adoption among rights-conscious content organizations globally.
What Is New: TDMRep in PDF Optimizer
PDF Optimizer now supports adding TDMRep metadata to the XMP metadata block of any PDF processed through the optimizer. When configured in your JSON profile, PDF Optimizer embeds the tdm-reservation boolean and, optionally, a tdm-policy URL into the document's XMP metadata during the optimization pass.
This means:
• No additional tool or processing step is required. The rights metadata is embedded as part of the same optimization workflow that handles compression, color conversion, and PDF/A archiving.
• The metadata is applied consistently to every document processed with a profile that includes the TDMRep settings. One profile change protects every document that passes through your pipeline going forward.
• The rights statement travels with the PDF. Wherever the document is distributed, downloaded, or redistributed, the machine-readable TDMRep metadata is present.
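One way to spot-check that the metadata is present after processing is to parse a document's XMP packet and look for the two properties. The sketch below uses only the Python standard library. The sample packet, the tdm namespace URI, and the property names reservation and policy are assumptions about how the values are serialized, so compare them against the XMP of an actual output file before relying on this.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for the TDMRep properties; verify against a real file.
TDM_NS = "http://www.w3.org/ns/tdmrep#"

# Hand-written sample packet for illustration, not actual optimizer output.
SAMPLE_XMP = f"""<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:tdm="{TDM_NS}">
      <tdm:reservation>1</tdm:reservation>
      <tdm:policy>https://example.com/tdm-policy</tdm:policy>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def read_tdm_metadata(xmp_xml: str, ns: str = TDM_NS) -> dict:
    """Return the reservation flag and policy URL found in an XMP packet.

    Only properties written as child elements are detected; a packet that
    stores them as attributes on rdf:Description would need extra handling.
    """
    root = ET.fromstring(xmp_xml)
    reservation = root.find(f".//{{{ns}}}reservation")
    policy = root.find(f".//{{{ns}}}policy")
    reserved = (
        reservation is not None
        and (reservation.text or "").strip().lower() in ("1", "true")
    )
    policy_url = policy.text.strip() if policy is not None and policy.text else None
    return {"reserved": reserved, "policy_url": policy_url}


print(read_tdm_metadata(SAMPLE_XMP))
```

Extracting the XMP packet from the PDF itself is left out here; any PDF library that exposes the document's metadata stream as XML can supply the input string.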
How to Configure TDMRep in Your PDF Optimizer Profile
Decide first on your TDM policy. Do you want to reserve all rights, effectively signaling that AI training use is not permitted? Do you want to allow TDM under certain conditions, such as for non-commercial academic research? Do you want to point to a licensing page where parties can request access?
Once your policy is defined, add the TDMRep settings to your JSON profile alongside your existing optimization settings. Set tdm-reservation to true to reserve rights, and optionally provide a tdm-policy URL pointing to your licensing or policy page. PDF Optimizer will embed these values into the XMP metadata of every document processed with that profile.
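A profile fragment along the following lines illustrates the idea. Only the tdm-reservation and tdm-policy names come from the description above; the surrounding keys, their nesting, and the example URL are placeholders, so check the PDF Optimizer profile reference for the actual schema.

```python
import json

# Placeholder for settings already in your profile; these key names are
# hypothetical and stand in for whatever your profile contains today.
existing_profile = {
    "compression": {"images": "medium"},
}

# The two TDMRep settings named in this article. Where they belong in the
# real profile schema is an assumption; they are shown at the top level.
tdm_settings = {
    "tdm-reservation": True,
    "tdm-policy": "https://example.com/tdm-policy",
}

# Merge the TDMRep settings into the existing optimization settings.
profile = {**existing_profile, **tdm_settings}

profile_json = json.dumps(profile, indent=2)
print(profile_json)
```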
For organizations distributing many types of documents with different rights profiles, you can maintain separate profiles: one for publicly accessible content where TDM is permitted, and one for proprietary or restricted content where rights are reserved.
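The two-profile approach could be organized as in the sketch below, which derives a reserved and an open variant from shared optimization settings and picks one per document. The category labels, the shared settings, and the example URL are hypothetical; only the tdm-reservation and tdm-policy names come from this article.

```python
import json

# Hypothetical shared optimization settings.
BASE_SETTINGS = {"compression": {"images": "medium"}}

PROFILES = {
    # Proprietary or restricted content: rights reserved, with a policy page.
    "restricted": {
        **BASE_SETTINGS,
        "tdm-reservation": True,
        "tdm-policy": "https://example.com/tdm-policy",
    },
    # Publicly accessible content where TDM is permitted: no reservation.
    "public": {**BASE_SETTINGS, "tdm-reservation": False},
}


def profile_for(category: str) -> dict:
    """Pick a profile by category, defaulting to the reserved one."""
    return PROFILES.get(category, PROFILES["restricted"])


for name in ("public", "restricted", "unlabeled"):
    print(name, json.dumps(profile_for(name)))
```

Falling back to the restricted profile means a document with a missing or unknown label is processed with rights reserved rather than with TDM permitted.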
Why This Matters Now
The legal and regulatory landscape around AI training data is shifting quickly. Courts in the United States and Europe are actively examining whether scraping copyrighted content for AI training constitutes infringement. Regulatory guidance is emerging in the EU under the DSM Directive. In jurisdictions where TDMRep has legal standing, an embedded reservation signal may be meaningful in disputes over unauthorized use.
Beyond the legal dimension, embedding TDMRep metadata is a signal of responsible content stewardship. It communicates clearly to AI developers and researchers what your organization's position is on the use of your content, without requiring them to find and read a terms-of-service page.
For organizations that are already optimizing PDFs in a production pipeline, adding TDMRep support is a low-cost addition to an existing workflow. For organizations that are not yet optimizing their PDFs, TDMRep is one more reason to start.
For technical implementation details on using TDMRep in PDF Optimizer, see the companion developer post, Expressing Text and Data Mining Rights with PDF Optimizer + TDMRep, which covers XMP metadata configuration, JSON-LD policy documents, and ODRL best practices for developers implementing the standard.
Protect your documents from AI scraping with a free trial of PDF Optimizer.
Frequently Asked Questions
What is TDMRep?
TDMRep, or Text and Data Mining Reservation Protocol, is a specification from a W3C Community Group that provides a machine-readable way for content owners to signal whether text and data mining of their content, including for AI training, is permitted. It embeds a reservation flag and optional policy URL directly into a document's metadata, so the rights statement travels with the document wherever it is distributed.
How do I prevent AI from scraping my PDF content?
Embed TDMRep metadata with tdm-reservation set to true in your PDFs. PDF Optimizer can do this automatically during the optimization pass when configured in your JSON profile. The result is a machine-readable signal in the document's XMP metadata that compliant AI systems and data collection pipelines are expected to observe. The signal declares your rights rather than technically blocking access, so its effect depends on collectors checking for it and honoring it.