Protect Your PDFs from AI Data Scraping: TDMRep and PDF Optimizer

Published April 27, 2026

AI systems are trained on data scraped from the web, and PDFs are a primary source. Research papers, legal briefs, financial reports, technical manuals, industry publications: if your organization publishes PDFs, there is a reasonable likelihood that some of them have already been ingested by AI training pipelines. In most cases, this happened without your knowledge and without your permission.

For many organizations, this is now a material concern. Content owners, publishers, legal teams, and compliance officers are asking the same question: is there a technical mechanism to signal that our content is not available for AI training? The answer, increasingly, is yes. A W3C Community Group has published a specification for exactly this purpose. And PDF Optimizer now supports embedding it automatically.

What Is TDMRep?

TDMRep, which stands for Text and Data Mining Reservation Protocol, is a specification published by a W3C Community Group that provides a machine-readable way for content owners to express their permissions regarding text and data mining. The protocol was developed in response to the growing practice of AI training data collection. For PDFs, it operates at the metadata level: rather than relying on access controls or legal notices that AI scraping systems may not observe, the permission statement is embedded directly in the document.

A TDMRep implementation consists of two metadata values. The first is a boolean reservation flag: a value of true means the content owner is reserving their TDM rights and does not grant permission for AI training. The second is an optional policy URL that points to a licensing page where organizations can indicate the conditions under which TDM access might be available, whether that means a licensing agreement, an academic exception, or another arrangement.
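To make the two values concrete, here is a minimal Python sketch of what the corresponding RDF fragment inside a PDF's XMP packet might look like. It rests on assumptions that should be checked against the TDMRep specification: that the namespace URI is http://www.w3.org/ns/tdmrep#, that the properties are named tdm:reservation and tdm:policy, and that the flag is serialized as 1 or 0 rather than true or false. The function name build_tdm_description is hypothetical, and this is not a description of how PDF Optimizer writes the metadata internally.

```python
import xml.etree.ElementTree as ET
from typing import Optional

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumed TDMRep namespace URI; verify against the specification before use.
TDM_NS = "http://www.w3.org/ns/tdmrep#"

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("tdm", TDM_NS)


def build_tdm_description(reserved: bool, policy_url: Optional[str] = None) -> str:
    """Return an RDF fragment carrying the two TDMRep values (illustrative only)."""
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": ""}
    )
    # Reservation flag: assumed to be serialized as "1" (reserved) or "0".
    ET.SubElement(desc, f"{{{TDM_NS}}}reservation").text = "1" if reserved else "0"
    # The policy URL is optional, so the element is only written when provided.
    if policy_url:
        ET.SubElement(desc, f"{{{TDM_NS}}}policy").text = policy_url
    return ET.tostring(rdf, encoding="unicode")


if __name__ == "__main__":
    # example.com is a placeholder for a real licensing page.
    print(build_tdm_description(True, "https://example.com/tdm-policy"))
```

In a real PDF this fragment would sit inside the document's XMP packet alongside the other metadata schemas; the sketch only shows the TDMRep part.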

When TDMRep metadata is present in a PDF, compliant AI systems and responsible data collection pipelines are expected to read it and honor the stated permissions. The standard is supported by major European publishers and is referenced in the EU's implementation guidance for the Copyright Directive's text and data mining provisions.
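As an illustration of what "reading it" could involve on the consuming side, the following Python sketch scans a PDF's raw bytes for an XMP packet and extracts the reservation flag and policy URL. It is a simplified sketch under several assumptions: the namespace URI http://www.w3.org/ns/tdmrep# and the property names are taken as given, the XMP packet is assumed to be stored uncompressed and with the values written as child elements, and the function name read_tdm_reservation is hypothetical. A production pipeline would use a proper PDF library rather than a byte scan.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Assumed TDMRep namespace URI; verify against the specification before use.
TDM_NS = "http://www.w3.org/ns/tdmrep#"
# Matches an uncompressed XMP packet body embedded in the PDF bytes.
XMP_RE = re.compile(rb"<x:xmpmeta.*?</x:xmpmeta>", re.DOTALL)


def read_tdm_reservation(pdf_bytes: bytes) -> Optional[dict]:
    """Return {'reserved': bool, 'policy': str or None}, or None if no TDMRep data is found."""
    for match in XMP_RE.finditer(pdf_bytes):
        try:
            root = ET.fromstring(match.group(0))
        except ET.ParseError:
            continue  # Skip packets that are not well-formed XML.
        reservation = root.find(f".//{{{TDM_NS}}}reservation")
        if reservation is None or reservation.text is None:
            continue  # Element form only; attribute-form properties are not handled.
        policy = root.find(f".//{{{TDM_NS}}}policy")
        policy_url = policy.text.strip() if policy is not None and policy.text else None
        return {"reserved": reservation.text.strip() == "1", "policy": policy_url}
    return None
```

A crawler built along these lines would skip, or route to a licensing check, any document for which the function returns a result with reserved set to True. A return value of None means only that no TDMRep metadata was found by this simple scan, not that permission has been granted.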

Why This Matters Now

The legal landscape around AI training data is in flux. Courts in the United States and Europe are actively adjudicating the question of whether scraping copyrighted content for AI training constitutes infringement. Regulatory guidance is evolving. In this environment, embedding a clear machine-readable rights statement in your documents is a reasonable precaution, and in some jurisdictions it may be legally significant.

For publishers and media organizations, expressing TDM rights is directly relevant to licensing negotiations with AI companies. For enterprises publishing technical documentation, legal filings, or proprietary research, the concern is more about competitive exposure: you do not want your proprietary content training a competitor's AI model.

The challenge until now has been operational. Adding TDMRep metadata to every PDF in your inventory, and to every PDF produced by your systems going forward, requires touching each document at the metadata level. For organizations managing thousands or millions of PDFs, doing this by hand has not been practical.

How PDF Optimizer Embeds TDMRep Metadata

PDF Optimizer now supports TDMRep metadata as a configurable option in the optimization workflow. When enabled in your JSON profile, PDF Optimizer embeds the TDMRep reservation flag and, optionally, your policy URL into the XMP metadata of each processed PDF during the optimization pass.

This means the operation is automatic and does not require any additional tooling or processing steps. If you are already running PDFs through an optimization pipeline, adding TDMRep protection is a matter of adding the relevant settings to your existing JSON profile. Every document that passes through the optimizer receives the metadata. Every document that goes out carries a machine-readable statement of your rights.

For organizations that do not yet have an optimization pipeline, this creates a compelling reason to implement one: you get file size reduction, PDF/A archival compliance if needed, and AI rights protection in a single automated workflow.

What a TDMRep Profile Configuration Looks Like

In your PDF Optimizer JSON profile, the TDMRep settings are straightforward. You specify a reservation value of true to indicate that text and data mining rights are reserved, and you provide your policy URL if you have a published licensing page. These settings sit alongside your compression and optimization settings in the same profile file.
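The exact key names are defined in the PDF Optimizer developer documentation and are not reproduced here. Purely as an illustration of the shape described above, the Python sketch below builds a profile in which a TDMRep block sits next to compression settings and prints it as JSON. Every key name in it (tdmrep, reservation, policyUrl, compression, imageQuality) is a hypothetical placeholder, as is the helper tdmrep_settings; substitute the documented names before using anything like this.

```python
import json
from typing import Optional
from urllib.parse import urlparse


def tdmrep_settings(reserved: bool, policy_url: Optional[str] = None) -> dict:
    """Build a TDMRep settings block. All key names here are hypothetical."""
    settings = {"reservation": bool(reserved)}
    if policy_url is not None:
        parts = urlparse(policy_url)
        # Reject relative or malformed URLs early, before any PDF is processed.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(
                f"policy URL must be an absolute http(s) URL: {policy_url!r}"
            )
        settings["policyUrl"] = policy_url
    return settings


# An existing profile with a placeholder compression setting (illustrative name).
profile = {"compression": {"imageQuality": "medium"}}
# The TDMRep block is added alongside it; example.com stands in for a real policy page.
profile["tdmrep"] = tdmrep_settings(True, "https://example.com/tdm-policy")

if __name__ == "__main__":
    print(json.dumps(profile, indent=2))
```

Validating the policy URL when the profile is generated is a small design choice that pays off at scale: a typo caught here does not end up embedded in thousands of documents.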

Once the profile is configured, the metadata is embedded automatically for every document processed with that profile. You do not need to handle it document by document. The metadata travels with the PDF wherever it goes, whether to a public website, a document portal, a shared repository, or a third-party distribution channel.

What TDMRep Does and Does Not Do

It is worth being clear about the scope of what TDMRep provides. It is a rights signal, not an access control mechanism. It does not encrypt your PDFs or prevent them from being opened. It does not technically block a scraping system from reading the document's content.

What it does is create a clear, machine-readable statement of your permissions that responsible AI developers and compliant data collection pipelines are expected to honor. In jurisdictions where TDM rights have legal standing, such as the European Union under the Copyright Directive, the presence of a TDMRep reservation may have legal weight in disputes over unauthorized use.

For organizations concerned about the most aggressive data collection practices, TDMRep should be viewed as one layer of a broader content protection approach that may also include access controls, watermarking, and legal terms of use. But as a lightweight, scalable, and automated measure that requires no per-document effort once configured, it is a sound addition to any content publishing workflow.

Getting Started

A PDF Optimizer free trial is available for Windows and Linux.

If your organization publishes PDFs and has not yet considered how TDMRep fits into your workflow, the conversation is worth having now. The standard is gaining adoption among major publishers, regulatory guidance is pointing toward machine-readable rights signals as a preferred mechanism, and the operational cost of implementing it through PDF Optimizer is minimal.

For technical teams evaluating the integration, the developer documentation covers the full XMP metadata configuration and the expected output format. For teams with questions about AI data rights strategy, set up a meeting with one of our engineers.