PDF Feature Exploration: Marked Content

This month marks a quarter-century of PDF! Since June 1993, PDF has provided a rich and powerful format for portable document creation. Between its introduction and its publication as an open, worldwide ISO standard in 2008, PDF’s features and capabilities grew rapidly. The past decade of relative stability has allowed the use of PDF, and the range of available PDF tools, to proliferate. Others have looked back on the history and ubiquity of PDF; The Register and Motherboard have both covered it well. Much has been said about the fundamentals and foundations of PDF as a page description language. But PDF is so much more than just markings on a page. Today, I’m going to take you on a journey through an underappreciated but foundational capability in PDF files: marked content.

Exploring marked content

What is marked content? At its simplest, marked content allows a series of drawing operations to be grouped together into a collection. For example, the words of a sentence or the lines that make a shape can be thought of as a single collection of content. These collections are denoted in PDF with marked content sequences. Marked content sequences have been a part of PDF for much of its lifetime. Introduced with PDF 1.2 in 1996, marked content was added so that Acrobat plugins could associate proprietary information with portions of pages. This was the beginning of adding structure to the page description model on which PDF was founded.
Marked content has a simple representation in PDF files. A single point in a page content stream can be marked with a tag using the MP operator. Likewise, a run of drawing operators can be bracketed by the BMC (Begin Marked Content) and EMC (End Marked Content) operators, with a tag to denote the role or significance of the group, like so:

/Collection_1 BMC
  % PDF content stream operators: text or path drawing operations
  % to collect up into one entity
EMC
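
To make the bracketing concrete, here is a minimal Python sketch that pulls BMC ... EMC sequences out of a content stream held as text. It is not taken from any PDF library: the names strip_comments and marked_sequences are my own, and the tokenizer is deliberately naive. It splits on whitespace and assumes that neither '%' nor whitespace appears inside a string literal, which real content streams do not guarantee.

```python
def strip_comments(stream):
    """Drop PDF comments: everything from '%' to the end of each line.

    Assumption: '%' never occurs inside a string literal.
    """
    return "\n".join(line.split("%", 1)[0] for line in stream.splitlines())


def marked_sequences(stream):
    """Return a list of (tag, tokens_inside) for each BMC ... EMC sequence.

    Sequences are reported in the order their EMC is reached, so a nested
    sequence appears before the one that encloses it.
    Assumption: BMC and EMC operators are balanced.
    """
    tokens = strip_comments(stream).split()
    open_seqs = []  # stack of (tag, index of the first token inside)
    found = []
    for i, tok in enumerate(tokens):
        if tok == "BMC":
            # The operand just before BMC is the tag name.
            open_seqs.append((tokens[i - 1], i + 1))
        elif tok == "EMC":
            tag, start = open_seqs.pop()
            found.append((tag, tokens[start:i]))
    return found


sample = """
/Collection_1 BMC
  0 0 m          % start a path at the origin
  100 0 l        % add a line to (100, 0)
  S              % stroke the path
EMC
"""

for tag, inside in marked_sequences(sample):
    print(tag, inside)
# /Collection_1 ['0', '0', 'm', '100', '0', 'l', 'S']
```

A real parser would tokenize according to the PDF syntax rules rather than on whitespace, but the stack of open sequences would look much the same.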

More useful is the ability to associate a property list with page content. The DP operator marks a specific point in page content, and the BDC operator starts a section of marked content that is terminated by a corresponding EMC operator. Both take a tag together with a property list, which can hold properties and references to other objects in the PDF file:

/Collection_2 << /PropertyName (Name for this property) >> BDC
  % PDF content stream operators grouped, where these all share
  % a common property or set of properties
EMC
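
As a rough illustration of reading those properties back, the sketch below uses regular expressions to collect the tag and the inline property list of each BDC operator. Everything here is an assumption made for the example: the function name bdc_properties is hypothetical, only flat dictionaries with string, name, or number values are handled, and the /MCID entry in the sample is my addition (it is the key Tagged PDF uses to identify a marked-content sequence, but it does not appear in the snippet above).

```python
import re

# A tag, then an inline dictionary, then the BDC operator.
# Assumption: the dictionary is flat (no nested << >>).
BDC_RE = re.compile(r"(/[^\s<>/]+)\s*<<(.*?)>>\s*BDC", re.S)

# One dictionary entry: /Key followed by a (string), a /Name, or a number.
ENTRY_RE = re.compile(
    r"/(\w+)\s*(\((?:[^()\\]|\\.)*\)|/\w+|-?\d+(?:\.\d+)?)"
)


def bdc_properties(stream):
    """Return a list of (tag, properties) for each BDC with an inline dict."""
    results = []
    for tag, body in BDC_RE.findall(stream):
        props = {}
        for key, raw in ENTRY_RE.findall(body):
            if raw.startswith("("):
                props[key] = raw[1:-1]  # string literal, escapes left as-is
            elif raw.startswith("/"):
                props[key] = raw  # name object, kept with its slash
            else:
                props[key] = float(raw) if "." in raw else int(raw)
        results.append((tag, props))
    return results


sample = """
/Collection_2 << /PropertyName (Name for this property) /MCID 0 >> BDC
  BT ET
EMC
"""

print(bdc_properties(sample))
# [('/Collection_2', {'PropertyName': 'Name for this property', 'MCID': 0})]
```

BDC can also take a name instead of an inline dictionary, in which case the property list lives in the page's resources; this sketch does not attempt to resolve that case.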

Very quickly, many people saw the potential in being able to group page drawing elements together. Within the PDF standard and within applications, marked content enabled key capabilities:

Element grouping

Grouping drawing elements together to be worked with as a collection. Without marked content, line art can only be represented with its constituent drawing operations: lines, paths, and very simple shapes. Marked content allows grouping these together to form more complex line art, such as charts and graphs. With marked content, charts and graphs and other image items can be moved, copied, and otherwise manipulated as one item – instead of requiring users to tediously select the multiple pieces that make these up in a PDF file.

Logical structure

Describing the logical structure of PDF content. Without marked content, PDF page content can only represent visual elements. Logical structure came into PDF in 1999 with Reader & Acrobat 4.0 and PDF 1.3. With support for properties and property lists, marked content forms the foundation for grouping visual page parts by the document concepts these represent. Because marked content containers are nestable, hierarchical concepts such as chapters in a document can be created. And because logical structure and page drawing content are stored separately in a PDF, users who are interested in the logical structure of a PDF are not required to parse the graphical elements of pages in order to understand the structure.
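
Because the containers nest, a stack is enough to recover the hierarchy. The following Python sketch builds a small tree from the marked-content operators in a content stream and prints it as an indented outline. The names build_tree and outline, the "(page)" root label, and the sample stream are all illustrative assumptions. The sketch reads nesting only from the content stream; it does not read the separately stored logical structure that the paragraph above describes, and it would be confused by the text "EMC" appearing inside a string.

```python
import re

EVENT_RE = re.compile(
    r"(?P<tag>/[^\s<>/]+)\s*"          # the tag name
    r"(?:<<.*?>>\s*|/[^\s<>/]+\s+)?"   # optional inline dict or named property list
    r"(?:BDC|BMC)\b"                   # operator that opens a sequence
    r"|\b(?P<end>EMC)\b",              # or the operator that closes one
    re.S,
)


def build_tree(stream):
    """Return a nested dict tree of marked-content tags found in stream."""
    root = {"tag": "(page)", "children": []}
    stack = [root]
    for match in EVENT_RE.finditer(stream):
        if match.group("end"):
            if len(stack) > 1:  # ignore a stray EMC rather than fail
                stack.pop()
        else:
            node = {"tag": match.group("tag"), "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
    return root


def outline(node, depth=0):
    """Flatten the tree into indented lines, two spaces per level."""
    lines = ["  " * depth + node["tag"]]
    for child in node["children"]:
        lines.extend(outline(child, depth + 1))
    return lines


sample = """
/Sect BMC
  /H1 <</MCID 0>> BDC
    BT (Chapter 1) Tj ET
  EMC
  /P <</MCID 1>> BDC
    BT (Body text) Tj ET
  EMC
EMC
"""

print("\n".join(outline(build_tree(sample))))
# (page)
#   /Sect
#     /H1
#     /P
```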

Tagged PDF

Standardizing semantic content representation. Marked content and logical structure were brought together in PDF 1.4 in 2001, and a standard set of structure tags was created, to form what is known as Tagged PDF. Tagged PDF and its Standard Structure Tags (SSTs) extend the ability to group and mark up page content into a portable, interoperable set of document semantic concepts. Complex PDF content spanning dozens or hundreds of different image and text operations, such as tables, now has a way to be specified, exported, and re-imported into different applications, as well as transcoded into different formats.
Tagged PDF makes document concepts, not just graphical content, interoperable and convertible. Building on marked content sequences, Tagged PDF sets the basis for many PDF capabilities, including:

  • Content accessibility for screen readers and text to speech applications
  • Content extraction for automated content ingestion, in applications such as natural language processing (NLP)
  • Content transformation for non-layout markup languages such as XML
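
To give a feel for the transformation case, here is a toy Python sketch that turns tagged marked-content sequences into XML elements. It is a simplification under stated assumptions: the function name to_xml and the Document root element are my own, sequences are assumed not to nest, only text shown with the Tj operator is collected, and the tags are read straight from the content stream. A real converter would follow the document's logical structure and decode fonts and string escapes, none of which is attempted here.

```python
import re
import xml.etree.ElementTree as ET

# /Tag << ... >> BDC  ...content...  EMC   (assumption: no nesting)
SEQ_RE = re.compile(r"/(\w+)\s*<<.*?>>\s*BDC(.*?)EMC", re.S)

# A literal string followed by the Tj (show text) operator.
TEXT_RE = re.compile(r"\(((?:[^()\\]|\\.)*)\)\s*Tj")


def to_xml(stream):
    """Return an XML string with one element per tagged sequence."""
    root = ET.Element("Document")
    for tag, body in SEQ_RE.findall(stream):
        element = ET.SubElement(root, tag)
        element.text = "".join(TEXT_RE.findall(body))
    return ET.tostring(root, encoding="unicode")


sample = """
/H1 <</MCID 0>> BDC BT (Marked Content) Tj ET EMC
/P <</MCID 1>> BDC BT (Grouping operators ) Tj (into units.) Tj ET EMC
"""

print(to_xml(sample))
# <Document><H1>Marked Content</H1><P>Grouping operators into units.</P></Document>
```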

Optional content

The ability of PDF 1.5 and later files to have different layers and layer groups is built on the marked content facility. Marked content operators group portions of page content, and PDF arrays and dictionaries collect these groups into sets that can be displayed or hidden together as one. Layers are a powerful application of marked content and other PDF features, and have several common uses:

  • Allowing the inclusion of multiple languages of text in one PDF document. Users can toggle between different languages to match their preference
  • Collecting multiple components of a drawing into one PDF file and controlling the view of these. In files such as architectural drawings, different elements such as wiring, plumbing, HVAC and structural components can all be included, and each can be shown or hidden individually as needed by different viewers
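
The multi-language case can be sketched in a few lines of Python. In a content stream, optional content is marked with the /OC tag and a property name, as in "/OC /MC0 BDC ... EMC". In a real file that name is resolved through the page's resources to an optional content group, and the document's optional content settings decide what is shown. The sketch below replaces all of that with two plain Python values, which is purely an assumption for illustration: properties maps property names to layer names, and hidden_layers is the set of layers switched off. The function name visible_tokens is hypothetical, and tokens are split on whitespace, so strings containing spaces are not handled.

```python
def visible_tokens(stream, properties, hidden_layers):
    """Return the tokens of stream that are not inside a hidden layer."""
    tokens = stream.split()
    kept = []
    open_seqs = []  # one flag per open sequence: True if it hides its content
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        # "/OC /Name BDC" opens an optional-content sequence.
        if tok == "/OC" and tokens[i + 2 : i + 3] == ["BDC"]:
            layer = properties.get(tokens[i + 1])
            open_seqs.append(layer in hidden_layers)
            if not any(open_seqs):
                kept.extend(tokens[i : i + 3])
            i += 3
            continue
        if tok in ("BMC", "BDC"):
            open_seqs.append(False)  # ordinary marked content hides nothing
        visible = not any(open_seqs)
        if tok == "EMC" and open_seqs:
            open_seqs.pop()
        if visible:
            kept.append(tok)
        i += 1
    return kept


sample = """
/OC /MC0 BDC
  BT (Hello) Tj ET
EMC
/OC /MC1 BDC
  BT (Bonjour) Tj ET
EMC
"""

properties = {"/MC0": "English", "/MC1": "French"}  # hypothetical layer names
print(" ".join(visible_tokens(sample, properties, hidden_layers={"French"})))
# /OC /MC0 BDC BT (Hello) Tj ET EMC
```

Anything nested inside a hidden sequence is dropped as well, which matches the idea that a layer is shown or hidden as a unit.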

From humble beginnings as a means to allow Acrobat plugins to store their own private information in PDF files, content grouping via marked content has provided the basis for many useful capabilities in PDF. We hope you’ve enjoyed this brief look into marked content in PDF files, as well as the different capabilities in PDF that build upon the simple but powerful concept of grouping visual elements together into common structures. If you have any questions, comment below or contact us.

1 thought on “PDF Feature Exploration: Marked Content”

  1. Chetan Shinde

    Hi Mark,

    I am studying Marked content in PDF. Information provided here is very helpful, Thanks!

    I came across a PDF file that has marked content, but a few objects within the marked content are hidden.
    Here, one BDC-EMC block has both visible and hidden objects, and I don’t see an OCGs array in the document.
    How does this work? How can I tell which objects (graphics/text) are visible and which are hidden?

    I do not see an option to attach a PDF file here, so I am sharing the content stream. Only one BT-ET block in “/PlacedPDF /MC0 BDC” is visible; all the others are hidden.

    Thanks,
    Chetan

    PDF Content Stream
    —————————————–

    /Span <>BDC
    /Span <>BDC
    EMC
    EMC
    /Span <>BDC
    EMC
    /Span <>BDC
    /Span <>BDC
    EMC
    EMC
    q
    /Perceptual ri
    /GS0 gs
    /T1_1 1 Tf
    /Fm0 Do
    Q
    /Figure <>BDC
    /PlacedPDF /MC0 BDC

    BT
    0 0 0 1 k
    /Perceptual ri
    /GS0 gs
    /T1_0 1 Tf
    6.7092 0 0 6.7092 91.8006 408.647 Tm
    [(St)-20(andard)]TJ
    ET

    q
    67.107 261.154 77 188.188 re
    W n
    BT
    -0.12 Tw 6.7092 0 0 6.7092 332.5724 347.7748 Tm
    [(Mec)50(hanical T)115(ee)]TJ
    0 Tw 17.697 9.073 Td
    [(A)40(WW)40(A Ductile Iron Pipe)]TJ
    -34.941 -1.057 Td
    [(R)20(educing)]TJ
    -1.399 -20.545 Td
    (Outlet Coupling)Tj
    ET
    Q
    q
    67.107 261.154 77 188.188 re
    W n
    BT
    6.7092 0 0 6.7092 339.3285 306.0237 Tm
    [(Saddle-L)20(et)]TJ
    -18.251 6.751 Td
    [(R)20(educing)]TJ
    -4.096 -1.2 Td
    (\(2″ x 1\275″, 2\275″ x 2″, 3″ x 2\275″\))Tj
    -0.025 Tw 20.744 8.715 Td
    [(Flange A)20(dapter)]TJ
    0 Tw 2.279 -20.578 Td
    [(W)-20(ildcat)]TJ
    19.004 0 Td
    (HDPE Pipe)Tj
    ET
    Q
    q
    67.107 261.154 77 188.188 re
    W n
    BT
    -0.025 Tw 6.7092 0 0 6.7092 467.048 359.0001 Tm
    [(IPS )-25(to A)40(WW)40(A)]TJ
    ET
    EMC
    EMC
    /Figure <>BDC
    /PlacedPDF /MC1 BDC
    Q
    q
    170.527 255.484 83.892 188.189 re
    W n
    BT
    6.7092 0 0 6.7092 73.8793 402.9777 Tm
    [(St)-20(andard)]TJ
    0.205 -7.706 Td
    (GapSeal)Tj
    -0.12 Tw 35.682 -1.367 Td
    [(Mec)50(hanical T)115(ee)]TJ
    0 Tw 17.697 9.073 Td
    [(A)40(WW)40(A Ductile Iron Pipe)]TJ
    ET
    Q
    q
    170.527 255.484 83.892 188.189 re
    W n
    BT
    6.7092 0 0 6.7092 65.2513 303.5076 Tm
    [(End P)20(rotection)]TJ
    38.18 -0.47 Td
    [(Saddle-L)20(et)]TJ
    ET
    Q
    q
    170.527 255.484 83.892 188.189 re
    W n
    BT
    -0.025 Tw 6.7092 0 0 6.7092 310.6531 396.0676 Tm
    [(Flange A)20(dapter)]TJ
    0 Tw 2.279 -20.578 Td
    (W)Tj
    6.7092 0 0 6.7092 171.4775 337.5969 Tm
    24.043 -11.863 Td
    (ildcat)Tj
    17.984 0 Td
    (HDPE Pipe)Tj
    -56.203 -0.017 Td
    [(F)20(astFit)]TJ
    4.1287 0 0 4.1287 96.3529 259.9574 Tm
    (\256)Tj
    -0.025 Tw 6.7092 0 0 6.7092 449.1268 353.3308 Tm
    [(IPS )-25(to A)40(WW)40(A)]TJ
    ET
    EMC
    EMC
    /Figure <>BDC
    /PlacedPDF /MC2 BDC
    Q
    q
    62.748 59.87 83.953 188.188 re
    W n
    BT
    6.7092 0 0 6.7092 -157.3332 207.3635 Tm
    [(St)-20(andard)]TJ
    0.205 -7.706 Td
    (GapSeal)Tj
    ET
    Q
    q
    62.748 59.87 83.953 188.188 re
    W n
    BT
    6.7092 0 0 6.7092 202.1706 207.3635 Tm
    [(A)40(WW)40(A Ductile Iron Pipe)]TJ
    -34.941 -1.057 Td
    [(R)20(educing)]TJ
    -1.399 -20.545 Td
    (Outlet Coupling)Tj
    -18.53 6.776 Td
    [(End P)20(rotection)]TJ
    ET
    Q
    q
    62.748 59.87 83.953 188.188 re
    W n
    BT
    6.7092 0 0 6.7092 -32.2543 150.0337 Tm
    [(R)20(educing)]TJ
    -4.096 -1.2 Td
    (\(2″ x 1\275″, 2\275″ x 2″, 3″ x 2\275″\))Tj
    ET
    Q
    q
    62.748 59.87 83.953 188.188 re
    W n
    BT
    6.7092 0 0 6.7092 222.231 62.3919 Tm
    (HDPE Pipe)Tj
    -56.203 -0.017 Td
    [(F)20(astFit)]TJ
    4.1287 0 0 4.1287 -134.8597 64.3432 Tm
    (\256)Tj
    -0.025 Tw 6.7092 0 0 6.7092 217.9142 157.7166 Tm
    [(IPS )-25(to A)40(WW)40(A)]TJ
    ET
    EMC
    EMC
    /Figure <>BDC
    /PlacedPDF /MC3 BDC
    Q
    q
    169.441 59.898 85.291 183.362 re
    W n
    BT
    6.7092 0 0 6.7092 -181.845 207.3911 Tm
    [(St)-20(andard)]TJ
    0.205 -7.706 Td
    (GapSeal)Tj
    -0.12 Tw 35.682 -1.367 Td
    [(Mec)50(hanical T)115(ee)]TJ
    ET
    Q
    q
    169.441 59.898 85.291 183.362 re
    W n
    BT
    6.7092 0 0 6.7092 -56.7661 200.2995 Tm
    [(R)20(educing)]TJ
    -1.399 -20.545 Td
    (Outlet Coupling)Tj
    -18.53 6.776 Td
    [(End P)20(rotection)]TJ
    38.18 -0.47 Td
    [(Saddle-L)20(et)]TJ
    -18.251 6.751 Td
    [(R)20(educing)]TJ
    -4.096 -1.2 Td
    (\(2″ x 1\275″, 2\275″ x 2″, 3″ x 2\275″\))Tj
    -0.025 Tw 20.744 8.715 Td
    [(Flange A)20(dapter)]TJ
    0 Tw 2.279 -20.578 Td
    [(W)-20(ildcat)]TJ
    ET
    Q
    q
    169.441 59.898 85.291 183.362 re
    W n
    BT
    6.7092 0 0 6.7092 -179.356 62.3054 Tm
    [(F)20(astFit)]TJ
    4.1287 0 0 4.1287 -159.3715 64.3708 Tm
    (\256)Tj
    ET
    EMC
    EMC
    Q
