How To Articles
by Thom Parker of WindJack Solutions.
Copyright © 2004 by WindJack Solutions
Navigate the Internal Structure of a PDF Document
This article
discusses how you can use PDF CanOpener to gain a better understanding of how
the low level PDF Objects are grouped together to form the higher level
structures that make up a PDF Document.
A PDF Document has many layers of abstraction. You can look
at one from many different perspectives, each with its own
advantages and disadvantages. At the lowest level, the PDF File contains the raw document data. Next up, the COS
Layer organizes this data into a tree of simple objects. At the
PD layer, these simple objects are put together to implement
useful intermediate level structures like Fonts and Images. These are in
turn organized into higher level constructs like Annotations
and Pages. Some of these objects are also used to
impose logical structure, like paragraphs and article threads.
And there are more layers still.
Each of these layers of abstraction has its own independent
set of rules. For example, a file can satisfy every rule of the file
format and still contain no useful objects. The COS Object
Tree may contain many objects that do not contribute to the
document display or are completely unintelligible to Acrobat,
but still form a legal object tree. Knowing how to
navigate these structures is essential to any PDF related
development effort. PDF CanOpener makes this task easier
than it ever has been before by providing a concurrent display
of the document's COS Object Tree and navigation tools that show
the relationship between objects in the tree and graphics
displayed on the screen.
PDF File Structure:
The PDF File Format is text with some binary data mixed in.
If you open it in a text editor you'll see the raw objects
that define the structure and content of the document.
Explicit object definitions are prefixed with some text that
looks like this: '12 0 obj'. The number 12 is the object number
and the 0 is the generation number; together they form the
object reference. This is called an
Indirect Object since it can be referenced by its number.
You will also see objects without this reference prefix.
These objects are called Direct Objects and are always contained
inside other objects. A container object that references
another object does so with the syntax
'12 0 R', which refers to the object defined
with '12 0 obj'.
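To make the definition and reference syntax concrete, here is a minimal Python sketch that scans a fragment of raw PDF text for both forms. The sample fragment and the function names are made up for illustration. A plain pattern scan like this is only an approximation: binary stream data or string contents can contain look-alike text, so it is no substitute for a real parser.

```python
import re

# A tiny fragment of raw PDF text, written by hand for illustration only.
SAMPLE = b"""12 0 obj
<< /Type /Example /Next 13 0 R >>
endobj
13 0 obj
(hello)
endobj
"""

# "<object number> <generation number> obj" opens an indirect object definition.
DEF_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
# "<object number> <generation number> R" is a reference to an indirect object.
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def find_definitions(data):
    """Return (object number, generation number) for each 'N G obj' found."""
    return [(int(num), int(gen)) for num, gen in DEF_RE.findall(data)]


def find_references(data):
    """Return (object number, generation number) for each 'N G R' found."""
    return [(int(num), int(gen)) for num, gen in REF_RE.findall(data)]


if __name__ == "__main__":
    print("defined:   ", find_definitions(SAMPLE))
    print("referenced:", find_references(SAMPLE))
```

In the sample, object 12 is a Dictionary whose /Next entry points at object 13 by reference instead of containing it directly.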
There are only 8 low level, or COS, object types that carry data
(a ninth, Null, written 'null' in the file, simply marks a missing value).
The first 5 are scalar (single value) types:
- Integer - in the file as a number without a decimal
point.
- Boolean - in the file as the text 'true'
or 'false'.
- Real Number - in the file as a number with a
decimal point.
- Name - in the file as '/text'
i.e. a forward slash, '/', followed by some text, no white space or punctuation
allowed.
- String - in the file as either '(...characters...)'
or '<...hexadecimal character
codes...>' .
The next 3 are container types:
- Dictionary - in the file as '<<...other
objects...>>' . Dictionary entries are always in
pairs, a Name Object followed by any other object type.
- Array - in the file as '[...other
objects...]'. Just a list of other object types.
- Stream - in the file as '20 0 obj<<...stream attribute objs...>>stream...binary data...endstream'.
This is the most complex type. It's actually a
Dictionary Object mated with a string of bytes. The Dictionary
contains information necessary for accessing the data in the
string of bytes. Streams are always indirect objects, so they
always begin with an object reference.
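The list above can be summarized as a rough Python sketch that guesses the COS type of a single token from the way it starts. The function name classify_token is hypothetical, and the rules are deliberately simplified: a real parser has to handle nesting, escapes, comments and white space, and a Stream cannot be recognized from one token because it is a Dictionary followed by the 'stream' keyword.

```python
import re


def classify_token(token):
    """Guess the COS type of one token of raw PDF text (simplified rules)."""
    if token in ("true", "false"):
        return "Boolean"
    if re.fullmatch(r"[+-]?\d+", token):
        return "Integer"          # digits only, no decimal point
    if re.fullmatch(r"[+-]?(\d+\.\d*|\.\d+)", token):
        return "Real"             # digits with a decimal point
    if token.startswith("/"):
        return "Name"
    if token.startswith("<<"):
        return "Dictionary"       # must be tested before the hex string case
    if token.startswith("(") or token.startswith("<"):
        return "String"           # literal (...) or hexadecimal <...>
    if token.startswith("["):
        return "Array"
    return "Unknown"


if __name__ == "__main__":
    for tok in ["12", "3.14", "true", "/Type", "(hello)",
                "<48656C6C6F>", "<< /Count 3 >>", "[0 0 612 792]"]:
        print(tok, "->", classify_token(tok))
```

Note that the Dictionary test has to come before the String test, since '<<' and '<' share a first character.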
Getting tired yet? Looking at the raw data in the file is really not
very useful. The structure of a PDF File does not match the
structure of the PDF Document it describes. For the most part
the file looks like a list of unconnected object definitions.
To get a sense of how it all fits together you need to look at
it another way.
The COS Object Tree:
The COS Object Tree is the real meat of a PDF Document. Except for
some security info and the Info dictionary (neither of which
contribute to the document display), everything in the
document is in the COS Object Tree. Many of the nodes in this
tree are the root nodes for a higher level object type.
Typically this root node will be a COS Dictionary. Two
notable exceptions are the XObject, which has a COS Stream as
the root, and the Rectangle Object, which uses the COS Array as
its root object. Let's take a look at some of these
objects. For the purposes of this article I'll be using
screen shots of PDF CanOpener to display different parts of the COS
Object Tree. This discussion will focus on the most
important part of the tree, the pages. Articles on
navigating other parts of the COS Object Tree will come later.
If you have PDF CanOpener just activate it and it will
show you the root of the current Document (if you don't have it,
download the free trial copy and install it now). Expand this node to
get a look at the first level of objects in the document.
Already this is easier than looking at the raw data in the file.
There are a lot of objects here, but for the minimal PDF Document
all you need is the "Page Tree," the root of which is
the first Pages object. The functions of some of
the other objects are obvious from the names, like OpenAction
and PageLayout, others are a bit cryptic until you know more
about the PDF Spec. What they all
have in common is that they all encapsulate properties that are
global to and extend the basic functionality of this document.
The entries shown above are fairly typical and all of
them, and a few more, are explained in the PDF spec, except for
the last two. I used PDF CanOpener to create these
custom objects and placed them in the document root. PDF
CanOpener uses the Acrobat SDK so you can do the same.
Acrobat will save them with the document even though it has no
idea what they are. It will do this because they are part of a
well formed object tree and it doesn't have to know about them.
It just ignores what it can't use, but will preserve these tree
nodes as long as they don't interfere with anything else it wants to do.
All third-party PDF
editing and viewing tools should do the same. This object tree is a very powerful and
flexible thing. It allows a PDF file to be both backward and
forward compatible. A PDF file's capabilities can be extended
not just by Adobe, but by anyone who defines a custom object
type and adds it to the COS Object Tree. Of course you
also have to write the software to take advantage of the new
objects and many people have, so some of the things you may see
in an object tree will not be in the PDF Specification.
Objects defined by the PDF Spec tend to have a similar
structure. This similarity is most evident in PDF Objects that
were added to the spec at the same time. For
example, expand the Pages Dictionary.
There are 3 members: Type, Count, and Kids. Most,
but not all, PD layer objects will have a Type member. In
this case it's "Pages" for a Pages Object. The
Count member indicates how many total pages there are in this
branch of the Page Tree. The Kids Array is a list of
either Pages or Page Objects. The Pages Objects are
the intermediate nodes of the Page Tree and the Page Objects
are the leaves. The depth of the tree depends on how many pages
are in a document. This document has 3 pages so it has a one
layer Page Tree. Acrobat starts adding layers of Pages Objects
to this tree as the number of pages in a document grows, mostly
keeping the number of objects in the Kids Array between 4 and
10 to create a balanced tree.
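The Page Tree described above can be walked with a short recursive function. The sketch below models Pages and Page Objects as plain Python dictionaries, which is an assumption made for illustration; it does not read a real PDF file, and the "Label" entries are invented so the output is readable.

```python
def iter_pages(node):
    """Yield the leaf Page dictionaries under a Pages or Page node, in order."""
    if node["Type"] == "Page":
        yield node
        return
    # A Pages node is an intermediate node: descend into each of its Kids.
    for kid in node["Kids"]:
        yield from iter_pages(kid)


# Hypothetical 3 page document with a one layer Page Tree.
flat_tree = {
    "Type": "Pages",
    "Count": 3,
    "Kids": [
        {"Type": "Page", "Label": "page 1"},
        {"Type": "Page", "Label": "page 2"},
        {"Type": "Page", "Label": "page 3"},
    ],
}

# The same pages arranged with an extra layer of Pages nodes.
nested_tree = {
    "Type": "Pages",
    "Count": 3,
    "Kids": [
        {"Type": "Pages", "Count": 2, "Kids": flat_tree["Kids"][:2]},
        {"Type": "Pages", "Count": 1, "Kids": flat_tree["Kids"][2:]},
    ],
}

if __name__ == "__main__":
    for tree in (flat_tree, nested_tree):
        labels = [page["Label"] for page in iter_pages(tree)]
        print(labels, "Count matches:", len(labels) == tree["Count"])
```

Both trees yield the same pages in the same order, which is why the depth of the Page Tree is invisible to someone reading the document.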
The Page Object encapsulates all the information
needed to draw a single document page. Expand a Page
Object.
As with the Pages Object, the Page dictionary has a
Type member. The main content of a page is stored in the
Contents member. This Stream Object contains a list
of PDF marking (drawing) operators. These operators are
the first items drawn on the page, followed by the Annotations,
in the Annots Array. The visible coordinate space
(or clipping area) for the page is defined by a
"Rectangle" Object. The Rectangle Object is an array of 4
fixed point numbers: the user coordinates for the left,
bottom, right, and top sides of the rectangle. This page has two
Rectangle Objects, MediaBox and CropBox. The
intersection of these two rectangles determines the visible
drawing area of the page. There are 3 other types of
rectangular objects that a page might have: the BleedBox,
TrimBox, and ArtBox. These objects have special printing functions
and don't affect the visible display of the page in Acrobat.
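The intersection of MediaBox and CropBox is simple arithmetic on the four numbers. Here is a minimal Python sketch, assuming each rectangle is given as a list of four numbers; the function names and the sample values (a US Letter page with a half inch crop margin) are chosen for illustration only.

```python
def normalize(box):
    """Return [left, bottom, right, top] regardless of the corner order given."""
    x1, y1, x2, y2 = box
    return [min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)]


def visible_area(media_box, crop_box):
    """Intersect two rectangles; return None if they do not overlap."""
    m = normalize(media_box)
    c = normalize(crop_box)
    left, bottom = max(m[0], c[0]), max(m[1], c[1])
    right, top = min(m[2], c[2]), min(m[3], c[3])
    if left >= right or bottom >= top:
        return None
    return [left, bottom, right, top]


if __name__ == "__main__":
    media_box = [0, 0, 612, 792]    # 8.5 x 11 inches at 72 units per inch
    crop_box = [36, 36, 576, 756]   # half inch margin on every side
    print(visible_area(media_box, crop_box))
```

When the CropBox lies entirely inside the MediaBox, as it does here, the result is just the CropBox.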
The Content Stream may contain references to a number of
resources it needs during the drawing process. These
resources are, of course, listed in the Resources Dictionary. The page content can also obtain resources
from the Resources Dictionary of the page's parent (rarely) if the page
does not have a Resources Dictionary of its own.
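The fallback to a parent's Resources Dictionary can be sketched as a walk up the Parent chain. As before, the objects are modeled as plain Python dictionaries and the names (find_resources, the font name F1) are assumptions for illustration. The sketch keeps climbing past the immediate parent if that node has no Resources Dictionary either.

```python
def find_resources(page):
    """Return the nearest Resources dictionary, starting at the page itself."""
    node = page
    while node is not None:
        if "Resources" in node:
            return node["Resources"]
        node = node.get("Parent")   # the root Pages node has no Parent
    return None


# Hypothetical Page Tree: the parent carries the resources, the page does not.
pages_root = {"Type": "Pages", "Resources": {"Font": {"F1": "a font object"}}}
page_without = {"Type": "Page", "Parent": pages_root}
page_with = {"Type": "Page", "Parent": pages_root,
             "Resources": {"Font": {"F2": "another font object"}}}

if __name__ == "__main__":
    print(find_resources(page_without))  # falls back to the parent's dictionary
    print(find_resources(page_with))     # the page's own dictionary wins
```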
Open the Resources dictionary.
This dictionary contains lists of each type of resource referenced in the content stream.
Except for the "Pattern" Resource, this one contains every
type of resource defined in the PDF Spec. Each of these
resources is used to control some drawing attribute in a section of the
page content.
- The ColorSpace and Font have obvious functions.
However, they can both be horrendously complex internally.
- XObjects are a way of both abstracting drawing elements
out of the stream and representing raster images. They
are very useful in page content, and are also used as the
graphical representation of all Annotations.
- The ProcSet list tells Acrobat what operator procedure
sets will be needed to render this content stream in
PostScript.
- The ExtGState object contains parameters for line width
and style as well as a whole host of parameters for
controlling
the fine details of how things are drawn.
- The Shading object provides a way of
controlling color transitions across a drawing area.
Each of these resource objects, except for the ProcSet, is a complex construction of both the simple COS
object types and higher level objects. For example, expand
the XObject list. If the page you're looking at doesn't
have one then just follow along with this example. The
root object for the XObject is a COS Stream. What
makes it an XObject are the entries in its Attributes Dictionary. Open up one of the
XObjects
and take a look inside the Attributes Dictionary.
A couple of these entries are already familiar. The
Type entry, of course, has the value "XObject" and
the Resources Dictionary has the same format as the Resources
Dictionary in the Page Object. In fact, it can contain
more XObject resources that contain other XObject resources.
XObjects come in two flavors, "Form" and "Image,"
indicated by the value of the Subtype entry. This one is a "Form"
XObject, which means that the stream data has a format identical
to the page content stream, but unlike the page content, its resources are inside its attribute dictionary.
An XObject's resources can also be in the page's
Resources
dictionary, but the PDF Specification discourages this practice, so
it is not common.
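Because a Form XObject carries its own Resources Dictionary, XObjects can nest inside one another. The following Python sketch lists every XObject reachable from a Resources Dictionary along with its nesting depth. The dictionaries and the names Fm0, Fm1 and Im0 are invented for illustration, and the sketch assumes the nesting contains no cycles.

```python
def walk_xobjects(resources, depth=0):
    """Yield (depth, name, subtype) for each XObject reachable from resources."""
    for name, xobj in resources.get("XObject", {}).items():
        yield depth, name, xobj["Subtype"]
        # Only a Form XObject has a content stream that needs its own resources.
        if xobj["Subtype"] == "Form":
            yield from walk_xobjects(xobj.get("Resources", {}), depth + 1)


# Hypothetical page resources: a form that uses an image and another form.
page_resources = {
    "XObject": {
        "Fm0": {
            "Type": "XObject",
            "Subtype": "Form",
            "Resources": {
                "XObject": {
                    "Im0": {"Type": "XObject", "Subtype": "Image"},
                    "Fm1": {"Type": "XObject", "Subtype": "Form",
                            "Resources": {}},
                },
            },
        },
    },
}

if __name__ == "__main__":
    for depth, name, subtype in walk_xobjects(page_resources):
        print("  " * depth + name, "(" + subtype + ")")
```

Here Fm1 has an empty Resources Dictionary, so the walk stops there.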
XObjects have their own coordinate
space, which is clipped by the BBox entry. The Matrix entry
provides the mapping from the XObject's coordinate space
to the coordinate space of the containing object. In this case
that is the page's drawing area, called the User Space in the PDF
Spec.
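A PDF matrix is written as an array of six numbers, [a b c d e f], and maps a point (x, y) to (a*x + c*y + e, b*x + d*y + f). The Python sketch below applies a Matrix entry to the corners of a BBox; the function names and sample values are assumptions chosen for illustration.

```python
def form_to_user(matrix, x, y):
    """Map a point from the XObject's space into the containing space."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


def bbox_corners(bbox, matrix):
    """Return the four BBox corners after applying the Matrix entry."""
    left, bottom, right, top = bbox
    corners = [(left, bottom), (right, bottom), (right, top), (left, top)]
    return [form_to_user(matrix, x, y) for x, y in corners]


if __name__ == "__main__":
    bbox = [0, 0, 50, 20]
    shift = [1, 0, 0, 1, 100, 200]   # pure translation by (100, 200)
    print(bbox_corners(bbox, shift))
```

With the identity matrix [1 0 0 1 0 0] the two coordinate spaces coincide, and the BBox can be read directly as a rectangle in the containing space.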
We have traversed the COS Object Tree from the Document Root
down one path through several high level objects until we
reached the last object on this branch. The XObject shown
above is on the bottom of the drawing stack since its Resources
Dictionary is empty. If it weren't empty we could keep pushing down
through the objects in the Resources until we found the last
thing that Acrobat needed to draw this object on the page, which
is what Acrobat has to do. Sometimes it seems as if the
hierarchy is endless. Partly this is because with PDF CanOpener
you are looking at the lowest level of abstraction in the PDF
Document. A higher level view would hide the endless
parameters some of the objects have. Acrobat can also
build some very complex structures for what seem like simple
things. As an example, create a Text Box Annotation on a
document and then look at its COS Object Tree. It is
represented in the Annotation list as a single Annotation.
Now right click on it in the page and set the status.
Refresh the page's Annotation list in the PDF CanOpener display.
There are now 3 Annotations associated with the original Text
Box. Change the status again and it
becomes 5 Annotations. Acrobat uses these extra
Annotations to save the status history, adding a huge number of
COS Objects to the tree in the process.
Sometimes you may have to really look hard for the data you
want. The last example shows the depths to which some data
is buried. This screen shot is of a Multimedia Object that
plays a QuickTime movie. The Multimedia Object lives inside of
an Annotation. With the PDF Spec alone it would be very
difficult to understand the structure of this object and locate
the actual movie data. PDF CanOpener provides the
visualization of the COS Object Tree you need to get to the bottom
of the document, literally.
The inspiration
for this image was provided by Leonard Rosenthal, the PDF Guru
Extraordinaire.
We hope this material was helpful to you.
If you have any questions or comments for us or want more info
on PDF CanOpener, please send email to
info@windjack.com.
Check back regularly for new articles.