June 10, 2003
XML. It’s everywhere. In your day-to-day existence, you’ve probably had some exposure to XML, whether through industry buzzword immersion or because you’ve been tasked with actually using XML files. For us Mac programmers, Apple provides some XML technologies you can use in your own programs. They’re not the most full-featured or the most robust, but for many uses of XML, they’re pretty good.
At its most basic level, XML is text markup. As in HTML, there are tags which indicate “here is some text that is a name” or “here is some text that should be emphasized”. The tags describe a hierarchy of information, such as a <critter> element containing a <name> and a <type>, which describes a “critter” by specifying the name and type. There could be multiple <critter></critter>s in the data.
XML parsers take the marked-up text and turn it into in-memory structures that your program can then deal with. In this case, you might have an Objective-C object that represented naked mole rats, which had a name field.
XML in Cocoa
If you’re dealing exclusively with “property list” Cocoa objects (like
arrays, dictionaries, strings, dates,
NSData), Cocoa has
some built-in features to store your data as XML and to read it all
back in. These allow you to use XML as your document format in case
your users need to poke around in the document using XML tools or you
have a contractual requirement to store the data in XML. These routines
are also handy for making a quick data exchange file format.
The NSPropertyListSerialization class can be used to convert
your property list objects into an
NSData containing the
XML, which you can then write to a file or send to another
program over the network. The class can also be used for converting
XML back to your objects.
For instance, say you have an NSDictionary that contains your document information (which is stuffed full of arrays and strings and whatnot). Packing it up into XML is just a call to NSPropertyListSerialization’s dataFromPropertyList:format:errorDescription: method, asking for the NSPropertyListXMLFormat_v1_0 format.
Similarly, if you have an NSData with your document contents, you can recreate your objects with propertyListFromData:mutabilityOption:format:errorDescription:.
The interesting argument is the mutabilityOption. You have three options:
- NSPropertyListImmutable: all of the objects created will be of the immutable variety. You can look, but you can’t change.
- NSPropertyListMutableContainers: any container class (arrays and dictionaries) will be created as a mutable version (NSMutableArray, for instance) so you can then add or remove objects from the collection, but the non-container classes will be immutable.
- NSPropertyListMutableContainersAndLeaves: gives you mutable containers, and the contents will be mutable as well (NSMutableString, for instance).
Digging deeper: Apple’s XML Services
Inside the Core Foundation framework lurks the XML Services, a set of utilities for parsing XML documents into an in-memory form which your program can use. The XML parser isn’t as full-featured as some others available out there, but it is pretty easy to get the basics up and running, plus you know it’ll be on every OS X system.
There are two faces to the XML services. The “high level” API will take the XML file and generate an in-memory tree with it. The “low level” API, which is a little more difficult to use, will invoke callback functions you supply as the parser works through the XML file.
The high-level API
To use the high-level API you can either specify a URL to get the data from, or you can put the XML text into a CFData. Here we’ll use the CFData route, and use just the plain URL route in the low-level API discussion. So, the basic plan is: once you have the CFData, give it to CFXMLTreeCreateFromData to actually parse the XML into a CFTree, and then walk the tree to do whatever it is you need to do.
Before getting into the code, be aware that the Core Foundation services tend to be a bit more verbose than their Cocoa counterparts, but the actual code is pretty straightforward. You can download the code presented here, along with Project Builder projects at http://borkware.com/rants/me-xml/.
So, back to actually using the high-level API. Here a
CFURL is made
with the location to find the XML data.
Load the URL’s contents and stuff them into a CFData, for example with CFURLCreateDataAndPropertiesFromResource.
Here, the only interesting part is the actual data, not any associated properties (like the permissions and last modification date for files). For Web URLs, the properties contain the headers returned by the Web server.
The CFData named xmlData now has the XML content. Parse it with CFXMLTreeCreateFromData.
This walks the XML and builds a tree in memory. The next to last argument is there to keep a bunch of tree nodes from being built for whitespace between tags. As an aside, you can supply the CFURL directly as the data source if you want and avoid making a CFData. I did it this way originally during development so I could poke around in the CFData and verify that yes, it indeedy has the XML I’m looking for.
Great, now we have a
CFTree. What is it, and what can
you do with it? A
CFTree is a Core Foundation collection class
that builds trees. A
CFTree object can have some data associated
with it and can contain any number of CFTrees as children. As
expected, there are methods for traversing the tree.
The tree returned from CFXMLTreeCreateFromData is
actually a tree built of
CFXMLNodes, which describe an individual
XML construct, like a tag, comment, or a chunk of data. Each
XML node has a type, a string of data, and a pointer to some
other data structure based on the type. There are a bunch of different
node types, like documents (representing the whole XML
document), elements, processing instructions, comments, entities,
whitespace and other more esoteric types.
Walking the tree and displaying information about it takes only a couple of calls. The CFTree calls of interest are CFTreeGetChildCount, which gives the number of children a node has, and CFTreeGetChildAtIndex, which gives you the nth child. The API is not recursive, so to process the whole tree instead of just the current level, you’ll need to do that work yourself.
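The “do that work yourself” part is an ordinary recursive descent. Here is a minimal sketch of the shape in plain C. It does not call Core Foundation: the Node struct and the node_child_count / node_child_at helpers are hypothetical stand-ins for a CFTree and for CFTreeGetChildCount / CFTreeGetChildAtIndex, so the recursion can be read on its own.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-in for a tree node; a real CFTree is opaque. */
typedef struct Node {
    const char *label;        /* e.g. an element name or a chunk of text */
    size_t child_count;
    struct Node **children;
} Node;

/* Stand-ins for the two accessors named in the text. */
static size_t node_child_count(const Node *n) { return n->child_count; }
static const Node *node_child_at(const Node *n, size_t i) { return n->children[i]; }

/* Print this node indented by its depth, then recurse into each child.
   Returns the number of nodes visited. */
static size_t walk(const Node *n, int depth) {
    size_t visited = 1;
    printf("%*s%s\n", depth * 2, "", n->label);
    for (size_t i = 0; i < node_child_count(n); i++) {
        visited += walk(node_child_at(n, i), depth + 1);
    }
    return visited;
}
```

Calling walk(root, 0) prints one line per node, indented two spaces per level. With the real API the body would presumably stay the same, with the two helpers swapped for the CFTree calls.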
The low-level API
The high-level API is
nice in that it’s pretty simple to go from XML file to something
in-memory you can play with. Unfortunately, it’s not very flexible.
CFTree doesn’t have a Cocoa object toll-free bridged to it (like the equivalence between CFDictionary and NSDictionary), so actually using the tree produced from the XML is tedious since you can’t directly stick the nodes into an NSDictionary, or easily use it as a data source for a Cocoa view.
The low-level API gives you more control over the parsing process, such
as skipping over elements if you’re not interested in them, which can
be handy if you have a huge XML document and only want a subset. The
low-level API also gives better error-checking and reporting. If
CFXMLTreeCreateFromData has a problem, it just returns
NULL, leaving you with no clue what went wrong.
So what is this low-level API? Basically, you define some callback functions (three are required, two more are optional) and give them to a parser object. As the XML gets parsed, these functions get invoked at various points along the way, and inside the functions you decide what should happen. For example, when a new element needs to be created, you can decide what kind of Cocoa object to create to hold any child items, and then create the object and set any properties.
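To make the control flow concrete, here is a small plain C sketch of a callback-driven build. Nothing in it is Apple’s API: the Event list stands in for the parser’s progress through a document, and Callbacks, Item, Build and run_events are hypothetical names. It assumes a child is handed to its parent as soon as it is created, and that returning NULL from the create callback skips that element and everything inside it.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins; none of these types come from Core Foundation. */
typedef struct { char name[32]; int child_count; } Item;
typedef struct { Item pool[8]; int used; int ended; } Build;   /* the "info" */
typedef struct { int open; const char *name; } Event;          /* 1 = start tag, 0 = end tag */

typedef struct {
    void *(*create)(const char *name, void *info);    /* return NULL to skip */
    void  (*add_child)(void *parent, void *child, void *info);
    void  (*end)(void *item, void *info);
} Callbacks;

static void *create_item(const char *name, void *info) {
    Build *b = info;
    if (strcmp(name, "note") == 0 || b->used == 8) return NULL;  /* not interested, or pool full */
    Item *it = &b->pool[b->used++];
    snprintf(it->name, sizeof it->name, "%s", name);
    it->child_count = 0;
    return it;
}

static void add_item(void *parent, void *child, void *info) {
    (void)child; (void)info;
    ((Item *)parent)->child_count++;
}

static void end_item(void *item, void *info) {
    (void)item;
    ((Build *)info)->ended++;
}

/* Drive the callbacks from a list of events. Assumes the events are
   well nested and never more than 16 levels deep. */
static void run_events(const Event *ev, size_t n, const Callbacks *cb, void *info) {
    void *stack[16];
    int depth = 0;
    int skipping = 0;   /* how deep we are inside a skipped element */
    for (size_t i = 0; i < n; i++) {
        if (ev[i].open) {
            if (skipping) { skipping++; continue; }
            void *item = cb->create(ev[i].name, info);
            if (item == NULL) { skipping = 1; continue; }
            if (depth > 0) cb->add_child(stack[depth - 1], item, info);
            stack[depth++] = item;
        } else {
            if (skipping) { skipping--; continue; }
            cb->end(stack[--depth], info);
        }
    }
}
```

The info pointer plays the role of the parser context: it carries the storage the callbacks build into, and the first item created (pool[0] here) is the root you would need to remember yourself.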
Before we start, one of the annoying things about the low-level API is that more stuff is documented than is actually implemented. So sometimes you see documentation and sample code for a particular feature, then look a little deeper and see that it is “unsupported”. Unfortunately, the unsupported features include some of the cooler ones, like using the XML file’s DTD for verification and expansion of entities, and some of the really useful ones, like automatically expanding physical entities like &lt; and &amp;.
So, forging ahead, here is the necessary setup for using the low-level API.
Just like with the high-level API, this starts out with a
CFURL that indicates where the XML data lives.
CFXMLParserCallBacks is a structure which contains function
pointers to the individual callbacks. This structure gets passed to the
next call to tell the parser what callbacks to use during parsing.
Then the parser itself is created. As with
CFXMLTreeCreateFromData, you can give it a
data source, or you can give it a CFData with the
XML in it. The kCFXMLParserSkipWhitespace flag is
given here to tell the parser to skip whitespace between tags, although
whitespace that is next to actual text data is preserved. The callbacks
are also provided. The last argument, the context, can be a
CFXMLParserContext structure that indicates some piece of data
to pass to the callback functions. Essentially it’s a rock you can
hide some of your data under, like a pointer to a Cocoa object you’re
accumulating data into.
And finally invoke the parser and clean up the mess. Note that
CFXMLParserParse doesn’t actually give you back the
root object for the tree that is created. You’ll have to squirrel
that away yourself.
Before digging into the callbacks themselves, here is the Cocoa object that will be used to build the in-memory representation of the XML data:
The implementation is straightforward and can be found on the download page. There’s also a global variable used to indicate the root of the tree:
The first callback,
createStructure, is also the most
complicated one, since that’s where most of the work happens. Here
it is in pieces:
The function signature is pretty simple: void *createStructure (CFXMLParserRef parser, CFXMLNodeRef node, void *info).
You’re given the parser, the XML node currently under scrutiny, and the info pointer specified in the context when the parser was created.
Right off the bat, look at the type of the node, and then switch on it. The type will have a big impact on the work that needs to be done.
And also declare a result variable:
The return value from the createStructure callback
will be held on to by the parser, and then given to other callbacks as
necessary to build the parent/child relationships described in the
XML. If you return NULL, the parser assumes you
don’t want to do anything with this particular element and will just
skip it.
The first case you’ll get is
kCFXMLNodeTypeDocument (if you lop off the
“kCFXMLNodeType” from the constant name, what’s left over is
the really important bit). This is the
<?xml version="1.0" encoding="UTF-8"?> header of
the XML file. For our parsing here, create the root object
that holds everything:
The second case is
kCFXMLNodeTypeElement. This is the node type for
the guts of the XML file. If you had something like the critter
example above, the
kCFXMLNodeTypeElement case would be executed three times,
once for the critter, once for type, and once for name. The type and
name would then become children of critter.
A new node is created with the type set to CFXMLNode’s string, which is the name that’s in the tag. Then the code gets ahold of the elementInfo for the node. This structure has a dictionary of attributes, an array which tells you what order those attributes were in, and a flag telling you if you have any attributes at all. Note that this dictionary can get reused, so if you want to hang on to the dictionary, you’ll need to make a copy of it.
The kCFXMLNodeTypeElement case is where most of the interesting stuff
will happen. You can look at the type to decide which class to create,
such as MECritter and MECritterName. You could pass the dictionary of
attributes to the init methods for the classes, which would then
use the attributes to set whatever properties or features they need.
The next chunk of code to consider is where you get the text that lives between the tags.
In this case, the callback is just returning the string. In the
addChild callback, this will be added to the parent
object. Since this string can be reused too, you’ll want to make a
copy of it.
The last common case is expanding entity references, like
&gt;. There is the
kCFXMLParserReplacePhysicalEntities flag you could
give to the parser which supposedly will expand the basic entities, but
that doesn’t really work (another “unsupported feature”), so you
have to have code to expand the basic entities, as well as any other
entities your XML might be using (which somewhat defeats the use of the XML
services as a general XML parser). You can set up a dictionary of entities
beforehand, and stash it into a global.
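The lookup itself is small. Here is a minimal plain C sketch of such a table, covering only the five entities that XML predefines. The names Entity, kEntities and expand_entity are hypothetical, a static array stands in for the dictionary, and the sketch assumes the parser hands you the entity’s name without the surrounding & and ;.

```c
#include <stddef.h>
#include <string.h>

/* The five entities predefined by XML itself. Anything else your
   documents use would have to be added to the table by hand. */
typedef struct { const char *name; const char *text; } Entity;

static const Entity kEntities[] = {
    { "lt",   "<"  },
    { "gt",   ">"  },
    { "amp",  "&"  },
    { "quot", "\"" },
    { "apos", "'"  },
};

/* Look up an entity by name (no leading & or trailing ;).
   Returns the replacement text, or NULL if the name is unknown. */
static const char *expand_entity(const char *name) {
    for (size_t i = 0; i < sizeof kEntities / sizeof kEntities[0]; i++) {
        if (strcmp(kEntities[i].name, name) == 0) {
            return kEntities[i].text;
        }
    }
    return NULL;
}
```

Returning NULL for an unknown name leaves the decision to the caller, who might drop the reference, pass it through unexpanded, or report an error.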
... and then use that global in the entity reference case of createStructure.
There’s a bunch of other constants you can look at, but this is the
minimal set you’ll need to get going. For completeness, here’s the
addChild callback, which is called with a pair of
structures you’ve previously created:
Wrappers and alternatives
As you can see, there are some missing features and rough corners on Apple’s XML services. Out There on the net are wrappers for Apple’s services, as well as other toolkits for dealing with XML data.
One is an Objective-C wrapper around
CFXMLTreeCreateWithDataFromURL that also provides a nicer
interface to the CFTree that gets returned.
Iconara DOM is a Cocoa framework for both reading and writing XML data.
The MetaObject folks have MPWXmlKit, which provides XML archiving
support to classes that support Cocoa’s archiving protocol,
and also includes an XML parser.
Since the code here just sticks everything into an
XMLNode, it’s pretty
easy to make one a child of the other. You could add specific logic
here to take different actions based on what the parent and the child
are. Although if you use a consistent name for the “add a child”
features of your container classes, this callback can be kept really
simple.
The last required callback is the end-of-structure callback (the
endXMLStructure field of CFXMLParserCallBacks), which is called
when the parser is done with all the work associated with parsing a
structure, in case there’s any finalization that needs to be done.
There are two optional callbacks. The first,
resolveEntity, is for references to external entities, such as
&part1;. It would let you decide how to expand the reference.
Unfortunately, this
feature is documented as “unsupported”, and so isn’t available.
The final callback gets called when errors happen.
The status code is a numeric constant that indicates the kind of error encountered, like malformed names or tags, or empty documents. For displaying end-user errors you can get a string describing the error, as well as the line number and the position in that line where the error happened. Return YES to try to continue if a non-fatal error happened. Returning NO will cause parsing to stop immediately.
That, in a nutshell, is what you need to read basic XML documents. These are especially handy when you know what you’re going to be getting and are reasonably assured that the XML will be correct.
Mark Dalrymple (email@example.com) has been wrangling Mac and Unix systems for entirely too many years. In addition to random consulting and custom app development at Borkware, he also teaches the Core Mac OS X and Unix Programming class for the Big Nerd Ranch.