Using SAX Parsers

This article will introduce the subject of parsing XML files, using as examples the Expat parser and the Xerces parser. In the process we will examine the two event interfaces for XML parsers, SAX1 and SAX2. I will assume that you’ve read the two previous articles in the series (Introducing XML by David Nash and History of Unicode by myself) and that you have a good understanding of C++. The article won’t cover the design of XML documents; the samples we use will of necessity be simple, designed only to demonstrate the basic facilities of the XML parsers. We will create a simple program to parse an XML file and count the characters and tags in it, showing how the program differs between Expat and Xerces.

Setting up and using a SAX parser.

Introduction

I’m assuming that the intended audience has never used an XML parser before; if you have, you may want to wait until further articles appear. My intention is to give a basic overview of setting up a parser and reading in an XML file.

The two interfaces we will play with, SAX1 and SAX2, are called event-based APIs and are straightforward interfaces built on callback functions. Originally designed for Java, they are available in many other languages, including C and C++. The original SAX1 API was designed by the members of the xml-dev mailing list in 1998 and released as a ‘de facto’ standard to the programming community. SAX2, released in May 2000, added support for namespaces, filter chains and methods for querying and setting parser properties. It is the recommended API for current applications.

Strictly speaking the SAX API is designed for Java and is described by a set of Java classes. Although the API has been ported to other languages (such as C), the ports do not, and cannot, mirror the Java API exactly. I use the term SAX rather loosely in this article to describe event-based XML parsing.

In particular, Expat is a C-based parser and has no classes as such, so the interface is of necessity an approximation. Expat is also based on the SAX1 API, so you may wonder why we are going to start the article by looking at how to use Expat. Well, for one thing, Expat is one of the most widely used XML parsers in the C world (it’s also the basis for the Perl and PHP XML modules). It is extremely fast and has a small disk and memory footprint. If you want to use XML in your own applications, Expat may be the first thing that you look at. And finally, there are some C++ wrappers available, one of the best being Jez Higgins’ SAX in C++.

The Interfaces

As I said above, the SAX interfaces were originally designed for Java programs. They live in the org.xml.sax set of packages and consist of interfaces, classes and sub-packages.

In SAX1 there are interfaces for the parser, handlers, exceptions and so on. SAX2 kept the same basic structure but deprecated some of the interfaces and added some new ones. For example, Parser is now deprecated and replaced by XMLReader. There is often also a set of helper implementations that provide barebones functionality, such as the SAX2XMLReader implementation of the XMLReader interface.

I’ll describe some of the main classes in SAX2 here using the Java names; how that works out in the C Expat parser and the C++ Xerces parser will become clear later in the article. I’ll mention when a class replaces one of the SAX1 classes.

The first set of classes are those used for parsing an XML file. They are based on the XMLReader interface (SAX1: Parser), and are usually created via an XMLReaderFactory. Most implementations provide an adapter implementation called SAX2XMLReader. The XMLReader interface provides methods for setting and getting features and properties of the parser, methods for setting handlers, and a method called parse(). Typically we obtain an object that implements XMLReader, set the properties and features that we want, set the handlers and then call the parse method. Parsing can throw exceptions, generate errors and warnings and call the handlers.

The handlers are used for handling the various events and come in different flavors. The main handler is based on the ContentHandler (SAX1: DocumentHandler) interface. There are also the ErrorHandler (errors, warnings and fatal errors), the DTDHandler (DTDs) and the EntityResolver (for external entities). There is also a Locator class that is used to keep track of where in an XML file the parser is.

The XMLReader is normally passed an InputSource (which encapsulates an input stream) and during parsing can throw various SAXExceptions.

There are various helper classes and interfaces available, such as the Attributes class to encapsulate attributes, different adapter and implementation helper classes and a factory class to produce the required parser.

In a nutshell, to parse an XML file, you must create a parser, tell it what functions will handle the various events it creates from the file and then let it rip on the file.

The Xerces parser has pretty much the same classes and structure as the Java SAX API (along with other classes for DOM and so on), but Expat, because it is written in C, has a set of functions that try to mimic the above functionality as much as possible. Next we’ll have a look at Expat.

An Example XML Document

We need a simple XML document to play with. I could start off with an example using the hypothetical Person who is part of a hypothetical AddressBook, but I won’t. I’ve seen enough of them and I refuse to write another… instead we’ll look at a hypothetical configuration file for a hypothetical application.

Let’s assume that we have an application that stores its state (user name, last used file, etc.) in an XML file at shut down, and reads this file at start up to restore its state to what it was before.

Listing 1 shows the first cut at this file; we’ll expand on it further as we go.

Using Expat.

Expat was originally written by James Clark, one of the pioneers in the world of XML and SGML. It’s written in C and is available on many platforms. Recently development of the parser moved to SourceForge, where it is overseen by Clark Cooper. To compile the following examples you will need to download and install the Expat parser from expat.sourceforge.net (I used the 1.95.2 version for this article).

Listing 2 shows a bare bones program that creates a parser, parses a file and exits. In the process it counts the number of characters, tags and attributes that it reads.

On its own, not a terribly exciting application, but it demonstrates the four fundamental actions in using Expat to parse an XML file: create a parser, assign event handlers, give it data to parse, and then free the parser.

All the Expat functions are prefixed by “XML_” and they all take an instance of the parser as their first parameter (except, obviously, the XML_ParserCreate function).

We create a parser using XML_ParserCreate(NULL), which returns an XML_Parser (an opaque pointer to the parser). It is this value that we pass around to the other functions and finally hand to XML_ParserFree(p) to free the parser.

XML_ParserCreate takes one parameter, the document encoding, which overrides the encoding specified in the document itself. In this case we pass NULL to indicate that the document will specify the encoding, or else to use the default UTF-8 encoding.

The interesting stuff happens between these two calls to create and free the parser. As the parser reads through the file it will generate events whenever it encounters various parts of the XML, for instance, start tags, end tags and so on.

To express our interest in these events, we register a set of functions with the parser that it will call back when a specified event happens. In Listing 2 we tell the parser that we are interested in knowing whenever a start tag, end tag or character data is encountered.

We use the functions:

  XML_SetElementHandler(p, startFn, endFn);
  XML_SetCharacterDataHandler(p, characterFn);

to register three static functions to handle the callbacks, passing, as usual, the parser as the first parameter. The other parameters are the function pointers for the start handler, end handler and character data handler.

We make use of a third function:

  XML_SetUserData( p, pec );

to tell the parser to pass this pointer to the callback handler functions. This can be a pointer to any data that we want passed to the functions (it’s a void*); in this case we declare a structure ElementCounts to keep track of the tag and character counts that we receive. Note that we are responsible for disposing of the structure when we are finished with it; the parser never frees it for us.
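The user data pointer is the usual C idiom for giving a plain function callback access to some state. Here is a minimal sketch of the idiom on its own, with no parser involved. All the names in it (Counts, StartCallback, fireStartEvent, countStart) are my own inventions for illustration and are not part of Expat; fireStartEvent simply plays the role of the parser.

```cpp
// Hypothetical stand-ins, for illustration only; none of these names are Expat's.
struct Counts {
    unsigned int tagCount;
};

// Callback type: the first argument is the opaque user data pointer.
typedef void (*StartCallback)(void* userData, const char* name);

// Plays the role of the parser: it knows nothing about the data and simply
// hands the registered pointer back to the callback on every event.
void fireStartEvent(StartCallback cb, void* userData, const char* name)
{
    cb(userData, name);
}

// The handler casts the void* back to the type it knows was registered.
void countStart(void* userData, const char* /*name*/)
{
    Counts* counts = static_cast<Counts*>(userData);
    ++counts->tagCount;
}
```

Calling fireStartEvent(countStart, &counts, "config") bumps counts.tagCount by one. The cast is only safe because the code that registers the pointer and the handler agree on its type; nothing checks this for us.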

Now, on to the handlers themselves. The start handler takes the form:

void startFn(void* data,
             const XML_Char* name,
             const XML_Char** attr);

The first parameter is the pointer to the user assigned data mentioned above. The second parameter is a pointer to a character array containing the element name and the third is a pointer to an array of character pointers holding the attributes: attr[0] is the first attribute name, attr[1] is its value and so on.

The end handler takes the form:

void endFn(void *data,
           const XML_Char *name);

with similar meaning to above. If the tag is an empty tag (e.g. <br/>) then calls are made to the start and end handlers in order.

The character data handler is a little more complex. It takes the form:

void characterFn(void* userData,
                 const XML_Char* s,
                 int len);

Here s is a pointer to an array of characters that are not null terminated. The number of valid characters is contained in the len parameter. Only that number of characters should be copied out and stored for use by the client. And there is no guarantee that this string is the whole string within the element. In fact typically the first call will contain blanks and new lines, the next calls will have the data and the last call contains trailing blanks and new lines. But that cannot be assumed.
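To make those two rules concrete (copy only len characters, and expect more than one call per element), here is a minimal sketch of a handler-shaped function that appends each chunk to a std::string. It assumes that XML_Char is plain char, which is the case in a default build of Expat; TextBuffer and appendChars are hypothetical names of my own.

```cpp
#include <string>

// Hypothetical accumulator, not part of Expat: collects the text of one element.
struct TextBuffer {
    std::string text;
};

// Shaped like a character data handler. s is NOT null terminated, so exactly
// len characters are copied, and they are appended rather than assigned
// because the text of a single element may arrive in several calls.
void appendChars(void* userData, const char* s, int len)
{
    TextBuffer* buf = static_cast<TextBuffer*>(userData);
    buf->text.append(s, static_cast<std::string::size_type>(len));
}
```

If the handler is called three times with the leading white space, the data and the trailing white space, the buffer ends up holding all three pieces in order, which is why the collected text usually needs trimming afterwards.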

After opening the file for reading, we read it in a chunk at a time and pass each chunk to the parser with the XML_Parse function:

  XML_Parse(p, buff, len, done)

There are four parameters: the parser, a buffer to parse, the number of bytes in the buffer, and a flag saying whether this is the final buffer. By passing the length to the function we don’t need to ensure that the buffer is null terminated.

The function returns 0 if an error occurred, otherwise 1.

And that is the basic structure of a program for reading in an XML file and handling the various events that the parser creates. Expat provides a lot of different events and we can provide handlers for all of them; see the header files or the documentation if you’re interested. We’ll look in more detail at some of them in another article.

I’ll extend the program now to make it more practical for our purposes by reading in the user’s name and password. The main file remains the same, the changes we will make are in the handlers and in the data structure that we pass around. Listing 3 shows the changes to the handlers and structure (test03.cpp in the source package). Replace the three event handlers, define a new data structure and modify the code to print the results.

What we are doing, in short, is keeping track of which tag we are within and, based on that, collecting or ignoring the character data that we are passed.

Some points to note from the code:

1. The characterFn function can be called more than once within the same tag, so the characters will be appended to the string until we reach the end of that tag.
2. If we look at the strings as output by std::cout, we will see that we also get some of the white space:


User name: <
jsmith
>

Our program has also appended the new line/carriage return characters to the string. So, in a real world application, we would want to trim the string of extra characters.

3. Using a vector to store the tag names and indexing into it is just one way of keeping track of which tag we are in; there are many others. Using a stack, pushing the current tag onto it in the start handler and popping it off in the end handler, is another popular technique. With a large number of tags a map<string,int> is a good solution.
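Points 2 and 3 can be sketched in a few lines of standard C++. This is only one possible shape, not the code from the source package: trim strips the white space that the character handler collected, and TagStack (a hypothetical helper class of my own) implements the push-and-pop technique.

```cpp
#include <string>
#include <vector>

// Strip leading and trailing blanks, tabs and line breaks from collected text.
std::string trim(const std::string& s)
{
    const char* const ws = " \t\r\n";
    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos) {
        return std::string();   // nothing but white space
    }
    const std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Hypothetical tag tracker: push in the start handler, pop in the end handler.
class TagStack {
public:
    void push(const std::string& name) { open_.push_back(name); }
    void pop() { if (!open_.empty()) open_.pop_back(); }
    // Name of the innermost open element, or "" when no element is open.
    std::string current() const
    {
        return open_.empty() ? std::string() : open_.back();
    }
private:
    std::vector<std::string> open_;
};
```

One difference from the single currentTag variable in Listing 3 is that, after an end tag, the stack reports the enclosing element again instead of forgetting where it is. That matters as soon as the interesting data sits in nested elements.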

Handling attributes

The attributes are passed to the start element handler as an array of char*’s, the first element of the array being the first attribute name, the next its value, the next the second attribute name and so on. The list is ended by a NULL entry. In order to keep this article short, I’ll just give a verbal description here; the online source file (test04.cpp) has the details.

A typical means of accessing the attributes is simply to loop through them, like so:

for( int i = 0; attr[i]; i += 2 ) {
    // do something with attr[i] and attr[i+1]
}

attr[i] and attr[i+1] are of type const XML_Char*, and we will need to make a copy if we want to hang on to them. In our example, we assign them to a string.
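As a minimal sketch of that copy step (it is not the code from test04.cpp), the loop can fill a map keyed by attribute name. The function name copyAttributes is hypothetical, and the sketch assumes that XML_Char is plain char.

```cpp
#include <map>
#include <string>

// Copy a NULL-terminated array of name/value pairs into a map of strings.
// Constructing std::string objects makes the copy, so the result can be kept
// after the handler has returned. Assumes XML_Char is plain char.
std::map<std::string, std::string> copyAttributes(const char** attr)
{
    std::map<std::string, std::string> result;
    for (int i = 0; attr[i]; i += 2) {
        result[attr[i]] = attr[i + 1];
    }
    return result;
}
```

Inside a start handler this would be called as copyAttributes(attr), after which a value can be looked up by name, for example with find("timestamp").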

Error handling

Errors in an XML file can be broken down into 3 types.

  1. System level errors (bad file, disk error and so on).
  2. Badly formed XML.
  3. Non validated XML.

System level errors can be taken care of in the normal way, such as checking that the file can be read and so on. Non validated XML errors will not happen with Expat as it is not a validating parser. That leaves us with Badly formed XML errors.

Expat is quite good at returning intelligent parser error strings (in English) or error codes, and there are functions to find the line number, column number and byte offset of the offending byte (note that it is a byte offset, not a character offset). An error is indicated when XML_Parse() returns 0, in which case the error code and error string functions can be called.
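The difference between a byte offset and a character offset only shows up when a character takes more than one byte. As a rough illustration, assuming the document is well-formed UTF-8 and that we still have its bytes to hand, the following sketch converts one into the other. The function name is hypothetical and nothing here is provided by Expat.

```cpp
#include <cstddef>
#include <string>

// Count the UTF-8 characters (code points) that start before byteOffset.
// Continuation bytes have the bit pattern 10xxxxxx and are not counted.
std::size_t charOffsetFromByteOffset(const std::string& utf8,
                                     std::size_t byteOffset)
{
    std::size_t chars = 0;
    for (std::size_t i = 0; i < byteOffset && i < utf8.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) != 0x80) {
            ++chars;
        }
    }
    return chars;
}
```

For a single-byte encoding such as the ISO-8859-1 used in Listing 1 the two offsets are the same and no conversion is needed.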

So far we’ve looked at just a few of the functions in expat.h. I’ll take a look at some of the other functionality in another article; what we’ve covered so far is enough to have you parsing XML.

The Xerces Parser

Expat is designed to be small, fast and usable on all platforms, an aim that it achieves, but at the cost of a slightly clumsy user interface and of supporting only the SAX1 interface. The Xerces parser is at the other extreme, providing SAX1, SAX2, DOM1 and DOM2 interfaces, all wrapped in a C++ API. There are language bindings for Java, C++, Perl and MS COM. Like Expat, the library is intended to be cross platform across a wide range of operating systems.

Xerces was a project started by the Apache foundation in 1999 (based on IBM’s XML4C) and is still in development. But although still evolving, it is known to be stable and is in use in many applications. Currently (May 2002) the version is at 1.7.0.

Xerces has a different philosophy from that of Expat. Whereas Expat does one thing very well, Xerces aims to provide a full toolkit of XML parsing tools: it supports SAX1 and SAX2, DOM1 and DOM2, namespaces and XML Schema. It is also part of a larger toolkit, hosted at xml.apache.org, that includes a wide range of tools for working with XML.

Using Xerces

To understand the differences between Expat and Xerces, we’ll do exactly the same in Xerces as we did in Expat. See listing 4 (source test05.cpp) for the barebones code to create a parser, read a file and exit (the code does nothing practical). In this example we will make use of the SAX2 interface.

The first difference that jumps out from this code is that the Xerces library needs to be initialized before it can be used, and terminated when it is no longer needed, via calls to the two static methods XMLPlatformUtils::Initialize() and XMLPlatformUtils::Terminate(). The actual working of the calls will depend on the platform Xerces is built for. (Note, on Xerces V1.5 and earlier there could be one, and only one, call to Initialize in an application, otherwise the application would segfault. This has been rectified in the later versions).

A second difference is that we now create our parser using a factory method, XMLReaderFactory::createXMLReader(), which returns an instance of the parser (or reader as it is called in SAX2).

Finally we note how the handlers are created. There are three main handlers that the parser makes use of, a document handler for the content of the XML document, an error handler for any errors or warnings in the parse and a DTDHandler. Xerces provides a utility class DefaultHandler, that acts as a ‘do nothing’ class and can be used in place of an actual handler class. By deriving from this we can implement just the functionality that we need.

All in all, a much cleaner interface than that of Expat. To do some useful work in Xerces, the only thing we need to do is provide a document handler class that can handle the events created by the parser, and we do that by inheriting from the DefaultHandler class. In deriving from DefaultHandler we can choose to override the methods that we need.

DefaultHandler inherits from five abstract classes in total (ContentHandler, ErrorHandler, EntityResolver, DTDHandler and LexicalHandler) but at present we are only interested in dealing with start element, end element and character data events from the ContentHandler interface.

Here is the handler class (from test06.cpp):

class OurHandler : public DefaultHandler
{
public:
    OurHandler() : charCount(0),tagCount(0),attrCount(0){}

    void startElement( const XMLCh* const uri
                      ,const XMLCh* const localname
                      ,const XMLCh* const qname
                      ,const Attributes&  attrs )
    {
        ++tagCount;
        attrCount += attrs.getLength();
    }

    void endElement(const XMLCh* const uri
                   ,const XMLCh* const localname
                   ,const XMLCh* const qname)
    {}

    void characters( const XMLCh* const chars
                    ,const unsigned int length)
    {
        charCount += length;
    }

    int getCharCount() { return charCount; }
    int getTagCount() { return tagCount; }
    int getAttrCount() { return attrCount; }

private:
    int charCount;
    int tagCount;
    int attrCount;
};

and use that as the content handler instead of the DefaultHandler:
    OurHandler handler;
    parser->setContentHandler(&handler);
    parser->setErrorHandler(&handler);

Add some code at the end to print out the results and voila, the Xerces equivalent to the program we wrote in the Expat section. Because all our handlers are tucked up neatly in a class, there is no need to pass around a separate structure to store the data, it can be part of the class. (See source test06.cpp).

You’ll notice that we’re using the handler as a content handler and as an error handler. This works because the super class, DefaultHandler, supplies default implementations of the three error methods (warning(…), error(…), fatalError(…)), as well as a few other methods. The defaults for warning and error do nothing; note that in Xerces-C the default fatalError rethrows the exception it is given rather than ignoring it. This makes it easier to specialize the class for just the functions that we need. In a full system, you would probably use separate classes for content handling and error handling.
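The design idea behind DefaultHandler can be shown in miniature: a base class gives every event an empty default, so a derived class only has to override the events it cares about, and one object can then be registered in more than one role. The classes below are hypothetical stand-ins of my own, deliberately much simpler than the real Xerces classes.

```cpp
#include <string>

// Hypothetical miniature of the pattern; these are not the Xerces classes.
class BaseHandler {
public:
    virtual ~BaseHandler() {}
    // Every event has a do-nothing default...
    virtual void startElement(const std::string& /*name*/) {}
    virtual void endElement(const std::string& /*name*/) {}
    virtual void warning(const std::string& /*message*/) {}
    virtual void error(const std::string& /*message*/) {}
};

// ...so a derived class overrides only the events it is interested in.
class TagCounter : public BaseHandler {
public:
    TagCounter() : tagCount(0) {}
    virtual void startElement(const std::string& /*name*/) { ++tagCount; }
    int tagCount;
};
```

A TagCounter can be handed to anything that expects a BaseHandler, whether it is going to send content events or error events; the calls that were not overridden quietly fall through to the empty defaults.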

In a similar manner to our example in Expat, we can modify the program to extract some of the data simply by providing a different set of handlers that detect the ‘user’ and ‘login’ tags and save the data. See test07.cpp for the details.

The technique is similar to that used in the Expat example: in the start handler we keep track of which element we are within, and in the character handler we collect the strings that we are interested in.

The main difference is in the way the attributes are presented to us: Xerces creates an object of type Attributes. Attributes is usually implemented as a kind of vector, which contains a list of attribute name/value pairs (Attributes itself is an abstract class). These can be retrieved either by index or by name. For example:

const XMLCh* timestamp = attrs.getValue( 0 );
or:
const XMLCh* timestamp = attrs.getValue( "timestamp" );

An Attributes implementation will also support a set of other methods, allowing us to find the type of the attributes, the number of attributes and so on.

If you take a look at the code in test07.cpp, you’ll note that I’m not simply fetching a pointer to a char array. The actual code, in brief, is this:

 char buff[BUFF_SIZE];
 XMLString::transcode( attrs.getValue((int)0), buff, BUFF_SIZE-1);

and this deserves a brief explanation, although it’s a complex matter that I’ll devote more time to in a future article. Xerces deals with UTF-16 encoding internally, and XMLCh is typedef’d to be an unsigned short (or a wchar_t). However, in our simple examples, we’re dealing with plain old character data, so we need to transform it. For this we use the transcode method from the XMLString utilities; here the result of getValue() is transcoded and stored into the buffer. The transcode family of methods is a bit more complex than this quick usage would imply, but more on that at a later date.
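To give a feel for what such a conversion involves, here is a deliberately naive sketch that narrows a UTF-16 string into a fixed char buffer. It is not how Xerces implements transcode (a real transcoder converts to the local code page); Utf16Unit and narrowToAscii are hypothetical names, and anything outside 7-bit ASCII is simply replaced by a question mark.

```cpp
#include <cstddef>

// Hypothetical stand-in for XMLCh; the real typedef depends on the platform.
typedef unsigned short Utf16Unit;

// Copy a null-terminated UTF-16 string into a char buffer of bufSize bytes,
// replacing anything outside 7-bit ASCII with '?'. The result is always
// null terminated. Returns false if the text had to be truncated to fit.
bool narrowToAscii(const Utf16Unit* src, char* buf, std::size_t bufSize)
{
    if (bufSize == 0) {
        return false;
    }
    std::size_t i = 0;
    for (; src[i] != 0 && i + 1 < bufSize; ++i) {
        buf[i] = (src[i] < 0x80) ? static_cast<char>(src[i]) : '?';
    }
    buf[i] = '\0';
    return src[i] == 0;
}
```

Even this toy version has to decide what to do about characters that do not fit the target encoding and about buffers that are too small, which hints at why the real transcoding methods are more involved.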

Summary

This has been a short tour of the Expat and the Xerces parser, two of the main SAX type XML parsers available.

The Expat parser has a long and distinguished pedigree, having been created by one of the luminaries of the SGML world, James Clark, and it has been in use in real world applications for many years now. Updates to the code are few and far between, which is a sign that it works well and the bugs have been ironed out of it. Despite a ‘C-style’ interface, the basic functionality is easy to work with and for simple jobs this is usually the right choice.

Xerces is a parser that is still in development and aims to cover a lot more ground than Expat. It has a fully object oriented design and API and covers the SAX1, SAX2, DOM1 and DOM2 APIs. Despite being in development the parser is stable and usable in a production environment, although you may not want to rely on some of the more esoteric functionality without extensive testing.

In this article I’ve given a brief overview of what is involved in setting up a parser and parsing a simple file. There are lots of online resources that can take you through the next step, of using them in real world applications. In the next article I’ll skip the ‘intermediate’ phase and come back to look at some of the more obscure aspects of parsing XML.

FURTHER READING

Expat Home page:

http://expat.sourceforge.net

Xerces Home page:

http://xml.apache.org/xerces-c/

The SAX project.

http://sax.sourceforge.net/

ExpatPP

http://www.oofile.com.au/xml/expatpp.html

C++ Wrapper from Tim Smith

http://www.codeproject.com/soap/ExpatImpl.asp

SAX in C++ from Jez Higgins

http://www.jezuk.co.uk/SAX/

Oxml wrapper (pages in French).

http://apodeline.free.fr/Oxml/

LibXML

http://xmlsoft.org/


Listing 1, the sample xml file, version 1

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE config>
<config datecreated="20011210">
  <user>
    John Smith
  </user>
  <login>jsmith</login>
  <password>topsecret</password>
  <lastfiles>
    <lastfile timestamp="20011210T1002">accounts.txt</lastfile>
    <lastfile timestamp="20011190T1132">/home/jsmith/docs/letter.doc</lastfile>
  </lastfiles>
</config>


Listing 2, basic parsing

File: test02.cpp
/*
 * Test of SAX parsing, using Expat
 *
 * Author: Tim Pushman, gnomedia 2002
 */
#include <cstdlib>
#include <iostream>
#include <fstream>
#include <string>

#include <expat.h>

#define BUFFSIZE     2048

typedef struct {
  unsigned int tagCount;
  unsigned int attrCount;
  unsigned int charCount;
} ElementCounts;

// Start tag: count the tag, then count its attributes.
static void
startFn(void* data, const XML_Char* el, const XML_Char** attr)
{
  ElementCounts* pc = (ElementCounts*)data;
  pc->tagCount++;
  // attr holds name/value pairs and ends with a NULL entry
  for( int i = 0; attr[i] != NULL; i += 2 ) {
    pc->attrCount++;
  }
}

// End tag: nothing to count.
static void
endFn(void* data, const XML_Char* el)
{
}

// Character data: ch is not null terminated, len gives its length.
static void
characterFn( void* data, const XML_Char* ch, int len )
{
  ((ElementCounts*)data)->charCount += len;
}

int main( int argc, char** argv )
{
  if( argc < 2 ) {
    std::cout << "Usage: test01 some.xml" << std::endl;
    exit(1);
  }

  std::string filename( argv[1] );
  std::cout << "Using " << filename.c_str() << std::endl;

  std::ifstream ifs( filename.c_str() );
  if( ifs.fail() ) {
    std::cout << "Error opening input file, exiting..." << std::endl;
    exit(2);
  }

  XML_Parser p = XML_ParserCreate(NULL);
  if (! p) {
    std::cerr << "Failed to create parser" << std::endl;
    exit(3);
  }
  ElementCounts* pec = new ElementCounts();
  pec->tagCount = pec->attrCount = pec->charCount = 0;
  XML_SetUserData( p, pec );
  XML_SetElementHandler(p, startFn, endFn);
  XML_SetCharacterDataHandler(p, characterFn);

  // parser ready and raring to go.
  bool done = false;
  int len = 0;
  int totalCount = len;
  char buff[BUFFSIZE];
  while( !done ) {
    ifs.read( buff, BUFFSIZE );
    // a short read means we have reached the end of the file
    done = ( (len = ifs.gcount()) < BUFFSIZE);
    totalCount += len;
    if( ifs.bad() ) {
      std::cerr << "Error in read operation." << std::endl;
      exit(4);
    }
    if (! XML_Parse(p, buff, len, done)) {
      std::cerr << "Parse error at line " <<  XML_GetCurrentLineNumber(p);
      std::cerr << " with " << XML_ErrorString(XML_GetErrorCode(p))
                << std::endl;
      exit(5);
    }
  }

  // free the parser when we've finished with it
  XML_ParserFree(p);
  std::cout << "Done,\nTotal chars read: " << totalCount << std::endl;
  std::cout << "Tags counted: " << pec->tagCount << std::endl;
  std::cout << "Attrs counted: " << pec->attrCount << std::endl;
  std::cout << "Chars counted: " << pec->charCount << std::endl;
  delete pec;
  return 0;
}


Listing 3, Additions to main code.

Replace the data structure and the three event handlers with this:

// needs #include <vector> and #include <cstring> (for strcmp) at the top of the file
struct UserData {
  enum { NO_TAG = -1, TAG_USER, TAG_PASS };
  std::vector<std::string> tags;
  UserData( ) : done(false),currentTag(NO_TAG)
  {
      tags.push_back("login");
      tags.push_back("password");
  }
  std::string username;
  std::string password;
  bool done;
  int currentTag;
};

static void
startFn(void* data, const XML_Char* el, const XML_Char** attr)
{
  UserData* d = (UserData*)data;
  if( strcmp( el, d->tags[UserData::TAG_USER].c_str() ) == 0 ) {
    d->currentTag = UserData::TAG_USER;
  }
  else if( strcmp( el, d->tags[UserData::TAG_PASS].c_str() ) == 0 ) {
    d->currentTag = UserData::TAG_PASS;
  }
  else {
    d->currentTag = UserData::NO_TAG;
  }
}

static void
endFn(void* data, const XML_Char* el)
{
  ((UserData*)data)->currentTag = UserData::NO_TAG;
}

static void
characterFn( void* data, const XML_Char* ch, int len )
{
  std::string s( ch, ch+len );
  switch( ((UserData*)data)->currentTag ) {
    case UserData::TAG_USER:
      if( !s.empty() ) ((UserData*)data)->username.append(s);
      break;
    case UserData::TAG_PASS:
      if( !s.empty() ) ((UserData*)data)->password.append(s);
      break;
    default:
      // do nothing
      break;
  }
}

//
// and in the main body:
//
// new handler installation
  UserData* pud = new UserData();
  XML_SetUserData( p, pud );
//end new
// ....
ifs.close(); // as before
// new printing code.
std::cout << "User name: <" << pud->username.c_str() << ">" << std::endl;
std::cout << " Password: <" << pud->password.c_str() << ">" << std::endl;
// end new.
delete pud;


Listing 4, A barebones program to parse a file with Xerces SAX2 interface.

/*
 * Test of SAX parsing using Xerces C++ parser.
 *
 */

#include <util/PlatformUtils.hpp>
#include <sax2/XMLReaderFactory.hpp>
#include <sax2/SAX2XMLReader.hpp>
#include <sax2/DefaultHandler.hpp>

const char* xmlFile =
           "localprojectscsaxdatademo.xml";

int main( int argc, char** argv )
{
    // initialize the Xerces library.
    try
    {
        XMLPlatformUtils::Initialize();
    }
    catch( const XMLException& )
    {
        // do something
        return 1;
    }

    SAX2XMLReader* parser = XMLReaderFactory::createXMLReader();
    DefaultHandler handler;
    parser->setContentHandler(&handler);
    parser->setErrorHandler(&handler);

    try {
        parser->parse( xmlFile );
    }
    catch( const XMLException& ) {
        // do something
    }

    delete parser;

    // And call the termination method
    XMLPlatformUtils::Terminate();

    return 0;
}
