Introduction to XML and C++

Over the last few years a growing number of applications and services have been using a type of text mark-up known as XML. The structure of XML, and the timing of its introduction, made it a perfect match for Java, which was then a new and fast-growing language. Its use in C++ has lagged behind somewhat, and this series of articles aims to redress the balance a little.

The series has three aims: first, to give some background on what XML is and when to use it; second, to give some guidance on using the main parsers in C++ code; and finally, to cover some advanced techniques for parsing XML files.

The roots of XML can be traced back to the 1980s, to a mark-up language called SGML (the Standard Generalized Markup Language). This was, in turn, based upon GML, an earlier mark-up language developed at IBM. In 1986 ISO adopted SGML as an international standard (ISO 8879).

SGML is big, powerful and complex to learn. It has nevertheless been widely used for marking up documents that need to be interoperable between different systems. When a standard document format was needed for Web pages, it was natural to base it upon SGML, and so HTML was created as a vastly simpler application of SGML. HTML was easy to learn, and this ease contributed greatly to the growth of the Web.

But HTML has many drawbacks, not least that different browser writers tended to interpret it in their own way. It also mixes content with the presentation of that content, and there is no way of extending the language except through a standards process.

So work started in 1996 on another mark-up language that would combine the power of SGML with the simplicity of HTML. It was to be easily extensible, efficient to parse and simple for programmers to use in applications, and it was to be platform and language independent. In February 1998 the World Wide Web Consortium (W3C) made it a Recommendation.

It is known as XML, the eXtensible Markup Language.

XML is more than a simpler version of SGML, though. It has formed the basis for a family of related technologies and protocols that greatly increase its power.

  • XLink describes ways to add hyperlinks to XML documents.
  • XPath describes a path to a particular point within an XML document.
  • XPointer builds on XPath to point to parts or ranges of an XML document.
  • XSL is a stylesheet language that specifies how the different elements in an XML document are to be displayed.
  • XSLT, XSL Transformations, provides a powerful way of transforming an XML document into different formats.
  • XML Schemas are a way to specify what is valid or not within a document.
  • DOM, the Document Object Model, is a standard object model of an XML document with a standard API.
  • CSS is a stylesheet language, widely used on the Web, that can also be used to define how an XML document is displayed.
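
To give a feel for what the DOM and XPath entries in the list mean in C++ terms, here is a minimal sketch of a document held in memory as a tree of elements, with a path lookup over it. Everything in it is an assumption made for illustration: the Element struct, the helper functions and the sample library document are invented here, and none of it is the W3C DOM API or a real XPath engine. A real parser would build the tree from the XML text and supply its own node classes.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical node type, invented for this sketch. It is NOT the W3C DOM
// API; a real parser library supplies its own node classes.
// Note: a vector of the enclosing type needs C++17 to be strictly portable.
struct Element {
    std::string name;                          // tag name, e.g. "book"
    std::map<std::string, std::string> attrs;  // attribute name -> value
    std::string text;                          // character data directly inside
    std::vector<Element> children;             // nested elements, document order
};

// Builds, by hand, the tree a parser might produce for:
//   <library>
//     <book id="b1"><title>The C++ Programming Language</title></book>
//     <book id="b2"><title>XML in a Nutshell</title></book>
//   </library>
Element make_sample_library() {
    Element title1{"title", {}, "The C++ Programming Language", {}};
    Element title2{"title", {}, "XML in a Nutshell", {}};
    Element book1{"book", {{"id", "b1"}}, "", {title1}};
    Element book2{"book", {{"id", "b2"}}, "", {title2}};
    return Element{"library", {}, "", {book1, book2}};
}

// Counts this element and all of its descendants (depth-first walk).
std::size_t count_elements(const Element& e) {
    std::size_t n = 1;
    for (const Element& child : e.children) {
        n += count_elements(child);
    }
    return n;
}

// Follows a path of child names from `start`, taking the first matching
// child at each step. This is only loosely in the spirit of an XPath
// location path such as "book/title"; it is not XPath itself.
const Element* first_match(const Element& start,
                           const std::vector<std::string>& path) {
    const Element* current = &start;
    for (const std::string& step : path) {
        const Element* next = nullptr;
        for (const Element& child : current->children) {
            if (child.name == step) {
                next = &child;
                break;
            }
        }
        if (next == nullptr) {
            return nullptr;  // no child with this name: the path does not exist
        }
        current = next;
    }
    return current;
}
```

For the sample tree, count_elements reports five elements (the library, two books and two titles), and first_match with the path {"book", "title"} lands on the title of the first book. A path that names a missing child, such as {"magazine"}, yields a null pointer.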

XML allows a document definition (a DTD or an XML Schema) to describe the vocabulary and structure of a particular kind of document, and so many different applications of XML have appeared, such as RDF, XUL, MathML and XHTML, each oriented towards a particular domain.

XML is not the right thing for all occasions. It is not a replacement for an object database. It is not a programming language. It is probably overkill for writing letters to your mother. But as a back-end document format it is very powerful and increasingly widely used.