In this article I will provide some background to character sets and character encodings. The focus is on what is needed to work with XML parsers, as a preliminary to further articles in the series. For this reason there are some areas (glyphs and representation for example) that have not been covered.
It is impossible to work with XML and not come across the subject of character encodings. If XML is the markup language for a document and characters are the atoms that make up the document, then XML will need to have intimate knowledge of how the document is encoded, in order to understand what a character is in this document.
And given the multitude of platforms, operating systems and serialisation formats, that is no simple task. The design of the Universal Character Set (or Unicode) was an attempt to standardise how a character is represented in a computer and is thus an important part of making XML a standard that is not dependent on any underlying implementation. The various Unicode Transformation Formats (UTF) are a way of standardising how the UCS is encoded in a serial format.
In the beginning were the dot and the dash… Probably the earliest form of character set and encoding was that used by Morse for the telegraph, back in 1844, when he sent the famous first message from Washington to Baltimore ("WHAT HATH GOD WROUGHT"). Although not related to any base 2 encoding, Morse code was the first attempt to represent alphabetic characters as a series of bits (or, in this case, dots and dashes). Morse code was a varying length code, using one bit for the common characters "E" and "T" and up to 6 bits for some punctuation characters. And, yes, I am using the term "bit" rather loosely here.
Morse code worked well for human operators, but for mechanical processing, a fixed length code would be a great improvement and in 1874 Baudot came up with a fixed length, 5 bit code to represent characters. By defining a "shift-in" key, he managed to get about 60 characters/numbers out of the coding. The mechanics of reading and writing this code were handled by a horrendously complicated piece of apparatus, the "keyboard" being operated by five fingers (two from the left hand and three from the right) and resembling a very short piano.
Around the turn of the century a New Zealander, Donald Murray, developed something that more closely resembles a typewriter, using codes based on the Baudot set. The main criterion for the layout of the codes was that common characters should create the least amount of mechanical movement, so the letter "E" had the value 1 (followed by "A", "S" and "U"). The Western Union Telegraph Company licensed the technology from Murray, and with a few changes to the code, it was to remain as it was into the 1950’s. In the 1930’s the French standards institute took the Baudot/Murray code and used it as the basis for the ITA2 standard ("International Telegraphy Alphabet Nr.2", I have no idea what happened to Nr.1).
So far none of the codes make any use of lower case characters or of formatting codes, although ITA2 did have codes for CR and LF. It was left to the U.S. military to come up with a larger code set that would contain the full set of upper and lower case English characters, with numerals, punctuation and a set of control characters. It was known as FIELDATA and can be seen as the precursor to the ASCII set, the alphabetic characters being in sorted order ("a" - "z") and the numerals in numeric order. It was a 6-bit code (the standard size of a character in those days).
In June 1963, on the basis of the FIELDATA codes, the American Standards Association (ASA, in reality IBM and AT&T) created the ASCII-63 standard (American Standard Code for Information Interchange). ASCII-63 is almost recognizable to us: the control codes are all below 0x20, space is 0x20 and then follow the numbers, punctuation and upper case characters (with "A" as 0x41). The only glaring omission in ASCII-63 is that there are no lower case characters!
In October 1963 the ISO standards body decided that the world needed lower case characters, so these were added in, some minor changes were made to the punctuation characters, and the standard was released as ECMA-6. In 1967 the ASA adopted ECMA-6 and released it as ASCII-1967, a 7-bit code containing 128 character codes that has remained in use until today.
Apart from some accented characters in the original Baudot code, all the above codes contain only the standard English characters. At first other countries started replacing some of the characters/punctuation with their own national characters and registering these changes as 7-bit ISO code sets. Unfortunately, this caused total incompatibility between countries that wanted to exchange data and so ISO extended the ASCII data set to be 8 bits long, thereby doubling its size. The original 128 7-bit codes were kept as before, and countries were able to utilize the other 128 codes for their national character sets. This resulted in an explosion of differing code sets, mainly of the ISO 8859-n type, but also including Shift-JIS, ISO-2022-JP, and J-EUC for example. IBM trotted off to do its own thing by inventing EBCDIC, also an 8-bit code set.
Clearly, the stage was set for another upgrade of the systems that we use for representing characters in a computer. In brief, a larger set of characters existed than could be represented by one homogeneous 8-bit code set. A 16-bit character set was needed.
Time for a standards committee! Or two in fact, the Unicode Consortium and ISO/IEC. Fortunately for the sanity of programmers everywhere, these two bodies have decided to cooperate and effectively the two standards are interchangeable. The Unicode Consortium (a special interest group of US manufacturers) was first off the ground in defining a 16-bit character set, Unicode V 1.0, first published in 1991 and followed in 1993 by V 1.1. The ISO/IEC had been creating something completely different, but with V 1.1 of the Unicode standard they adopted that and it became ISO/IEC 10646 Universal Multiple-Octet Coded Character Set (normally abbreviated to UCS). Since then the ISO/IEC standards and the Unicode standards have remained in step, the main difference being that Unicode is a 16 bit subset of the ISO/IEC 10646 32 bit character set, but for practical purposes they are interchangeable.
Unicode V 2.0 arrived in 1996, followed by V 2.1 in 1998, V 3.0 in 1999 and V 3.1 in 2001.
The remainder of this article will examine Unicode / ISO/IEC 10646-1 in further detail. I will refer to Unicode to cover both from now on, unless there are some differences to be pointed out. Technically, Unicode is a subset of a vastly larger set of codes covered by 10646.
Before continuing, it would be good to define some terms that we will be using:
- a character is, beyond the usual meaning of an alphabetic character used to create words, any atomic component with semantic significance, thus including numbers and punctuation.
- a character set is a set of such characters that can be used together to create words and sentences in a particular language. For example, the Latin character set, or the Cyrillic character set.
- a coded character set is a character set and its associated (numeric) codes. For instance, ASCII defines a coded character set, where the Roman letter "a" is represented by the number 97, "b" by 98, and so on.
- a code point is a character code within a coded character set. For example, the code point for "A" is 0x0041 (decimal 65) in ASCII and in Unicode.
- an encoding is a serialised form of a coded character set, as used for files or strings. An encoding maps a character onto one or more bytes. Examples of encoding schemes are UTF-8, Cp1296, ISO-8859-1 and GBK (Simplified Chinese).
- UCS (Universal Character Set) is a term commonly used in XML to describe both Unicode and the ISO/IEC 10646 character systems.
- a script is the set of characters needed to write a set of languages, such as the Latin script used for most European languages, or the Devanagari script used for several Indian languages. Some languages, such as Japanese, use more than one script.
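To make the difference between a code point and an encoding concrete, here is a small sketch in Python (chosen purely for brevity; nothing in it is specific to XML). It takes one character, looks up its code point, and then serialises it with two different encodings.

```python
# One character, one code point, two different encodings.
ch = "\u00f6"                           # o-umlaut, written as an escape

code_point = ord(ch)                    # its number in the coded character set
latin1_bytes = ch.encode("iso-8859-1")  # one byte in ISO-8859-1
utf8_bytes = ch.encode("utf-8")         # two bytes in UTF-8

print(hex(code_point))                  # 0xf6
print(latin1_bytes.hex())               # f6
print(utf8_bytes.hex())                 # c3b6

# ASCII and Unicode agree on the code points of the basic Latin letters.
print(ord("a"), hex(ord("A")))          # 97 0x41
```

The code point stays the same throughout; only the bytes used to serialise it change with the encoding.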
Once we enter into the world of the Universal Character Set we enter a world of considerable complexity, a complexity that comes from the sheer number of characters that need to be represented and also by the need for compatibility for existing standards.
The ISO/IEC 10646 standard proposes both 16 bit and 32 bit representations of the world’s character systems, whereas Unicode (up to V3.1) is a 16 bit representation and as such is a subset of the full 10646 standard. The Unicode set coincides with the lowest plane of the 10646 standard, which has the first two octets set to zero. This is often called the Basic Multilingual Plane (BMP), 10646 often being thought of as a set of planes, with 256 groups of 256 planes, of which Unicode is identical to the BMP (Plane 0 in Group 0).
With Unicode V3.1, character codes were added outside of the BMP (that is, with a value greater than 0xFFFF), marking the move from a 16 bit system to a 32 bit system. See the next section for further details; in this section we will restrict ourselves to Unicode V3.0.
The Unicode character set can be divided up into four zones, itemised in Table 1:
|Zone||Contents|
|A||ASCII, Latin-1 and Latin Extended|
|A||Greek, Coptic, Cyrillic, Armenian, Hebrew and other Eastern languages|
|A||Superscripts and subscripts, currency symbols|
|A||Mathematical and graphical shapes|
|A||Chinese, Japanese and Korean cursives and phonetics|
|I||Unified ideographs from the Chinese, Japanese and Korean languages|
|O||Yi symbols, Yi radicals, surrogate pairs|
|R||Private use, compatibility area, Arabic presentation forms, Arabic ligatures|
A closer look at the A-Zone gives us this (partial) table of code values:
|0x0000 - 0x007F||ASCII|
|0x0080 - 0x00ff||Latin-1|
|0x0100 - 0x017f||Latin Extended A|
|0x0180 - 0x024f||Latin Extended B|
|0x0250 - 0x036f||Spacing and diacritical marks|
|0x0370 - 0x03ff||Greek and Coptic|
|0x0400 - 0x04ff||Cyrillic|
|0x0530 - 0x058f||Armenian|
|0x0590 - 0x05ff||Hebrew|
|0x0600 - 0x06ff||Arabic|
|0x0700 and up||Further sets.|
The table continues in similar fashion for all the other alphabets, each script/alphabet having its own section of the code. As can be seen, the size of the block for each language varies as necessary.
All the code blocks mentioned so far have mapped onto various international character sets, but there are some codes in the zones above (see the zone table) that don’t. In particular, the surrogate pairs (in Zone O) and the private use area (in Zone R) do not, directly, contain any characters.
The surrogate pair codes are important but currently not in widespread use. A standard 16-bit code point can address 65,536 different characters in theory, and when it was realized that this was not enough, a set of code points, called the surrogates, was set aside. There are two sets: the high surrogates, from 0xD800 to 0xDBFF, and the low surrogates, from 0xDC00 to 0xDFFF. High surrogate values between 0xDB80 and 0xDBFF are reserved for private use. As the name implies, the surrogates come in pairs (a high surrogate followed by a low surrogate), but each pair is treated as a single code point in the range 0x10000 to 0x10FFFF (the supplementary code points). How this works in practice will become clear when we discuss character encoding schemes.
The mapping is done using the following formulas:
(S = Supplementary, H = High surrogate, L = Low surrogate)
S = (H - 0xD800) * 0x0400 + (L - 0xDC00) + 0x10000
H = (S - 0x10000) / 0x0400 + 0xD800
L = (S - 0x10000) mod 0x0400 + 0xDC00
For example, the Old Italic Numeral Five (looks like an inverted ‘V’) has code point 0x10321, which gives the surrogate pair 0xD800 (High) and 0xDF21 (Low).
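The formulas can be tried out with a few lines of Python. This is only a sketch: the function names `to_surrogates` and `from_surrogates` are made up for the illustration, and the division in the formula for H is taken to be integer division.

```python
HIGH_START = 0xD800   # first high surrogate
LOW_START = 0xDC00    # first low surrogate
SUPP_START = 0x10000  # first supplementary code point

def to_surrogates(s):
    """Split a supplementary code point S into its (H, L) surrogate pair."""
    if not 0x10000 <= s <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = s - SUPP_START
    return HIGH_START + offset // 0x0400, LOW_START + offset % 0x0400

def from_surrogates(h, l):
    """Combine a high and a low surrogate back into the code point S."""
    if not (0xD800 <= h <= 0xDBFF and 0xDC00 <= l <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return (h - HIGH_START) * 0x0400 + (l - LOW_START) + SUPP_START

high, low = to_surrogates(0x10321)
print(hex(high), hex(low))               # 0xd800 0xdf21
print(hex(from_surrogates(high, low)))   # 0x10321
```

Each surrogate carries 10 bits of the offset from 0x10000, which is where the figure of just over one million extra code points (1,024 x 1,024) comes from.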
From this it can be seen that the surrogate pairs add over one million more characters to the Unicode code set, all above 0x10000. Typically though, even East Asian texts contain less than 1% of their characters as surrogate pairs. Windows XP supports surrogate pairs and Java 1.4 will also.
As mentioned previously, Unicode V3.1 is the first of the Unicodes to describe characters of more than 16 bits, using the surrogate pairs described above. In fact, it includes an additional 44,946 encoded characters!
These characters are encoded outside of the BMP (with code points of 0x10000 and above), as follows:
Supplementary Multilingual Plane (SMP) - 0x10000..0x1FFFF
Supplementary Ideographic Plane (SIP) - 0x20000..0x2FFFF
Supplementary Special Purpose Plane (SSP) - 0xE0000..0xEFFFF
SMP contains some historic scripts and more symbols, mainly mathematical and musical.
SIP contains a very large collection of Han ideographs.
SSP contains a set of tag characters.
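Since each plane holds 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, with `plane_of` and `plane_name` as hypothetical helper names and only the planes mentioned in this article given a label:

```python
# Planes named in this article; every other plane is labelled "other".
PLANE_NAMES = {0x0: "BMP", 0x1: "SMP", 0x2: "SIP", 0xE: "SSP"}

def plane_of(code_point):
    """Return the plane number (0 to 16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16

def plane_name(code_point):
    """Return the short plane name used in this article, or 'other'."""
    return PLANE_NAMES.get(plane_of(code_point), "other")

for cp in (0x0041, 0x10321, 0x20000, 0xE0001):
    print(hex(cp), plane_of(cp), plane_name(cp))
```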
To put things in a kind of perspective, Unicode V3.1 describes 94,140 encoded characters, of which 70,207 are Han ideographs.
Given a 16 bit character set, the simplest way to store its characters on a disk or send them down the wire would be as 16 bit values. This straightforward method of encoding is called UTF-16 (for Unicode Transformation Format). Each character code 0xFFFF and below is stored as a single 16 bit value. Those values 0x10000 and above are represented using the surrogate pairs. Here is a typical representation of the string "Der Löwen" (with the low byte of each 16 bit value first):
D     e     r           L     ö     w     e     n
44 00 65 00 72 00 20 00 4C 00 F6 00 77 00 65 00 6E 00
These are all ‘normal’ Latin characters, except for the o-umlaut, which has the code point 0xF6, greater than the maximum 7-bit ASCII character value 0x7F.
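The byte sequence above can be reproduced with Python's built-in codecs. In this sketch `hex_dump` is a throwaway helper, and the "utf-16-le" codec is chosen because the dump shows the low byte of each value first.

```python
def hex_dump(data):
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)

text = "Der L\u00f6wen"                 # \u00f6 is the o-umlaut
utf16_le = text.encode("utf-16-le")     # low byte first, no byte order mark

print(hex_dump(utf16_le))
# 44 00 65 00 72 00 20 00 4c 00 f6 00 77 00 65 00 6e 00

# A supplementary character is written as a surrogate pair (two 16 bit values).
old_italic_five = "\U00010321"
print(hex_dump(old_italic_five.encode("utf-16-le")))
# 00 d8 21 df
```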
Clearly, using a 16-bit character set and encoding is an excellent way to store all the world’s languages, and a few non-languages as well. It has two major drawbacks though. Firstly, using 16 bits per character instead of the earlier 8 bits will double the size of a text file, and given that 90% of the text in the world (at least on the Internet) can be easily handled with 8 bits, it would seem a bit wasteful to double the size of all files. Secondly, old legacy files cannot work with a 16-bit application unless converted to 16 bit Unicode.
For this reason another character encoding is also defined, UTF-8. UTF-8 uses sequences of 8 bit values to store Unicode characters. All characters up to 0x007F are stored in a single 8 bit value, characters between 0x0080 and 0x07FF are stored in 16 bits (two bytes), those between 0x0800 and 0xFFFF are stored in 24 bits (three bytes) and those at 0x10000 and above are stored in 32 bits (four bytes). See the table below for finer details.
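The four ranges translate directly into a length function. The sketch below (with `utf8_length` as an invented name) checks its answer against what Python's own UTF-8 codec produces for a few sample code points.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 needs for a given code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if code_point <= 0x007F:
        return 1
    if code_point <= 0x07FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

# Compare with the real encoder: "A", o-umlaut, euro sign, Old Italic five.
for cp in (0x0041, 0x00F6, 0x20AC, 0x10321):
    actual = len(chr(cp).encode("utf-8"))
    print(hex(cp), utf8_length(cp), actual)
    assert utf8_length(cp) == actual
```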
UTF-8 encoding solves both the problems mentioned above: existing ASCII files do not need to change their encoding, because the code points up to 0x007F are stored as the same single bytes as before. Here is "Der Löwen" again:
D  e  r     L  ö     w  e  n
44 65 72 20 4c c3 b6 77 65 6e
Note the "c3 b6" value that represents the o-umlaut character.
UTF-8 encoding will keep the size of current ASCII files the same; files that contain some extended ASCII values will increase in size proportionately.
There is also a character encoding called UTF-32, which as you can guess is a 4 byte (32 bit) representation of the character codes. It is not, as far as I know, in general use and we will ignore it for the rest of the article. For characters in the BMP it is the same as UTF-16 with two extra bytes set to 0x00, and because every code point fits into a single 32 bit value it has no need for surrogate pairs.
So which encoding is the best to use? It depends on what characters the source file contains. Files with a lot of ASCII will be better off in UTF-8. If they contain a lot of extended ASCII they may double in size and if they contain a lot of non-Latin extended characters, the file could end up three or four times larger. A UTF-16 encoded file will always be double the size unless it consists mainly of surrogate pairs (an unlikely occurrence at present) in which case it will be up to four times the size.
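A quick way to get a feel for these trade-offs is to encode a few sample strings and compare byte counts. The samples below are arbitrary choices for the illustration, written with escapes so the source itself stays ASCII, and `encoded_sizes` is a hypothetical helper.

```python
def encoded_sizes(text):
    """Byte counts for the same text in three Unicode encodings (no BOM)."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}

samples = {
    "ASCII only": "Der Lowen",
    "one Latin-1 character": "Der L\u00f6wen",
    "Greek": "\u03bb\u03ad\u03c9\u03bd",
    "CJK ideographs": "\u7345\u5b50",
    "supplementary": "\U00010321",
}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

For the ASCII sample UTF-8 needs half the bytes of UTF-16, for the Greek sample the two are equal, and for the CJK sample UTF-16 comes out smaller.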
There are, of course, many other character encodings in general use, the most common on Windows platforms being Windows 1252. Many programs assume that 1252 is the same as ISO-8859-1 but it is not: 1252 defines printable characters (such as the euro sign and the curly quotes) in the range 0x80 to 0x9F, which ISO-8859-1 reserves for control codes. If there is a possibility that data will be used on other platforms, make sure that the program is really saving in ISO-8859-1 format and not Windows 1252. Macintosh users will probably be familiar with MacRoman encoding. Many Eastern languages use Shift-JIS or Big5 encodings. However, most XML parsers will not understand these encodings.
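The difference between Windows 1252 and ISO-8859-1 is easy to demonstrate by decoding the same bytes both ways. This sketch uses Python's "cp1252" and "iso-8859-1" codecs; the three byte values are just examples picked from the 0x80 to 0x9F range.

```python
raw = bytes([0x80, 0x93, 0x94])

as_cp1252 = raw.decode("cp1252")       # euro sign and curly double quotes
as_latin1 = raw.decode("iso-8859-1")   # three invisible control characters

print([hex(ord(c)) for c in as_cp1252])   # ['0x20ac', '0x201c', '0x201d']
print([hex(ord(c)) for c in as_latin1])   # ['0x80', '0x93', '0x94']

# From 0xA0 upwards the two encodings agree byte for byte.
upper = bytes(range(0xA0, 0x100))
assert upper.decode("cp1252") == upper.decode("iso-8859-1")
```

A file saved as Windows 1252 but labelled ISO-8859-1 will therefore parse without complaint and silently turn its quotes and euro signs into control characters.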
A parser complying with the XML specification must, at the least, understand UTF-8 and UTF-16 encodings. Most parsers will understand other encodings as well. Expat will understand UTF-8, UTF-16, ISO-8859-1 and US-ASCII ‘out of the box’ and can be extended to other formats. Xerces-C++ will understand the above encodings and adds UCS-4 (32 bit values), EBCDIC (code pages IBM037 and IBM1140) and Windows-1252. The IBM parser, XML4C (based on Xerces), understands a further 15 encodings.
Most programmers are familiar with the BigEndian/LittleEndian differences in microprocessors. The same differences exist in Unicode encodings, specifically with UTF-16, which can be BigEndian or LittleEndian in the same manner as microprocessor instructions can. The character "e" (0x65) would be represented as 0x65 0x00 in LittleEndian format and 0x00 0x65 in BigEndian format. To inform the parser which format the file is in, the file starts with a byte order mark (BOM).
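Both byte orders, and the way a byte order mark resolves the ambiguity, can be seen in a short Python sketch. The `decode_with_bom` function is a made-up helper; the BOM constants come from the standard `codecs` module.

```python
import codecs

e_le = "e".encode("utf-16-le")
e_be = "e".encode("utf-16-be")
print(e_le.hex(), e_be.hex())          # 6500 0065

def decode_with_bom(data):
    """Decode UTF-16 data, using the leading BOM to pick the byte order."""
    if data.startswith(codecs.BOM_UTF16_LE):      # FF FE
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):      # FE FF
        return data[2:].decode("utf-16-be")
    raise ValueError("no UTF-16 byte order mark found")

print(decode_with_bom(b"\xff\xfe\x65\x00"))   # e
print(decode_with_bom(b"\xfe\xff\x00\x65"))   # e
```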
Most XML files start with an XML declaration:
<?xml version="1.0" encoding="something"?>
If the encoding is missing (and there is no byte order mark to say otherwise), then UTF-8 is assumed.
The question naturally arises, how does the parser start reading the file to reach the encoding part of the header? If the above section of code is in UTF-8, then the encoding part starts at position 20; if it is in UTF-16, it will be at position 40. If the file is BigEndian it needs to be read differently than if it is LittleEndian. It’s a chicken and egg problem, so the parser starts a file with a little testing of its own, like so:
If the first two bytes are ‘3C 3F’ ("<?") then standard UTF-8 encoding is assumed
If the first three bytes are ‘EF BB BF’ (BOM/UTF-8) then standard UTF-8 encoding is assumed
If the first two bytes are "FF FE" (BOM/Little) then UTF-16 LittleEndian is assumed
If the first two bytes are "FE FF" (BOM/Big) then UTF-16 BigEndian is assumed.
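If the first four bytes are "FF FE 00 00" (BOM/Little) then UTF-32 LittleEndian is assumed (this test has to be made before the two byte "FF FE" test, which the same bytes would also satisfy)
If the first four bytes are "00 00 FE FF" (BOM/Big) then UTF-32 BigEndian is assumed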
Perform one or two other checks (EBCDIC encoding for example).
"FF FE" or "FE FF" are called the byte order mark and indicate that the file is in UTF-16 format and whether it is a LittleEndian file or a BigEndian file. In the Unicode character set, 0xFEFF represents a ‘zero width no-break space’ so will not affect the printing of the file, and 0xFFFE is guaranteed never to be a character, so a reader that sees it knows that the bytes are swapped.
The actual checks performed will depend on the parser implementation, but it will be something along the lines above. Assuming all goes well, the ‘encoding="GBK"’ (for example) part will be reached and the actual encoding established.
At this point the parser will have to check whether it can support the encoding and either continue or report an error.
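The initial tests might be sketched as follows. This is not how any particular parser does it: `sniff_encoding` is a hypothetical function, it covers only the byte patterns listed above, and it leaves out the EBCDIC and other checks. Note that it uses FF FE 00 00 for the little-endian UTF-32 mark (that is, 0xFEFF written as a 32 bit value, low byte first) and tests the four byte marks before the two byte ones, because FF FE 00 00 also begins with FF FE.

```python
def sniff_encoding(head):
    """Guess the encoding family from the first few bytes of an XML file.

    Returns a provisional name, or None if none of the patterns match.
    The real encoding is only known once the declaration has been read.
    """
    # Four byte marks first: FF FE 00 00 also starts with the UTF-16 LE mark.
    if head.startswith(b"\x00\x00\xfe\xff"):
        return "UTF-32BE"
    if head.startswith(b"\xff\xfe\x00\x00"):
        return "UTF-32LE"
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if head.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if head.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    if head.startswith(b"<?"):
        return "UTF-8"   # provisional: any ASCII-compatible encoding fits
    return None

print(sniff_encoding(b'<?xml version="1.0" encoding="GBK"?>'))   # UTF-8
print(sniff_encoding(b"\xff\xfe" + "<?xml".encode("utf-16-le")))  # UTF-16LE
```

With the family established, the parser knows how wide each character is and in which byte order to read it, which is enough to reach the encoding name in the declaration.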
Throughout the parsing process the parser will be reading characters and checking for particular codes or combinations. In UTF-16 the process is reasonably straightforward: every 2 byte value is a character and can be dealt with as such, with the exception of values between 0xD800 and 0xDFFF. A value between 0xD800 and 0xDBFF indicates the start of a surrogate pair and must be followed by a value between 0xDC00 and 0xDFFF; if the two values do not form a valid pair, the parser will indicate an error. Any further action the parser takes, e.g. converting the text into another format, is dependent on the parser used.
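A minimal sketch of that loop, working on a list of 16 bit values that have already been read in the right byte order (the name `decode_utf16_units` is invented for the example, and a real parser would report a well-formedness error rather than raise a Python exception):

```python
def decode_utf16_units(units):
    """Turn 16 bit code units into code points, rejecting broken pairs."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("high surrogate without a low surrogate")
            low = units[i + 1]
            code_points.append((unit - 0xD800) * 0x0400 + (low - 0xDC00) + 0x10000)
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("low surrogate without a high surrogate")
        else:
            code_points.append(unit)   # an ordinary BMP character
            i += 1
    return code_points

print([hex(cp) for cp in decode_utf16_units([0x0044, 0xD800, 0xDF21])])
# ['0x44', '0x10321']
```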
In UTF-8, the situation is a little more complicated as the width of the characters vary. The following table will help to understand how the parser deals with the 8 bit values it reads.
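|Code Points||1st Byte||2nd Byte||3rd Byte||4th Byte|
|0x0000 - 0x007F||00 - 7F||-||-||-|
|0x0080 - 0x07FF||C2 - DF||80 - BF||-||-|
|0x0800 - 0x0FFF||E0||A0 - BF||80 - BF||-|
|0x1000 - 0xFFFF||E1 - EF||80 - BF||80 - BF||-|
|0x10000 - 0x3FFFF||F0||90 - BF||80 - BF||80 - BF|
|0x40000 - 0xFFFFF||F1 - F3||80 - BF||80 - BF||80 - BF|
|0x100000 - 0x10FFFF||F4||80 - 8F||80 - BF||80 - BF|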
If a byte is below 0x80, it is a character. If it is between 0xC2 and 0xDF, then fetch another byte, which must be between 0x80 and 0xBF. And so on.
From this it is clear that there are quite a number of illegal sequences, for instance 0x80 to 0xC1 cannot be a first byte. It has been cleverly arranged that a reader can ‘drop in’ on a byte stream and know which part of a character sequence it is looking at.
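The ‘drop in’ property follows from the fact that the ranges for first bytes and for following bytes do not overlap, so a single byte is enough to tell which is which. A sketch, with `classify` and `next_char_start` as invented names and the first byte ranges taken from the usual definition of well-formed UTF-8:

```python
def classify(byte):
    """Say what role a single byte plays in a UTF-8 stream."""
    if byte <= 0x7F:
        return "single"          # a complete one byte character
    if byte <= 0xBF:
        return "continuation"    # 0x80..0xBF: 2nd, 3rd or 4th byte
    if 0xC2 <= byte <= 0xDF:
        return "lead2"           # first byte of a two byte sequence
    if 0xE0 <= byte <= 0xEF:
        return "lead3"           # first byte of a three byte sequence
    if 0xF0 <= byte <= 0xF4:
        return "lead4"           # first byte of a four byte sequence
    return "illegal"             # 0xC0, 0xC1 and 0xF5..0xFF never appear

def next_char_start(data, pos):
    """From any offset, find the index where the next character begins."""
    while pos < len(data) and classify(data[pos]) == "continuation":
        pos += 1
    return pos

data = "L\u00f6we".encode("utf-8")      # 4c c3 b6 77 65
print(classify(data[1]), classify(data[2]))   # lead2 continuation
print(next_char_start(data, 2))               # 3
```

Dropping in at offset 2 lands in the middle of the o-umlaut; the reader skips the continuation byte and resumes cleanly at the "w".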
Whatever the format of the file being parsed, internally the parser will be using a single encoding of its own (UTF-8 in the case of Expat, 16 bit values in the case of Xerces), so the programmer will need to take care of converting it into something useful for the application, like displaying in the GUI or converting to a text file.
We’ll wrap up this introduction with a short mention of C and C++ (this being the ACCU journal). It seems like a natural match to use wchar_t to store Unicode characters in a program, but it’s not, the reason being that wchar_t can be different sizes on different architectures. In Windows NT it is 16 bits, under Linux it is 32 bits. It could even be 8 bits on some architectures. How to program with Unicode in a portable manner is a complex subject that we will revisit in a further article. For now, it’s enough to say that for portability it’s best to specify either unsigned short for 16 bit Unicode, or unsigned long (which is at least 32 bits wide) for 32 bit Unicode. The Xerces parser has an XMLCh type (a typedef for unsigned short) that is defined for the compiler being used; Expat uses XML_Char (defined as a char).
I hope this article has given a sufficient background to Unicode and its use in XML. We’ll continue the series by getting back to simpler stuff like C++ programming, using an XML parser to read in files. But it’s important to understand the various encodings that need to be dealt with.
XML Internationalisation and Localisation by Yves Savourel (SAMS publishing)
(excellent guide to the various issues with using Unicode. Although aimed at XML users, it also has pertinent information for anyone translating programs).
For all the information you could ever want about internationalization, character sets, encodings and glyphs, pay a visit to the Unicode Consortium website.
Latest version of the Unicode standard:
The I18N Gurus page
Open directory of links to internationalization (i18n) resources and related material.
The very excellent piece by Tom Jennings:
Annotated history of character codes, which I borrowed heavily from in the Introduction.
Another from Joel Spolsky.
International Character Codes overview (from 1995):
The following RFCs are of interest in working with Unicode:
RFC 2781 - UTF-16, an encoding of ISO 10646
RFC 2279 - UTF-8, a transformation format of ISO 10646
RFC 2152 - UTF-7 A Mail-Safe Transformation Format of Unicode