The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

Default XML encoding is UTF-8 (or better: utf-8). If it contains other byte sequences, this is an error.

Posted by jpluimers on 2021/01/21

I should have had the below answer when writing about StUF – receiving data from a provider where UTF-8 is in fact ISO-8859.

A while ago, a co-worker did not believe me when I said that the default XML encoding really is UTF-8 (and tried to force it to utf-8), and that it is a problem when the content has byte sequences that are not valid in the (either specified or default) encoding.

I thought I had blogged about the default, and where to find it, but apparently I had not.

My blog had (and has <g>) a truckload of articles mentioning UTF-8, and fewer articles containing UTF-8, encoding and xml, but the ones having UTF-8, default, encoding and xml did not actually point to a standard that really defines that XML uses UTF-8 as the default encoding when no other encoding information (like a BOM (byte order mark), HTTP, or MIME encoding) is available.

W3C indeed specifies it. [WayBack] utf 8 – How default is the default encoding (UTF-8) in the XML Declaration? – Stack Overflow has a summary (thanks James Holderness!):

The Short Answer

Under the very specific circumstances of a UTF-8 encoded document with no external encoding information (which I understand from the comments is what you’re interested in), there is no difference between the two declarations.

The long answer is far more interesting though.

and an elaboration:

What The Spec Says

If you look at Appendix F1 of the XML specification, that explains the process that should be followed to determine the encoding when there is no external encoding information.

If the document is encoded as one of the UTF variants, the parser should be able to detect the encoding within the first 4 bytes, either from the Byte Order Mark, or the start of the XML declaration.

However, according to the spec, it should still read the encoding declaration.

In cases above which do not require reading the encoding declaration to determine the encoding, section 4.3.3 still requires that the encoding declaration, if present, be read and that the encoding name be checked to match the actual encoding of the entity.

If they don’t match, according to section 4.3.3:

…it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration

Encoded UTF-16, Declared UTF-8

Let’s see what happens in reality when we create an XML document encoded as UTF-16 but with the encoding declaration set to UTF-8.

Opera, Firefox and Chrome all interpret the document as UTF-16, ignoring the encoding declaration. Internet Explorer (version 9 at least), displays a blank document, but no actual error.

So if you include a UTF-8 encoding declaration on your UTF-8 document and someone at a later stage converts it to UTF-16, it’ll work in most browsers, but fail in IE (and, I assume, most Microsoft XML APIs). If you had left the encoding declaration off, you would have been fine.

Technically I think IE is the most accurate. The fact that it doesn’t display an error as such might be explained by the fact that the error is occurring at the encoding level rather than the XML level. It is assumedly doing its best to interpret the UTF-16 characters as UTF-8, failing to find any characters that decode, and ending up passing on an empty character sequence to the XML parser.
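
A quick way to see how a non-browser parser treats this case is to try it. The sketch below uses Python's standard xml.etree.ElementTree, which sits on top of Expat; the sample document and the helper name parses are my own, not from the answer. On the Python builds I know of, the mismatched document is rejected with a parse error instead of being silently reinterpreted, which matches the fatal error in section 4.3.3, but treat that as an observation about one parser and not as a guarantee for others.

```python
import xml.etree.ElementTree as ET

# The declaration claims UTF-8.
doc = '<?xml version="1.0" encoding="UTF-8"?><root>caf\u00e9</root>'

# Same text, but stored as UTF-16 (Python's "utf-16" codec writes a BOM first),
# so the declaration now contradicts the actual bytes.
mismatched = doc.encode("utf-16")


def parses(data: bytes) -> bool:
    """Return True if ElementTree accepts the bytes, False on a parse error."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError as exc:
        print("rejected:", exc)
        return False


print("UTF-8 bytes, UTF-8 declared: ", parses(doc.encode("utf-8")))
print("UTF-16 bytes, UTF-8 declared:", parses(mismatched))
```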

Encoded UTF-8, Declared Otherwise

You might now think that Firefox, Chrome and Opera are just ignoring the encoding declaration altogether, but that’s not always the case.

If you encode a document as UTF-8 (with a byte order marker so it’s unmistakable as anything else), but set the encoding declaration to Latin1, all of the browsers will successfully decode the content as Latin1, ignoring the UTF-8 BOM.

Again this seems right to me. The fact that the BOM characters aren’t valid in Latin1 just means they are silently dropped at the character decoding level.

This doesn’t work for all declared encodings on a UTF-8 document though. If the declared encoding is UTF-16, we’re back with Opera, Firefox and Chrome ignoring the declared encoding, while Internet Explorer returns a blank document.

Essentially, anything that makes IE return a blank document is going to make other browsers ignore the declared encoding.

Other Inconsistencies

It’s also worth mentioning the importance of the Byte Order Mark. According to section 4.3.3 of the spec:

Entities encoded in UTF-16 MUST […] begin with the Byte Order Mark

However, if you try and read a UTF-16 encoded XML document without a BOM, most browsers will nevertheless accept it as valid. Only Firefox reports it as an XML Parsing Error.

External Encoding Information

Up to now, we’ve been considering what happens when there is no external encoding information, but, as others have mentioned, if the document is received via HTTP or enclosed in a MIME envelope of some sort, the encoding information from those sources should take preference over the document encoding.

Most of the details for the various XML MIME types are described in RFC3023. However, the reality is somewhat different from what is specified.

First of all, text/xml with an omitted charset parameter should use a charset of US-ASCII, but that requirement has almost always been ignored. Browsers will typically use the value of the XML encoding declaration, or default to UTF-8 if there is none.

Second, if there is a UTF-8 BOM on the document, and the XML encoding declaration is either UTF-8 or not included, the document will be interpreted as UTF-8, regardless of the charset used in the Content-Type.

The only time the encoding from the Content-Type seems to take precedence is when there is no BOM and an explicit charset is specified in the Content-Type.

In any event, there are no cases (involving Content-Type) where including a UTF-8 XML encoding declaration on a UTF-8 document is any different from not having an encoding declaration at all.
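
To make "external encoding information" a bit more concrete: the charset in question is a parameter of the Content-Type header. A minimal sketch of pulling it out with Python's standard email.message module follows; the function name charset_from_content_type is hypothetical, and what to do when it returns None (fall back to the BOM, the encoding declaration, or the UTF-8 default) is left to the caller, since the answer above shows that real software does not agree on the precedence.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when there is no charset parameter.
    return msg.get_content_charset()


print(charset_from_content_type("text/xml; charset=ISO-8859-1"))
print(charset_from_content_type("application/xml"))
```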

 

It also pointed me to

4.3.3 Character Encoding in Entities

Each external parsed entity in an XML document may use a different encoding for its characters. All XML processors must be able to read entities in both the UTF-8 and UTF-16 encodings. The terms “UTF-8” and “UTF-16” in this specification do not apply to related character encodings, including but not limited to UTF-16BE, UTF-16LE, or CESU-8.

Entities encoded in UTF-16 must and entities encoded in UTF-8 may begin with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000], section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.

If the replacement text of an external entity is to begin with the character U+FEFF, and no text declaration is present, then a Byte Order Mark MUST be present, whether the entity is encoded in UTF-8 or UTF-16.

Although an XML processor is required to read only entities in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read entities that use them. In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 must begin with a text declaration (see 4.3.1 The Text Declaration) containing an encoding declaration:

Encoding Declaration
[80] EncodingDecl    ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )
[81] EncName    ::= [A-Za-z] ([A-Za-z0-9._] | '-')* /* Encoding name contains only Latin characters */

In the document entity, the encoding declaration is part of the XML declaration. The EncName is the name of the encoding used.
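
Production [81] translates almost literally into a regular expression. The sketch below is my own illustration (the names ENC_NAME and is_valid_enc_name are made up); it only checks the syntax of a name, not whether any processor actually supports that encoding.

```python
import re

# [81] EncName: one Latin letter, then any number of letters, digits, '.', '_' or '-'.
ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_valid_enc_name(name: str) -> bool:
    """Return True if the name matches the EncName production."""
    return ENC_NAME.fullmatch(name) is not None


for candidate in ("UTF-8", "utf-8", "ISO-8859-1", "Shift_JIS", "8859-1", "UTF 8"):
    print(candidate, is_valid_enc_name(candidate))
```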

In an encoding declaration, the values “UTF-8”, “UTF-16”, “ISO-10646-UCS-2”, and “ISO-10646-UCS-4” should be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values “ISO-8859-1”, “ISO-8859-2”, … “ISO-8859-n” (where n is the part number) should be used for the parts of ISO 8859, and the values “ISO-2022-JP”, “Shift_JIS”, and “EUC-JP” should be used for the various encoded forms of JIS X-0208-1997. It is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings should use names starting with an “x-” prefix. XML processors should match character encoding names in a case-insensitive way and should either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings).

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

It is a fatal error for a TextDecl to occur other than at the beginning of an external entity.

It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding. Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode [Unicode]. Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.
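
This is exactly the situation from the StUF story: ISO-8859 bytes in a document that, by default, has to be UTF-8. A small sketch with Python's standard xml.etree.ElementTree shows both sides; the sample element and variable names are mine. Without a declaration the lone E9 byte is not legal UTF-8 and the parser refuses the document; with a matching ISO-8859-1 declaration the same bytes are fine.

```python
import xml.etree.ElementTree as ET

# "é" becomes the single byte E9 in ISO-8859-1, which is not a legal UTF-8 sequence here.
latin1_bytes = "<name>Jos\u00e9</name>".encode("iso-8859-1")

# No BOM, no declaration: the document must be UTF-8, so this has to fail.
try:
    ET.fromstring(latin1_bytes)
    accepted = True
except ET.ParseError as exc:
    accepted = False
    print("rejected:", exc)

# Same bytes, now honestly labeled.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_bytes
text = ET.fromstring(declared).text
print(accepted, text)
```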

Examples of text declarations containing encoding declarations:

<?xml encoding='UTF-8'?>
<?xml encoding='EUC-JP'?>

F.1 Detection Without External Encoding Information

Because each XML entity not accompanied by external encoding information and not in UTF-8 or UTF-16 encoding must begin with an XML encoding declaration, in which the first characters must be ‘<?xml’, any conforming processor can detect, after two to four octets of input, which of the following cases apply. In reading this list, it may help to know that in UCS-4, ‘<’ is “#x0000003C” and ‘?’ is “#x0000003F”, and the Byte Order Mark required of UTF-16 data streams is “#xFEFF”. The notation ## is used to denote any byte value except that two consecutive ##s cannot be both 00.

With a Byte Order Mark:

00 00 FE FF UCS-4, big-endian machine (1234 order)
FF FE 00 00 UCS-4, little-endian machine (4321 order)
00 00 FF FE UCS-4, unusual octet order (2143)
FE FF 00 00 UCS-4, unusual octet order (3412)
FE FF ## ## UTF-16, big-endian
FF FE ## ## UTF-16, little-endian
EF BB BF UTF-8

Without a Byte Order Mark:

00 00 00 3C / 3C 00 00 00 / 00 00 3C 00 / 00 3C 00 00 UCS-4 or other encoding with a 32-bit code unit and ASCII characters encoded as ASCII values, in respectively big-endian (1234), little-endian (4321) and two unusual byte orders (2143 and 3412). The encoding declaration must be read to determine which of UCS-4 or other supported 32-bit encodings applies.
00 3C 00 3F UTF-16BE or big-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in big-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which)
3C 00 3F 00 UTF-16LE or little-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in little-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which)
3C 3F 78 6D UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding which ensures that the characters of ASCII have their normal positions, width, and values; the actual encoding declaration must be read to detect which of these applies, but since all of these encodings use the same bit patterns for the relevant ASCII characters, the encoding declaration itself may be read reliably
4C 6F A7 94 EBCDIC (in some flavor; the full encoding declaration must be read to tell which code page is in use)
Other UTF-8 without an encoding declaration, or else the data stream is mislabeled (lacking a required encoding declaration), corrupt, fragmentary, or enclosed in a wrapper of some kind
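
The two tables above amount to a prefix lookup. Below is a minimal sketch of that lookup; the function name sniff_xml_encoding and the returned labels are my own shorthand, not spec terminology. The four-octet signatures are tried before the two-octet UTF-16 BOMs, because FF FE 00 00 and FE FF 00 00 would otherwise be mistaken for UTF-16. The result is only the encoding family: the encoding declaration still has to be read afterwards.

```python
# Signatures from Appendix F.1, longest (and therefore least ambiguous) first.
SIGNATURES = [
    # With a Byte Order Mark
    (b"\x00\x00\xfe\xff", "UCS-4 big-endian (BOM)"),
    (b"\xff\xfe\x00\x00", "UCS-4 little-endian (BOM)"),
    (b"\x00\x00\xff\xfe", "UCS-4 order 2143 (BOM)"),
    (b"\xfe\xff\x00\x00", "UCS-4 order 3412 (BOM)"),
    (b"\xfe\xff", "UTF-16 big-endian (BOM)"),
    (b"\xff\xfe", "UTF-16 little-endian (BOM)"),
    (b"\xef\xbb\xbf", "UTF-8 (BOM)"),
    # Without a Byte Order Mark: the octets of '<?xm' or '<?' or '<'
    (b"\x00\x00\x00\x3c", "32-bit big-endian"),
    (b"\x3c\x00\x00\x00", "32-bit little-endian"),
    (b"\x00\x00\x3c\x00", "32-bit order 2143"),
    (b"\x00\x3c\x00\x00", "32-bit order 3412"),
    (b"\x00\x3c\x00\x3f", "16-bit big-endian"),
    (b"\x3c\x00\x3f\x00", "16-bit little-endian"),
    (b"\x3c\x3f\x78\x6d", "ASCII-compatible"),
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),
]


def sniff_xml_encoding(head: bytes) -> str:
    """Guess the encoding family from the first octets of an XML entity."""
    for signature, family in SIGNATURES:
        if head.startswith(signature):
            return family
    # The "Other" row of the table.
    return "UTF-8 without declaration, or mislabeled"


print(sniff_xml_encoding(b'<?xml version="1.0"?><a/>'))
print(sniff_xml_encoding("<?xml".encode("utf-16-be")))
print(sniff_xml_encoding(b"<a/>"))
```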

Note:

In cases above which do not require reading the encoding declaration to determine the encoding, section 4.3.3 still requires that the encoding declaration, if present, be read and that the encoding name be checked to match the actual encoding of the entity. Also, it is possible that new character encodings will be invented that will make it necessary to use the encoding declaration to determine the encoding, in cases where this is not required at present.

This level of autodetection is enough to read the XML encoding declaration and parse the character-encoding identifier, which is still necessary to distinguish the individual members of each family of encodings (e.g. to tell UTF-8 from 8859, and the parts of 8859 from each other, or to distinguish the specific EBCDIC code page in use, and so on).

Because the contents of the encoding declaration are restricted to characters from the ASCII repertoire (however encoded), a processor can reliably read the entire encoding declaration as soon as it has detected which family of encodings is in use. Since in practice, all widely used character encodings fall into one of the categories above, the XML encoding declaration allows reasonably reliable in-band labeling of character encodings, even when external sources of information at the operating-system or transport-protocol level are unreliable. Character encodings such as UTF-7 that make overloaded usage of ASCII-valued bytes may fail to be reliably detected.
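
Since the declaration only contains ASCII-repertoire characters, reading it once the family is known is a small job. The sketch below assumes the caller already knows a codec for the family (for example "ascii" for everything ASCII-compatible, or "utf-16-le"); the function name declared_encoding, the 200-byte window and the regular expression are my simplifications and not a full implementation of the XMLDecl grammar.

```python
import re
from typing import Optional

_NAME = r"[A-Za-z][A-Za-z0-9._-]*"  # EncName, production [81]
_DECL = re.compile(rf"""<\?xml[^>]*?\sencoding\s*=\s*(?:"({_NAME})"|'({_NAME})')""")


def declared_encoding(data: bytes, family_codec: str = "ascii") -> Optional[str]:
    """Return the name in the encoding declaration, or None if there is none."""
    # Only the start of the entity matters; undecodable bytes are replaced
    # because they cannot be part of the declaration anyway.
    head = data[:200].decode(family_codec, errors="replace")
    head = head.lstrip("\ufeff")  # drop a decoded Byte Order Mark, if any
    match = _DECL.match(head)
    if match is None:
        return None
    return match.group(1) or match.group(2)


print(declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(declared_encoding("<?xml version='1.0' encoding='UTF-16'?><a/>".encode("utf-16-le"), "utf-16-le"))
print(declared_encoding(b"<a/>"))
```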

Once the processor has detected the character encoding in use, it can act appropriately, whether by invoking a separate input routine for each case, or by calling the proper conversion function on each character of input.

Like any self-labeling system, the XML encoding declaration will not work if any software changes the entity’s character set or encoding without updating the encoding declaration. Implementors of character-encoding routines should be careful to ensure the accuracy of the internal and external information used to label the entity.
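
In other words: whoever transcodes the bytes has to rewrite the label too. A minimal sketch of a transcoding helper that keeps the two in sync is below. The name transcode_xml is hypothetical, and it assumes that the Python codec name passed as dst is also an acceptable XML encoding name (true for names like "iso-8859-1" or "utf-8", not for every Python codec alias). A document without an encoding declaration is left without one.

```python
import re

# Captures everything up to the opening quote (group 1) and the quote itself (group 2).
_DECL_ENC = re.compile(r"""^(<\?xml[^>]*?\sencoding\s*=\s*)(["'])[A-Za-z][A-Za-z0-9._-]*\2""")


def transcode_xml(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode an XML entity and update its encoding declaration to match."""
    text = data.decode(src).lstrip("\ufeff")  # drop a decoded BOM, if any

    def relabel(match: "re.Match[str]") -> str:
        quote = match.group(2)
        return f"{match.group(1)}{quote}{dst}{quote}"

    text = _DECL_ENC.sub(relabel, text, count=1)
    return text.encode(dst)


original = '<?xml version="1.0" encoding="UTF-8"?><a>\u00e9</a>'.encode("utf-8")
print(transcode_xml(original, "utf-8", "iso-8859-1"))
```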

–jeroen
