The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests


XML and HTML escapes

Posted by Jeroen Pluimers on 2012/07/26

While reviewing a client’s code, I noticed they were generating and parsing XML and HTML by hand (do not ever do that yourself!).

Before refactoring this into something that uses libraries that properly understand XML and HTML, I needed to assess some of the risks.

A major risk is getting the escaping (and unescaping) of XML and HTML wrong.
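To make that risk concrete, here is a minimal Python sketch (Python is my choice for illustration; the client code was not Python, and the `note` element and `title` value are made up). It compares string concatenation with letting a library serialize the same attribute value.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute value containing characters that need escaping.
title = 'Tom & "Jerry" <cartoon>'

# Hand-built XML: the raw value is pasted in unescaped,
# so the result is not well-formed and the parser rejects it.
handmade = '<note title="' + title + '"/>'
try:
    ET.fromstring(handmade)
    handmade_ok = True
except ET.ParseError:
    handmade_ok = False

# Library-built XML: ElementTree escapes the value when serializing.
note = ET.Element("note")
note.set("title", title)
generated = ET.tostring(note, encoding="unicode")

# Parsing the generated text gives back the original value.
round_tripped = ET.fromstring(generated).get("title")

print(handmade_ok)
print(generated)
print(round_tripped == title)
```

The point of the sketch is only that the library version survives a round trip while the concatenated version does not even parse.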

Time to finally organize some of my links on escaping HTML and XML that I had in my favourites list.

The starting point is the List of XML and HTML character entity references on Wikipedia. It is readable, complete and lists both kinds of escapes.

XML escapes

The official W3C text that describes XML escaping is hard to read.

There are only 5 predefined XML entities for characters that can (some must) be escaped. This table is derived from the Wikipedia article.

| Name | Character | Unicode code point (decimal) | Standard | When to escape (from the XML 1.0 standard) | Description |
|------|-----------|------------------------------|----------|--------------------------------------------|-------------|
| quot | " | U+0022 (34) | XML 1.0 | To allow attribute values to contain both single and double quotes | double quotation mark |
| amp | & | U+0026 (38) | XML 1.0 | Outside a comment, a processing instruction, or a CDATA section | ampersand |
| apos | ' | U+0027 (39) | XML 1.0 | To allow attribute values to contain both single and double quotes | apostrophe (= apostrophe-quote) |
| lt | < | U+003C (60) | XML 1.0 | Outside a comment, a processing instruction, or a CDATA section | less-than sign |
| gt | > | U+003E (62) | XML 1.0 | When it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section | greater-than sign |
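As a rough illustration of the table, this Python sketch uses the standard library's `xml.sax.saxutils`. Its `escape()` always handles `&`, `<` and `>`; the two quote entities have to be passed in explicitly. The sample text and the `QUOTES` / `UNQUOTES` names are mine.

```python
from xml.sax.saxutils import escape, quoteattr, unescape

# escape() always handles &, < and >; the two quote entities are opt-in.
QUOTES = {'"': "&quot;", "'": "&apos;"}
UNQUOTES = {"&quot;": '"', "&apos;": "'"}

text = """He said "1 < 2 & 3 > 2", didn't he?"""

content = escape(text)               # enough for element content
full = escape(text, QUOTES)          # all five predefined entities
restored = unescape(full, UNQUOTES)  # back to the original text

# quoteattr() escapes a value and wraps it in whichever quote style fits.
attribute = quoteattr('say "hi"')

print(content)
print(full)
print(restored == text)
print(attribute)
```

Escaping only what the context requires (`content`) and escaping all five (`full`) are both valid XML; the second is simply the safer default when you do not know whether the text ends up in an attribute.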

HTML escapes

There are many escapes (252 as of the HTML 4 DTD and 253 when you are using XHTML), mainly because of character set encoding issues (basically caused by software that sucks^H^H^H^H^Hfails to properly render or transfer textual data, etc.).
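If you want to inspect these named escapes yourself, Python's standard `html.entities` module ships both the HTML 4 table (`name2codepoint`) and the much larger HTML 5 table (`html5`). A small sketch; I am assuming the HTML 4 table matches the count quoted above, so run it rather than take my word for it.

```python
from html.entities import html5, name2codepoint

# Size of the HTML 4 named entity table shipped with Python.
print(len(name2codepoint))

# The four escapes that HTML itself needs are in the HTML 4 table...
for name in ("quot", "amp", "lt", "gt"):
    print(name, name2codepoint[name])

# ...but apos is not: it only arrived with XML/XHTML (and later HTML 5).
print("apos" in name2codepoint)
print("apos;" in html5)
```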

Only a few characters are required to be escaped, as per the W3C lists for HTML 5 named characters, XHTML 1 special characters, HTML 4 entities, HTML 3 entities, HTML 2 entities, and the (non-W3C) list of HTML 4 and XHTML 1 entities:

  • 4 when you use HTML (quot, amp, lt and gt)
  • 5 when you use XHTML (the ones for XML: the HTML ones plus apos)

This is the complete list:

| Name | Character | Unicode code point (decimal) | Standard | DTD | Old ISO subset | Description |
|------|-----------|------------------------------|----------|-----|----------------|-------------|
| quot | " | U+0022 (34) | HTML 2.0 | HTMLspecial | ISOnum | quotation mark (= APL quote) |
| amp | & | U+0026 (38) | HTML 2.0 | HTMLspecial | ISOnum | ampersand |
| apos | ' | U+0027 (39) | XHTML 1.0 | HTMLspecial | ISOnum | apostrophe (= apostrophe-quote); see below |
| lt | < | U+003C (60) | HTML 2.0 | HTMLspecial | ISOnum | less-than sign |
| gt | > | U+003E (62) | HTML 2.0 | HTMLspecial | ISOnum | greater-than sign |

Note it is best to escape the apostrophe as &#39; rather than &apos; for backward compatibility (when emitting XHTML, be aware of other backward compatibility issues too).
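A minimal Python sketch of the note above: the standard `html.escape()` writes the apostrophe as the hexadecimal reference `&#x27;`, so the hypothetical helper `escape_html` below swaps that for the decimal form. The sample markup is made up.

```python
import html


def escape_html(text: str) -> str:
    """Escape the five special characters, writing the apostrophe as a decimal reference."""
    # html.escape() emits the hexadecimal form &#x27; for the apostrophe;
    # swap it for the decimal form. A literal "&#x27;" in the input is safe,
    # because its ampersand has already become &amp; at this point.
    return html.escape(text, quote=True).replace("&#x27;", "&#39;")


sample = """<a href='x'>Tom & "Jerry"</a>"""
escaped = escape_html(sample)

print(escaped)
# Unescaping accepts the named, decimal, and hexadecimal forms alike.
print(html.unescape(escaped) == sample)
print(html.unescape("&apos; &#39; &#039; &#x27;"))
```

On the reading side all spellings of the apostrophe decode to the same character, which is why the choice only matters for what old consumers of your output understand.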

Handling XML and HTML escapes

Not all software escapes HTML or XML the same way. WordPress, for example, uses two kinds of escaping for the HTML predefined entities.

In their categories tree view, they escape all of them using their names, except for the apostrophe ('), which they escape using a numeric character reference with a leading zero (&#039;), for backward compatibility as per the XHTML recommendation on &apos;.

But in their categories drop-down combo box, they wrongly escape the double quote as ” (code point 8221) and the apostrophe as ‘ (code point 8216). The others are OK: they are escaped using their names.

The reason is that their categories editor tries to outsmart you and behind your back makes these replacements (they even confuse left and right!):

  • the double quote, Unicode character 34 (the " quot quotation mark, U+0022), with Unicode character 8221 (the ” rdquo right double quotation mark, U+201D)
  • the apostrophe, Unicode character 39 (the ' apos apostrophe, U+0027), with Unicode character 8216 (the ‘ lsquo left single quotation mark, U+2018)
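When you have to consume text that went through such a replacement, you can map the two characters back. A small Python sketch; the category name and the decimal references in it are invented for illustration, and `undo_smart_quotes` is a hypothetical helper, not something WordPress provides.

```python
import html

# The two replacements listed above, keyed by the substituted character.
SMART_TO_ASCII = {
    "\u201d": '"',  # right double quotation mark (8221) -> quotation mark (34)
    "\u2018": "'",  # left single quotation mark (8216) -> apostrophe (39)
}


def undo_smart_quotes(text: str) -> str:
    """Map the substituted quote characters back to their ASCII originals."""
    return text.translate(str.maketrans(SMART_TO_ASCII))


# Hypothetical category name as it might come out of the drop-down.
category = html.unescape("Jeroen&#8216;s &#8221;stuff&#8221;")
fixed = undo_smart_quotes(category)

print([ord(c) for c in category if ord(c) > 127])
print(fixed)
```

This only undoes the two substitutions described here; text that legitimately contains typographic quotes would be flattened too, so apply it only where you know the source did the replacement behind your back.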

Of course WordPress never reacted to my tweets about this.

–jeroen
