XML and HTML escapes
Posted by jpluimers on 2012/07/26
While reviewing some client’s code, I noticed they were generating and parsing XML and HTML by hand (do not ever do that yourself!).
Before refactoring this into something that uses libraries that properly understand XML and HTML, I needed to assess some of the risks.
A major risk is to get the escaping (and unescaping) of XML and HTML right.
Time to finally organize some of my links on escaping HTML and XML that I had in my favourites list.
The starting point is the List of XML and HTML character entity references on Wikipedia. It is readable, complete and lists both kinds of escapes.
The official W3C text that describes XML escaping is hard to read.
There are only 5 predefined XML entities for characters that can (some must) be escaped. This table is derived from the Wikipedia article.
|Name||Character||Unicode code point (decimal)||Standard||When to escape (from the XML 1.0 standard)||Description|
|quot||"||U+0022 (34)||XML 1.0||To allow attribute values to contain both single and double quotes||double quotation mark|
|amp||&||U+0026 (38)||XML 1.0||Outside a comment, a processing instruction, or a CDATA section||ampersand|
|apos||'||U+0027 (39)||XML 1.0||To allow attribute values to contain both single and double quotes||apostrophe (= apostrophe-quote)|
|lt||<||U+003C (60)||XML 1.0||Outside a comment, a processing instruction, or a CDATA section||less-than sign|
|gt||>||U+003E (62)||XML 1.0||In content, when it appears in the string ]]> and that string is not marking the end of a CDATA section||greater-than sign|
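As a rough illustration of letting a library do this instead of string concatenation, here is a minimal Python sketch using the standard `xml.sax.saxutils` module. The helper names `xml_escape` and `xml_unescape` and the sample string are my own, not from the client's code; `escape()` only handles `&`, `<` and `>` by default, so the two quote characters are passed in as extra replacements.

```python
from xml.sax.saxutils import escape, unescape

# Extra replacements for the two quote characters; escape() already
# covers &, < and > on its own.
QUOTES = {'"': "&quot;", "'": "&apos;"}
UNQUOTES = {entity: char for char, entity in QUOTES.items()}


def xml_escape(text: str) -> str:
    """Escape all five predefined XML entities (hypothetical helper)."""
    return escape(text, QUOTES)


def xml_unescape(text: str) -> str:
    """Reverse xml_escape (hypothetical helper)."""
    return unescape(text, UNQUOTES)


sample = "Tom & Jerry's <\"quoted\"> text"
escaped = xml_escape(sample)
print(escaped)
# Tom &amp; Jerry&apos;s &lt;&quot;quoted&quot;&gt; text
print(xml_unescape(escaped) == sample)
# True
```

Note that this only covers character data and attribute values; it is not a substitute for building the document with a real XML library.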
HTML has many more escapes (252 as of the HTML 4 DTD, and 253 when you are using XHTML), mainly because of character set encoding issues (basically caused by software that sucks^H^H^H^H^Hfails to properly render or transfer textual data, etc).
Only a few characters are required to be escaped, as per the W3C lists for HTML 5 named characters, XHTML 1 special characters, HTML 4 entities, HTML 3 entities, HTML 2 entities, and the (non-W3C) list of HTML 4 and XHTML 1 entities:
- 4 when you use HTML (quot, amp, lt and gt)
- 5 when you use XHTML (the ones for XML: the HTML ones plus apos)
This is the complete list:
|Name||Character||Unicode code point (decimal)||Standard||DTD||Old ISO subset||Description|
|quot||"||U+0022 (34)||HTML 2.0||HTMLspecial||ISOnum||quotation mark (= APL quote)|
|amp||&||U+0026 (38)||HTML 2.0||HTMLspecial||ISOnum||ampersand|
|apos||'||U+0027 (39)||XHTML 1.0||HTMLspecial||ISOnum||apostrophe (= apostrophe-quote); see below|
|lt||<||U+003C (60)||HTML 2.0||HTMLspecial||ISOnum||less-than sign|
|gt||>||U+003E (62)||HTML 2.0||HTMLspecial||ISOnum||greater-than sign|
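To see how one library treats these characters, here is a small Python sketch using the standard `html.escape` function. The sample string and variable names are mine. It is only one implementation's choice, but it shows the apos problem nicely: Python escapes the apostrophe as a numeric reference rather than by name.

```python
import html

sample = "Tom & Jerry's <\"quoted\"> text"

# quote=False escapes only &, < and > (enough for element content).
content_only = html.escape(sample, quote=False)
print(content_only)
# Tom &amp; Jerry's &lt;"quoted"&gt; text

# quote=True (the default) also escapes both quote characters, which is
# what you want inside attribute values. The apostrophe becomes the
# numeric reference &#x27; instead of the named entity &apos;.
attr_safe = html.escape(sample)
print(attr_safe)
# Tom &amp; Jerry&#x27;s &lt;&quot;quoted&quot;&gt; text
```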
Handling XML and HTML escapes
Not all software escapes HTML or XML the same way. WordPress, for instance, uses two kinds of escaping for the HTML predefined entities.
In their categories tree view, they escape all of them using their names, except for the apostrophe ('), which they escape using a decimal code point with a leading zero: &#039;. That is for backward compatibility, as per the XHTML recommendation on &apos;.
But in their categories drop down combo box, they wrongly escape the double quote as &#8221; and the apostrophe as &#8216; (the others are OK: they are escaped using their names).
The reason is that their categories editor tries to outsmart you and behind your back makes these replacements (they even confuse left and right!):
- the double quote Unicode character 34 (the " quot quotation mark, U+0022) with the Unicode character 8221 (the ” rdquo right double quotation mark, U+201D)
- the apostrophe Unicode character 39 (the ' apos apostrophe, U+0027) with the Unicode character 8216 (the ‘ lsquo left single quotation mark, U+2018)
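A quick way to check what an escape really stands for is to decode it. The sketch below uses Python's standard `html.unescape`; the variable names are mine. The different spellings of the apostrophe (by name, decimal with or without a leading zero, hexadecimal) all decode to the same character, while code points 8221 and 8216 decode to the curly quotes, so the original characters are lost.

```python
import html

# Four ways to write the apostrophe; all decode to U+0027.
apostrophes = {ref: html.unescape(ref)
               for ref in ("&apos;", "&#039;", "&#39;", "&#x27;")}
print(all(char == "'" for char in apostrophes.values()))
# True

# The replacements made by the categories editor decode to
# typographic quotes, not to the ASCII characters that were typed.
curly = html.unescape("&#8221;&#8216;")
code_points = [f"U+{ord(char):04X}" for char in curly]
print(code_points)
# ['U+201D', 'U+2018']
print(curly == "\"'")
# False
```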