Decoding HTML encoded source to XML text
Posted by jpluimers on 2026/03/03
For Some links on getting the most recent defragmentation time of a Windows volume, I needed to copy some XML code back and forth between my ARM MacBook Pro and a remote Windows machine that I accessed via the Microsoft Windows App (the app formerly known as Microsoft Remote Desktop for Mac).
The problem was that copying lost the line breaks. That does not change the meaning of the XML, but it does hurt human readability while editing the XML in the Event Viewer query dialog.
So I went to the “Code” view in my Classic WordPress editor (did I ever tell you how much I dislike, especially for its accessibility, the not so new but still haughtily named Gutenberg editor?), copied the HTML encoded form, and wanted to convert it to unencoded XML text.
Well, here I arrived in naming confusion land, which I discuss further below, but first the potential solutions:
Note: the tools are listed in this particular order because I had forgotten about the naming confusion, and that CyberChef by GCHQ could be used as well.
- [Wayback/Archive] HTML Decode / Unescape – Online Tools from [Wayback/Archive] Online Tools (their main page [Wayback/Archive] emn178’s Online Tools refers to it) with source code at
- [Wayback/Archive] HTML Decode Online is the Best Tool to Decode HTML String, HTML URL and HTML File. from [Wayback/Archive] Code Beautify and Code Formatter For Developers – to Beautify, Validate, Minify, JSON, XML, JavaScript, CSS, HTML, Excel and more, about which I first blogged more than 10 years ago in Best Online XML Viewer, Formatter, Editor, Analyser, Beautify-Beautifier, Minify, Tree structure, Notepad, Marker
- [Wayback/Archive] CyberChef with the “From HTML Entity” example at [Wayback/Archive] From HTML Entity – CyberChef – the HTML Entity text to XML text conversion
- from
&lt;QueryList&gt; &lt;Query Id="0" Path="Application"&gt; &lt;Select Path="Application"&gt; *[System[Provider[@Name='Microsoft-Windows-Defrag'] and (Level=4 or Level=0) and (EventID=258)]] and *[EventData[Data[1]='defragmentation']] and *[EventData[Data[2]='(C:)']] &lt;/Select&gt; &lt;/Query&gt; &lt;/QueryList&gt;
- to
<QueryList> <Query Id="0" Path="Application"> <Select Path="Application"> *[System[Provider[@Name='Microsoft-Windows-Defrag'] and (Level=4 or Level=0) and (EventID=258)]] and *[EventData[Data[1]='defragmentation']] and *[EventData[Data[2]='(C:)']] </Select> </Query> </QueryList>
All three tools work entirely on the client side.
Both the first and last tools have source code on-line:
- [Wayback/Archive] GitHub – emn178/online-tools: Online tools provides md2, md5, sha1, sha2, sha512, bas64, html encode / decode functions
- [Wayback/Archive] GitHub – gchq/CyberChef: The Cyber Swiss Army Knife – a web app for encryption, encoding, compression and data analysis
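For those who prefer a local script over an online tool, the same conversion can be sketched with Python's standard library. This is a minimal sketch and not how any of the tools above are implemented: the function name `decode_and_indent` is made up for illustration, the re-indentation step is my own addition to bring back the lost line breaks, and `ET.indent` assumes Python 3.9 or newer.

```python
import html
import xml.etree.ElementTree as ET

# HTML encoded query as copied from an editor's code view
# (assumption: only < and > were encoded, as in the example above).
encoded = (
    '&lt;QueryList&gt; &lt;Query Id="0" Path="Application"&gt; '
    '&lt;Select Path="Application"&gt; '
    "*[System[Provider[@Name='Microsoft-Windows-Defrag'] "
    "and (Level=4 or Level=0) and (EventID=258)]] "
    "and *[EventData[Data[1]='defragmentation']] "
    "and *[EventData[Data[2]='(C:)']] "
    '&lt;/Select&gt; &lt;/Query&gt; &lt;/QueryList&gt;'
)


def decode_and_indent(text: str) -> str:
    """Decode HTML entities, then re-indent the resulting XML (hypothetical helper)."""
    decoded = html.unescape(text)      # &lt; becomes <, &gt; becomes >
    root = ET.fromstring(decoded)      # raises ParseError if the result is not well-formed XML
    ET.indent(root, space="  ")        # rewrites whitespace-only text between elements
    return ET.tostring(root, encoding="unicode")


xml_text = decode_and_indent(encoded)
print(xml_text)
```

The XPath expression inside `<Select>` is left untouched; only the whitespace between elements is replaced by line breaks and indentation.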
Terminology
Above you see these terms:
- HTML encoded
- HTML decode
- HTML unescape
- HTML entity
The last one is the formal term, and the entities needing decoding were &lt; and &gt;. They are explained in List of XML and HTML character entity references: List of character entity references in HTML – Wikipedia, from which I copied some table rows into this HTML table with a proper <thead>:
All named character entity references in HTML and XML

| Entities | Char. | Codepoints | Standard | DTD[b] | Old ISO subset[c] | Description[d] |
|---|---|---|---|---|---|---|
| … | … | … | … | … | … | … |
| &lt;[a] | < | U+003C | XML 1.0, HTML 5.0 | html.dtd, HTMLspecial | ISOnum | less-than sign |
| … | … | … | … | … | … | … |
| … | … | … | … | … | … | … |
| &gt;[a] | > | U+003E | XML 1.0, HTML 5.0 | html.dtd, HTMLspecial | ISOnum | greater-than sign |
| … | … | … | … | … | … | … |
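As a quick cross-check of the two rows, Python ships its own copies of the entity tables in `html.entities`. The sketch below only shows what Python's tables contain; it is not derived from the Wikipedia table, and the variable names are mine.

```python
import html
from html.entities import html5, name2codepoint

# name2codepoint holds the HTML 4 named entities, html5 the HTML 5 ones
# (the html5 keys include the trailing semicolon).
for name in ("lt", "gt"):
    codepoint = name2codepoint[name]
    char = html5[name + ";"]
    print(f"&{name}; -> {char!r} U+{codepoint:04X}")

# The named, decimal and hexadecimal references decode to the same character.
references = ["&lt;", "&#60;", "&#x3C;"]
decoded = [html.unescape(reference) for reference in references]
print(decoded)
```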
The DTD column of that table points to html.dtd, which is described in a table further down on that page as
HTML DTD entities subsets

| Name | Version | Formal public identifier | System identifier |
|---|---|---|---|
| … | … | … | … |
| HTMLspecial | HTML 4 | "-//W3C//ENTITIES Special//EN//HTML" | "http://www.w3.org/TR/html4/HTMLspecial.ent" (optional) |
| | XHTML 1 | "-//W3C//ENTITIES Special for XHTML//EN" | "http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent" |
| html.dtd[i] | N/A | | "http://info.cern.ch/MarkUp/html-spec/html.dtd" |
| … | … | … | … |
where the [i] points to
- The original HTML 1.0 DTD, which would have been available at http://info.cern.ch/MarkUp/html-spec/html.dtd
It implies that html.dtd did not and does not exist: it was never created in the turmoil of the early HTML days, hence its system identifier page [Wayback/Archive] http://info.cern.ch/MarkUp/html-spec/html.dtd returns an HTTP 404.
As a side note, there is no DTD for HTML 5 either: HTML 5 is no longer defined in terms of SGML, and not processing DTDs also avoids billion laughs attacks.
The HTMLspecial system identifiers do exist and have not changed since early this century. Here they are with the respective entries quoted:
20000712: [Wayback/Archive] http://www.w3.org/TR/html4/HTMLspecial.ent
<!ENTITY lt CDATA "<" -- less-than sign, U+003C ISOnum -->
<!ENTITY gt CDATA ">" -- greater-than sign, U+003E ISOnum -->
20020806: [Wayback/Archive] http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
<!ENTITY lt "<"> <!-- less-than sign, U+003C ISOnum -->
<!ENTITY gt ">"> <!-- greater-than sign, U+003E ISOnum -->
Wikipedia links
- Character encodings in HTML – Wikipedia
- List of XML and HTML character entity references – Wikipedia has sections
- HTML – Wikipedia has these sections
- HTML version timeline
- HTML draft version timeline
- XHTML versions (the former XHTML which is no longer developed)
Other links
I need to find where these links have moved to, as only the Wayback Machine and Archive.is links function. If a file looks unreadable, browse back through its Wayback Machine history, sometimes as far back as 2005 (encoding is a headache, and also the reason for the Mojibake below):
- [Wayback/Archive] HTML and XHTML Document Type Definitions has the
1. HTML 0 through HTML 1.0: Specification of 15 June 1995 [1]
- HTML Declaration
- [Wayback/Archive] html.decl
- DTDs (all require the base [Wayback/Archive] html.dtd in the entity catalog)
- [Wayback/Archive] html-0.dtd � – Level 0
- [Wayback/Archive] html-1.dtd – Level 1 – “normal”
- [Wayback/Archive] html-1s.dtd – Level 1s – “strict”
[Wayback/Archive] HTML Document Type Definitions has the same content but was only archived until 1999.
- [Wayback/Archive] HTML Document Type Definition References
- [Wayback/Archive] Index of /webdocs/HTMLdocs/HTML_Spec/HTML2-15Jun95
- [Wayback/Archive] http://lost-contact.mit.edu/afs/cern.ch/w3.org/www/MarkUp/Connolly/921125/archive.sh#html.dtd
# This is a shell archive. Remove anything before this line,
# then unpack it by saving it in a file and typing "sh file".
#
# Wrapped by imagine!connolly on Tue Nov 24 21:02:36 CST 1992
# Contents: html.dtd spec.html recommended.html complete.html tolerated.html
# supported.html deprecated.html errors.html
…
Queries
- html decoder – Google Search (could not be archived)
- [Wayback/Archive] html decoder at DuckDuckGo
- [Wayback/Archive] emn178.github.io at DuckDuckGo
- [Wayback/Archive] html 1.0 dtd at DuckDuckGo
- [Wayback/Archive] HTML and XHTML Document Type Definitions at DuckDuckGo
--jeroen





