Decoding HTML encoded source to XML text
Posted by jpluimers on 2026/03/03
For Some links on getting the most recent defragmentation time of a Windows volume, I needed to copy some XML code back and forth between my ARM MacBook Pro and a remote Windows machine that I accessed via the Microsoft Windows App (the app formerly known as Microsoft Remote Desktop for Mac).
The problem was that copying lost the line breaks. That does not change the meaning of the XML, but it does hurt human readability while editing the XML in the Event Viewer query dialog.
So I went to the “Code” view in my Classic WordPress editor (did I ever tell you how much I dislike – especially the accessibility of – the not so new but still haughtily named Gutenberg editor?), copied the HTML encoded form and wanted to convert it to unencoded XML text.
Well, here I entered naming confusion land, which I will talk about further below, but first two of the potential solutions:
Note: the tools below are in this particular order because I had forgotten about the naming confusion and that CyberChef by GCHQ could be used as well.
- [Wayback/Archive] HTML Decode / Unescape – Online Tools from [Wayback/Archive] Online Tools (their main page [Wayback/Archive] emn178’s Online Tools refers to it), with source code on GitHub (linked further below)
- [Wayback/Archive] HTML Decode Online is the Best Tool to Decode HTML String, HTML URL and HTML File. from [Wayback/Archive] Code Beautify and Code Formatter For Developers – to Beautify, Validate, Minify, JSON, XML, JavaScript, CSS, HTML, Excel and more of which I blogged first more than 10 years ago in Best Online XML Viewer, Formatter, Editor, Analyser, Beautify-Beautifier, Minify, Tree structure, Notepad, Marker
- [Wayback/Archive] CyberChef with the “From HTML Entity” example at [Wayback/Archive] From HTML Entity – CyberChef – the HTML Entity text to XML text conversion
- from
&lt;QueryList&gt; &lt;Query Id="0" Path="Application"&gt; &lt;Select Path="Application"&gt; *[System[Provider[@Name='Microsoft-Windows-Defrag'] and (Level=4 or Level=0) and (EventID=258)]] and *[EventData[Data[1]='defragmentation']] and *[EventData[Data[2]='(C:)']] &lt;/Select&gt; &lt;/Query&gt; &lt;/QueryList&gt;
- to
<QueryList> <Query Id="0" Path="Application"> <Select Path="Application"> *[System[Provider[@Name='Microsoft-Windows-Defrag'] and (Level=4 or Level=0) and (EventID=258)]] and *[EventData[Data[1]='defragmentation']] and *[EventData[Data[2]='(C:)']] </Select> </Query> </QueryList>
All three tools work entirely on the client side.
Both the first and the last tool have their source code online:
- [Wayback/Archive] GitHub – emn178/online-tools: Online tools provides md2, md5, sha1, sha2, sha512, bas64, html encode / decode functions
- [Wayback/Archive] GitHub – gchq/CyberChef: The Cyber Swiss Army Knife – a web app for encryption, encoding, compression and data analysis
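For a scripted alternative to the online tools, a minimal sketch in Python could look like the one below. It assumes Python 3.9 or later (for the indent function in ElementTree) and uses only the standard library; the variable names are mine and the query is the one from this post. It decodes the HTML encoded text and then re-indents the XML, which brings back the line breaks that got lost while copying.

```python
import html
import xml.etree.ElementTree as ET

# HTML encoded form, as copied from the "Code" view of the editor.
encoded = (
    '&lt;QueryList&gt; &lt;Query Id="0" Path="Application"&gt; '
    '&lt;Select Path="Application"&gt; '
    "*[System[Provider[@Name='Microsoft-Windows-Defrag'] "
    "and (Level=4 or Level=0) and (EventID=258)]] "
    "and *[EventData[Data[1]='defragmentation']] "
    "and *[EventData[Data[2]='(C:)']] "
    "&lt;/Select&gt; &lt;/Query&gt; &lt;/QueryList&gt;"
)

# Replace character references such as &lt; and &gt; with the characters themselves.
decoded = html.unescape(encoded)

# Parse the decoded XML and re-indent it so each element starts on its own line.
root = ET.fromstring(decoded)
ET.indent(root)  # available since Python 3.9
pretty = ET.tostring(root, encoding="unicode")

print(decoded)
print(pretty)
```

The decoding step alone (html.unescape) is enough to get the XML text; the re-indenting is only there for readability in the query dialog.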
Terminology
Above you see these terms:
- HTML encoded
- HTML decode
- HTML unescape
- HTML entity
The last one is the formal term, and the entities needing decoding were &lt; and &gt;. They are explained in List of XML and HTML character entity references: List of character entity references in HTML – Wikipedia, from which I copied some table rows into this HTML table with a proper <thead>:
All named character entity references in HTML and XML
Entities | Char. | Codepoints | Standard | DTD[b] | Old ISO subset[c] | Description[d]
… | … | … | … | … | … | …
&lt;[a] | < | U+003C | XML 1.0, HTML 5.0 | html.dtd, HTMLspecial | ISOnum | less-than sign
… | … | … | … | … | … | …
&gt;[a] | > | U+003E | XML 1.0, HTML 5.0 | html.dtd, HTMLspecial | ISOnum | greater-than sign
… | … | … | … | … | … | …
What you see is that it points to html.dtd, which a table further down that page describes as
HTML DTD entities subsets
Name | Version | Formal public identifier | System identifier
… | … | … | …
HTMLspecial | HTML 4 | "-//W3C//ENTITIES Special//EN//HTML" | "http://www.w3.org/TR/html4/HTMLspecial.ent" (optional)
HTMLspecial | XHTML 1 | "-//W3C//ENTITIES Special for XHTML//EN" | "http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent"
html.dtd[i] | N/A | "http://info.cern.ch/MarkUp/html-spec/html.dtd"
… | … | … | …
where the [i] points to
- The original HTML 1.0 DTD, which would have been available at http://info.cern.ch/MarkUp/html-spec/html.dtd
This implies that html.dtd did not and does not exist: it was never created in the turmoil of the early HTML days, hence its system identifier page http://info.cern.ch/MarkUp/html-spec/html.dtd returns an HTTP 404.
As a side note, there is no DTD for HTML 5 either (due to billion laughs attacks).
The HTMLspecial system identifiers do exist and have not changed since early this century. Here they are with the respective entries quoted:
20000712: [Wayback/Archive] http://www.w3.org/TR/html4/HTMLspecial.ent
<!ENTITY lt CDATA "&#60;" -- less-than sign, U+003C ISOnum -->
<!ENTITY gt CDATA "&#62;" -- greater-than sign, U+003E ISOnum -->
20020806: [Wayback/Archive] http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
<!ENTITY lt "&#38;#60;"> <!-- less-than sign, U+003C ISOnum -->
<!ENTITY gt "&#62;"> <!-- greater-than sign, U+003E ISOnum -->
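To double-check the mapping in those entries, a small sketch in Python (standard library only; the loop and the variable names are just for illustration) can look up the code points for lt and gt and confirm that the named, decimal and hexadecimal references all decode to the same character. The last lines rely on lt and gt being predefined in XML 1.0, so an XML parser should resolve them without any DTD.

```python
import html
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

for name in ("lt", "gt"):
    codepoint = name2codepoint[name]  # 60 for lt, 62 for gt
    # The same character written as named, decimal and hexadecimal reference.
    references = (f"&{name};", f"&#{codepoint};", f"&#x{codepoint:X};")
    decoded = [html.unescape(reference) for reference in references]
    print(f"U+{codepoint:04X}", references, decoded)

# XML 1.0 predefines lt and gt, so this parses without a DTD.
element = ET.fromstring("<sample>&lt;QueryList&gt;</sample>")
print(element.text)
```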
Queries
- html decoder – Google Search (could not be archived)
- [Wayback/Archive] html decoder at DuckDuckGo
- [Wayback/Archive] emn178.github.io at DuckDuckGo
--jeroen





