The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

  • My badges

  • Twitter Updates

  • My Flickr Stream

  • Pages

  • All categories

  • Enter your email address to subscribe to this blog and receive notifications of new posts by email.

    Join 1,862 other subscribers

Archive for the ‘XML’ Category

Online XML Pretty Print

Posted by jpluimers on 2014/04/22

On my list of on-line tools: XML Pretty Print.

–jeroen

Posted in Development, Software Development, XML, XML/XSD | Leave a Comment »

A Generalised and Comprehensive Solution to CSV to XML and XML to CSV Transformations – Pascaliburnus

Posted by jpluimers on 2013/02/21

Two interesting XSLTs:

  • XSLT to parse CSV into XML,
  • XSLT to transform XML to CSV

–jeroen

via: A Generalised and Comprehensive Solution to CSV to XML and XML to CSV Transformations – Pascaliburnus.

Posted in CSV, Development, Software Development, XML, XML/XSD, XSD, XSLT | Leave a Comment »

.NET/C#: Generating a WordPress posting categories page – part 2

Posted by jpluimers on 2012/08/22

In Generating a WordPress posting categories page – part 1, you learned how to

  1. get the HTML with all the category information from your WordPress.com blog,
  2. convert that to XHTML,
  3. generate an XSD for the XHTML,
  4. generate C# classes from that XSD

This episode, you will learn how the data read from the XHTML can be transformed to a simple tree in HTML suited for a posting categories page like mine.

In the final post, the same data will be transferred into a real category cloud with font sizes indicating the frequency of the category usage.

From there, you can go into other directions (for instance generating squarified treemaps from the data).

That’s the cool thing about data: there are many ways to visualize, and this series was meant – next to some groundwork on how to get the data – as inspiration into some forms of visualization.
Hope you had fun reading it!

Getting a HTML tree from the optionType items

                StringBuilder outputHtml = new StringBuilder();
                string rootUrl = args[1];
                foreach (optionType item in select.option)
                {
                    if (item.Level == optionType.RootLevel)
                        continue;

                    // <a style="font-size: 100.3986332574%; padding: 1px; margin: 1px;" title="XML/XSD (23)" href="https://wiert.me/category/development/xmlxsd/">XML/XSD</a>
                    string url = String.Format("{0}/category{1}", rootUrl, item.HtmlPath);
                    string prefix = new string('.', item.Level * 5);// optionType.NbspEscaped.Repeat(item.Level);
                    outputHtml.AppendFormat("{0}<a title="{2}" href="{1}">{2} ({3})</a>", prefix, url, item.Category, item.Count);
                    outputHtml.AppendLine();
                }

One way of generating an HTML tree, is to prefix every node with a series of dots corresponding with the level of that node. Not the most pretty sight, but it will suffice for this episode.

Inside each node, I want to show the Category and Count.

Since the optionType as generated from the XSD only contains the below properties, a major portion on this posting is how to decode the Value so we can generate HTML like this:

...............<a href='https://wiert.me/category/development/software-development/net/c-' title='C#'>C#&nbsp;(118)</a>
....................<a href='https://wiert.me/category/development/software-development/net/c-/c--2-0' title='C# 2.0'>C# 2.0&nbsp;(46)</a>
....................<a href='https://wiert.me/category/development/software-development/net/c-/c--3-0' title='C# 3.0'>C# 3.0&nbsp;(33)</a>
....................<a href='https://wiert.me/category/development/software-development/net/c-/c--4-0' title='C# 4.0'>C# 4.0&nbsp;(31)</a>
....................<a href='https://wiert.me/category/development/software-development/net/c-/c--5-0' title='C# 5.0'>C# 5.0&nbsp;(2)</a>

Decoding the optionType Value property

optionType only contains the these properties:

  1. class
    • the class used to reference the style in the stylesheet
    • example value: “level-4”
  2. value
    • internal unique WordPress ID for the category (this allows you to alter the Slug and Value without breaking the links between posts and categories
    • example value: “45149061”
  3. Value
    • string that WordPress uses to make the category combobox look like a tree structure
    • example value: “&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;C# 5.0&nbsp;&nbsp;(2)”

The extra properties needed for the HTML generation logic above are these:

  1. Category
    • the Value undone from leading non breaking space character escapes, and the trailing count information
    • example value: C# 5.0
  2. Count
    • the Value undone from leading non breaking space character escapes, Caption information, separator non breaking space character escapes, and surrounding parenthesis
    • example value: 2
  3. Level
    • the class undone from the level- prefix
    • example value: 4
  4. Slug
    • the category slug is the unique value for a category that WordPress uses to form category urls. It is auto-generated from the Category, but you can also edit it. I don’t, as it is not in the combobox HTML, so I derive it from the Category. Note there are also posting slugs used in the permalink of each post.
    • example value: c--5-0 (it consists of lowercase letters and hyphens derived from the Category)
  5. HtmlPath
  6. parent (used internally for making the HtmlPath code much easier

The really cool thing about XSD2Code is that it generated the optionType C# code as a partial class.
Which means we can extend the generated partial classes in a seperate C# file like the code fragments below (you can download it from the WordPressCategoriesDropDown.cs file at BeSharp.CodePlex.com)

    partial class optionType
    {
        public const int RootLevel = -1;

        private const string slash = "/";
        private const char hyphen = '-';
        public const string NbspEscaped = "&nbsp;";
        private const string emptyCountInParenthesis = "(-1)";

        public optionType parent { get; set; }

        public string Category
        {
            get
            {
                string result;
                string countInParenthesis;
                splitValue(out result, out countInParenthesis);
                return result;
            }
        }

        public int Count
        {
            get
            {
                string category;
                string countInParenthesis;
                splitValue(out category, out countInParenthesis);
                string count = countInParenthesis.Substring(1, countInParenthesis.Length - 2);
                int result = int.Parse(count);
                return result;
            }
        }

        public int Level
        {
            get
            {
                if (string.IsNullOrWhiteSpace(@class))
                    return RootLevel;
                string[] split = @class.Split(hyphen);
                string number = split[1];
                int result = int.Parse(number);
                return result;
            }
        }

        /// <summary>
        /// This is the HTML part that WordPress uses to reference a Category
        /// </summary>
        public string Slug
        {
            get
            {
                StringBuilder result = new StringBuilder();
                foreach (char item in Category)
                {
                    if (char.IsLetterOrDigit(item))
                        result.Append(item.ToString().ToLower());
                    else
                        if (result.Length > 0)
                            result.Append(hyphen);
                }
                return result.ToString();
            }
        }

        public string HtmlPath
        {
            get
            {
                if (RootLevel == Level)
                    return string.Empty;

                string result = Slug;
                if (null != parent)
                    result = parent.HtmlPath + slash + result;
                return result;
            }
        }

        private void splitValue(out string category, out string countInParenthesis)
        {
            // might want to do this using RegEx, but that is a write-only language https://wiert.me/category/development/software-development/regex/
            string result = Value;
            int nbspCountToStripFromLeftOfValue = Level * 3; // strip 3 &nbsp; for each Level
            for (int i = 0; i < nbspCountToStripFromLeftOfValue; i++)
            {
                int nbspEscapedLength = NbspEscaped.Length;
                if (result.StartsWith(NbspEscaped))
                    result = result.Substring(nbspEscapedLength, result.Length - nbspEscapedLength);
            }
            string doubleNbspEscaped = NbspEscaped + NbspEscaped;
            if (result.Contains(doubleNbspEscaped))
            {
                string[] separators = new string[] { doubleNbspEscaped };
                string[] split = result.Split(separators, StringSplitOptions.None);
                category = split[0];
                countInParenthesis = split[1];
            }
            else
            {
                category = result;
                countInParenthesis = emptyCountInParenthesis;
            }
        }

        public override string ToString()
        {
            string result = string.Format("Level {0}, Count {1}, Slug {2}, HtmlPath {3}, Category '{4}'", Level, Count, Slug, HtmlPath, Category);
            return result;
        }
    }

The bulk of the above code is in the splitValue method (that could have used RegEx, but I try to avoid RegEx when I can do without it).
Note that the HtmlPath propererty uses the parent property. Without it, the HtmlPath code would have been very complex. The value of the parent properties for all optionType instances is generated in the selectType.FixParents method below since the selectType instance contains all the optionType instances in its’ option property.

    partial class selectType
    {
        public void FixParents()
        {
            Stack<optionType> itemStack = new Stack<optionType>();
            optionType parent = null;
            int previousLevel = optionType.RootLevel;

            foreach (optionType item in option)
            {
                int itemLevel = item.Level;
                if (itemLevel == previousLevel)
                {
                    if (optionType.RootLevel != itemLevel)
                    {
                        itemStack.Pop();
                        item.parent = parent;
                    }
                    itemStack.Push(item);
                }
                else
                {
                    if (itemLevel > previousLevel)
                    {
                        parent = itemStack.Peek();
                    }
                    else
                    {
                        do
                        {
                            itemStack.Pop();
                            parent = itemStack.Peek();
                            previousLevel = parent.Level;
                        }
                        while (previousLevel >= itemLevel);
                    }
                    itemStack.Push(item);
                    item.parent = parent;
                    previousLevel = itemLevel;
                }
            }
        }
    }

–jeroen

Posted in .NET, C#, C# 4.0, C# 5.0, Development, LINQ, Software Development, Usability, User Experience (ux), Web Development, WordPress, WordPress, XML, XML escapes, XML/XSD, XSD | 4 Comments »

Need some help: parsing almost well formed XML fragments: how to skip over multiple XML headers – Stack Overflow

Posted by jpluimers on 2012/08/16

If anyone knows a better solution than string search/replace, please let me know:

I’m required to write a tool that can handle the below XML fragment that is not well formed as it contains XML declarations in the middle of the stream.

The company already has these kinds files in use for a long time, so there is no option to change the format.

There is no source code available that does the parsing, and the platform of choice for new tooling is .NET 4 or newer preferably with C#.

This is how the fragments look like: Read the rest of this entry »

Posted in .NET, .NET 2.0, .NET 3.0, .NET 3.5, .NET 4.0, C#, C# 2.0, C# 3.0, C# 4.0, Development, Software Development, XML, XML/XSD | Leave a Comment »

.NET/C#: Generating a WordPress posting categories page – part 1

Posted by jpluimers on 2012/07/31

From the category cloud it is hard to see that the categories are organized as a hierarchy. The combobox on the right shows that, but does not have room to properly show the hierarchy. Since WordPress.com does not allow you to deploy your own code, I worked around it in this way using a small .NET C# console program:

  1. Extract the HTML for the All Categories combobox on the right of the page.
  2. Convert that HTML to XHTML (and therefore XML)
  3. Generate XSD from that XML
  4. Generate C# class wrappers from the XSD

Future posts will show more logic on how to handle the imported information, and generate nice category overviews. Preliminary source code is at the BeSharp.net source repository.

Extract the HTML

The HTML is not fully accurate (see my post on HTML and XML escapes from last week), but it is fairly easy to extract. Most web browsers allow you to view the source of your web page. Do that, then search for “All Categories”. Now you see HTML like this:

</pre>
<h2 class="widgettitle">All categories</h2>
<pre><select class="postform" name="cat"><option value="-1">Select Category</option></select><select class="postform" name="cat"><option class="level-0" value="256">About  (66)</option></select><select class="postform" name="cat"><option class="level-1" value="64">   Personal  (60)</option></select><select class="postform" name="cat"><option class="level-2" value="20254983">      Adest Musica  (7)</option></select><select class="postform" name="cat"><option class="level-2" value="32122">      Certifications  (2)</option></select><select class="postform" name="cat">...</select><select class="postform" name="cat"><option class="level-0" value="756">Comics  (3)</option></select><select class="postform" name="cat"><option class="level-0" value="780">Development  (473)</option></select><select class="postform" name="cat"><option class="level-1" value="872460">   Database Development  (55)</option></select><select class="postform" name="cat">...</select><select class="postform" name="cat"><option class="level-0" value="9280">User Experience  (3)</option></select>

I don’t need the H2 heading line, but the rest I do need to generate XML from. I saved the HTML into a text file for processing by the console app.

Convert the HTML to XML

The HTML contains loads of &nbsp;, but XML does not allow for that entity. So the & ampersand needs to be escaped into &amp;This also solves other uses of & in the HTML. The rest of the HTML is XHTML compliant, so does not require change, which results into this C# conversion method:

        private static string toXml(string inputHtml)
        {
            string result = inputHtml.Replace("&", "&");
            return result;
        }

Generate an XSD for the XML, then amend the XSD

Given my comparison of tools for generating XSD from XML, so I used the XmlForAsp XML Schema generator, with the “Separate Complex Types” option. (Note: I will link to the XSD before/after, as WordPress – yet again – screws the XSD sourcecode in the post; this should do for now). That gives me XSD like this (XML is also at pastebin):

<?xml version="1.0" encoding="utf-8"?>
<xsd:schema attributeFormDefault="unqualified" elementFormDefault="qualified" version="1.0" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <xsd:element name="select" type="selectType" />
 <xsd:complexType name="selectType">
  <xsd:sequence>
   <xsd:element maxOccurs="unbounded" name="option" type="optionType" />
  </xsd:sequence>
  <xsd:attribute name="name" type="xsd:string" />
  <xsd:attribute name="id" type="xsd:string" />
  <xsd:attribute name="class" type="xsd:string" />
 </xsd:complexType>
 <xsd:complexType name="optionType">
  <xsd:attribute name="value" type="xsd:int" />
 </xsd:complexType>
</xsd:schema>

Which is not complete, but gives a good start. The actual XSD it needs to be like this with a more elaborate optionType complex type that also defines it’s own content as deriving from xsd:string, and adds the class attribute (XML is also at pastebin):

<?xml version="1.0" encoding="utf-8"?>
<xsd:schema attributeFormDefault="unqualified" elementFormDefault="qualified" version="1.0" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <xsd:element name="select" type="selectType" />
 <xsd:complexType name="selectType">
  <xsd:sequence>
   <xsd:element maxOccurs="unbounded" name="option" type="optionType" />
  </xsd:sequence>
  <xsd:attribute name="name" type="xsd:string" />
  <xsd:attribute name="id" type="xsd:string" />
  <xsd:attribute name="class" type="xsd:string" />
 </xsd:complexType>
 <xsd:complexType name="optionType">
  <xsd:simpleContent>
  <xsd:extension base="xsd:string">
   <xsd:attribute name="class" type="xsd:string" />
   <xsd:attribute name="value" type="xsd:int" />
  </xsd:extension>
 </xsd:simpleContent>
 </xsd:complexType>
</xsd:schema>

Generate C# classes from the XSD

You can generate C# wrapper classes using the XSD.exe tool that ships with Visual Studio, but XSD.exe is hard to use, is hard to integrate into Visual Studio (despite Microsoft Connect request for it), the XSD.exe generated code still needs work for deserializing, and XSD.exe has very limited generation options (heck, after it changed from .NET 1.x to 2.0, it hasn’t been updated for about a decade). XSD2Code has some great reviews, to I used that in stead. And indeed, very well integrates into Visual Studio 2010, and generates very nice C#, especially when you use the options (see also the screenshot on the right):

  • Under Serialization, set Enabled to True
  • Under Serialization, set GenerateXmlAttributes to True

That way, loading the HTML, converting it to XML, then deserializing it into object instances is as simple as this:

                string inputFileName = args[0];
                string inputHtml = getHtml(inputFileName);
                string xml = toXml(inputHtml);
                selectType select = selectType.Deserialize(xml);

More on actually working with the loaded instances in the next episode, including the great benefit of XSD2Code: it generates C# code as partial classes.

–jeroen

Posted in .NET, C#, C# 4.0, C# 5.0, Development, SocialMedia, Software Development, Usability, User Experience (ux), Web Development, WordPress, WordPress, XML, XML escapes, XML/XSD, XSD | 2 Comments »

XML and HTML escapes

Posted by jpluimers on 2012/07/26

While reviewing some client’s code, I noticed they were generating and parsing XML and HTML by hand (do not ever do that yourself!).

Before refactoring this into something that uses libraries that properly understand XML and HTML, I needed to assess some of the risks.

A major risk is to get the escaping (and unescaping) of XML and HTML right.

Time to finally organize some of my links on escaping HTML and XML that I had in my favourites list.

The starting point is the List of XML and HTML character entity references on Wikipedia. It is readable, complete and lists both kinds of escapes.

XML escapes

The official W3C text that describes XML escaping is hard to read.

There are only 5 predefined XML entities for characters that can (some must) be escaped. This table is derived from the Wikipedia article.

Name Character Unicode code point
(decimal)
Standard When to escape (from the XML 1.0 standard) Description
quot U+0022 (34) XML 1.0 To allow attribute values to contain both single and double quotes double quotation mark
amp & U+0026 (38) XML 1.0 Outside  comment, a processing instruction, or a CDATA section ampersand
apos U+0027 (39) XML 1.0 To allow attribute values to contain both single and double quotes apostrophe (= apostrophe-quote)
lt < U+003C (60) XML 1.0 Outside  comment, a processing instruction, or a CDATA section less-than sign
gt > U+003E (62) XML 1.0 in content, when that string is not marking the end of a CDATA section greater-than sign

HTML escapes

Read the rest of this entry »

Posted in " quot, & amp, > gt, < lt, ' apos, ASCII, Development, Encoding, HTML, Power User, SocialMedia, Software Development, Unicode, Web Development, WordPress, XML, XML escapes, XML/XSD | 1 Comment »

Some on-line XPath testing tools that support namespaces and XPath functions

Posted by jpluimers on 2012/05/24

In the previous post, you saw some XPath to find the ProviderManifestToken attribute in an .edmx file.

Writing XPath queries can be quite consuming, so it is nice of you can test it somewhere on-line.

Out of the on-line XPath testers, these two worked best:

Both of them support namespaces, including these XPath functions: local-name and namespace-uri.

These failed:

If you have more on-line tools that work: please let me know!

Edit (20120524T0930):

Danny Thorpe tweeted an interesting comment:

@jpluimers If you’re compiling XPath support, make note of whether it’s XPATH 1.0 or 2.0. I think .NET only implements XPATH 1.0

Indeed, .NET only supports XPath 1.0, and worse, .NET 4 introduced a couple of issues with XSLT (see XPath and XSLT 2.0 for .NET? – Stack Overflow).

Luckily there are some goot XPath 2.0 libraries for .NET (see XPath 2.0 Libraries for .Net – Stack Overflow).

–jeroen

via: XPath tools on not-another-blog

Posted in Development, Software Development, XML, XML/XSD, XPath | Leave a Comment »

Excel XML Spreadsheet: going the XSLT way

Posted by jpluimers on 2011/09/01

Because of Excel XML Spreadsheet: Date.Type is mandatory :), I researched a bit more about how people use the XML Spreadsheet format.

I found that Technobabble is using XSL to generate Excel XML Spreadsheets.

Be patient following the above link; it takes a while to completely load, and while loading the style sheet shows black text on a dark grey background :) When loaded completely, the background becomes white.

The article talks more about styles and formulas (XML Spreadsheets store formulas in relative R1C1 style notation which is far easier to parse than the absolute A1 style notation, which most people use).

It seems to imply that for a ss:Column, a ss:AutoFitWidth of 1 (True) combined with a specified ss:Width will autosize text values, but the ss:Column documentation is right: “We do not autofit textual values”.

The default Excel 2003 font has a Width of about 7 points per character, so you can use that as a base to calculate ss:Width values for a column.

–jeroen

Posted in Development, Software Development, XML, XML/XSD, XSD | Leave a Comment »

Excel XML Spreadsheet: Styles and formatting

Posted by jpluimers on 2011/08/25

I found some time to continue my series that started with Excel XML Spreadsheet: Date.Type is mandatory :)

This time, it is about Styles and using the styles to format. I’ll limit myself to formatting Columns, but you can equally apply this to individual Cells, Rows, and Tables.

Note that in the below XML listings, I have replaced the angle brackets with { and }, because the WordPress editor will otherwise delete the XML from the sourcecode portions.

First, let’s look at some ss:Styles:

 {Styles}
  {Style ss:ID="Default" ss:Name="Normal"}
   {Alignment ss:Vertical="Bottom"/}
  {/Style}
  {Style ss:ID="s21"}
   {NumberFormat ss:Format="yyyy/mm/dd"/}
  {/Style}
  {Style ss:ID="s22"}
   {NumberFormat ss:Format="yyyy/mm/dd\ h:mm:ss"/}
  {/Style}
  {Style ss:ID="s31"}
   {NumberFormat ss:Format="[ENG][$-409]ddd"/}
  {/Style}
  {Style ss:ID="s32"}
   {NumberFormat ss:Format="[$-F800]dddd\,\ mmmm\ dd\,\ yyyy"/}
  {/Style}
 {/Styles}

Then the usage of the styles in Columns:

   {Column ss:StyleID="s21" ss:Width="53.25"/}
   {Column ss:Index="4" ss:StyleID="s31" ss:Width="89.25"/}
   {Column ss:StyleID="s22" ss:Width="95.25"/}
   {Column ss:StyleID="s32" ss:Width="95.25"/}

First a few remarks about the ss:Styles:

  1. Styles have IDs, which don’t need to be in the form s##, you can use any unique ID for them. Excel uses s## because that’s how the formatting pick-list works.
  2. You specify the formatting as a ss:NumberFormat using the components from the Creating international number formats documentation.
  3. You can add an Excel specific LCID (locale identifier) to a format. Without it, it will use the user’s locale settings.
  4. You can ommit the language hint (like [ENG]) from the formatting.
  5. The Excel LCID is very similar to the LCID Structure using hexadecimal values from the (old now defunct Locale ID Chart and replaced by the new) Microsoft Locale ID Values,  Language Identifier Constants and Strings table or list of Locale IDs Assigned by Microsoft, but with a few twists.
  6. There is a lot of confusion about [$-F800] and [$-F400] which actually behaves as LANG_SYSTEM_DEFAULT (0x0800 in the latter table), where [$-F800] displays the long date and [$-F400] displays the time (as correctly identified in this Openoffice Bugzilla bug report – or the Google cache of it).
  7. Three digit language [$-409] should be extended to 4-digit LCID 0x0409. It will format the cell using that specific language (in this case: English 3-letter weekday abbreviation).

A few remarks about the ss:Columns:

  1. These columns define formatting for column A, D, E and F.
  2. You don’t need to have a definition for every column in your Worksheet.Table, just for the ones that need formatting.
  3. The Column definition is smart: it can be sparse! After each gap, define a column having an ss:Index attribute, then continue defining subsequent columns until you need another gap.
  4. You can ommit the ss:Width attribute: when empty, the column will auto-size

–jeroen

PS: Rob van Gelder posted a nice formula to show nice translations using Excel formatting.

Posted in Development, internatiolanization (i18n) and localization (l10), Software Development, XML, XML/XSD, XSD | Leave a Comment »

Excel XML Spreadsheet: watch your Worksheet.Table.ExpandedColumnCount (or just leave it out)

Posted by jpluimers on 2011/08/11

Recently, I blogged about the Excel XML Spreadsheet: Date.Type is mandatory.

Here is another error you can get:

XML ERROR in Worksheet Setting
REASON:	Bad Value
FILE:	V:\CAS400\docs\statistics.cascall-logs\cascall.JU03HV.Bobbe, P.J.L. (Paul).20110620.xml
GROUP:	Worksheet
TAG:	Table
ATTRIB:	ExpandedColumnCount
VALUE:	9

This basically means your Worksheet.Table.ExpandedColumnCount is too small to hold your spreadsheet.

In other words: you either have more Worksheet.Table.Column definitions than ExpandedColumnCount, or you have a WorkSheet.Table.Row that has more WorkSheet.Table.Row.Cell entries than the ExpandedColumnCount.

In my case, I made an error generating too many Column definitions (by accidentaly duplicating some of them), and invalidated the ExpandedColumnCount.

Lesson learned: Since Excel will automatically calculate an absent ExpandedColumnCount, you can safely leave it out.

–jeroen

Posted in Development, Software Development, XML, XML/XSD, XSD | Leave a Comment »