The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

Need some help: parsing almost well formed XML fragments: how to skip over multiple XML headers – Stack Overflow

Posted by jpluimers on 2012/08/16

If anyone knows a better solution than string search/replace, please let me know:

I’m required to write a tool that can handle the XML fragment below, which is not well formed because it contains XML declarations in the middle of the stream.

The company has had these kinds of files in use for a long time, so there is no option to change the format.

There is no source code available for the existing parsing, and the platform of choice for new tooling is .NET 4 or newer, preferably with C#.

This is what the fragments look like:

<Header>
  <Version>1</Version>
</Header>
<Entry><?xml version="1.0"?><Detail>...snip...</Detail></Entry>
<Entry><?xml version="1.0"?><Detail>...snip...</Detail></Entry>
<Entry><?xml version="1.0"?><Detail>...snip...</Detail></Entry>
<Entry><?xml version="1.0"?><Detail>...snip...</Detail></Entry>

Using an XmlReader with XmlReaderSettings.ConformanceLevel set to ConformanceLevel.Fragment, I can read the complete <Header> element fine. Even the start of the <Entry> element is OK; however, while reading the <Detail> info, the XmlReader throws an XmlException, because it encounters the <?xml...?> XML declaration, which it doesn’t expect at that place.
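For reference, the reader setup described above can be sketched roughly as follows. The sample string and the class name are made up for illustration; the real input is a large file, not an in-memory string.

```csharp
using System;
using System.IO;
using System.Xml;

static class FragmentRepro
{
    static void Main()
    {
        // Hypothetical sample input, shaped like the fragments shown above.
        const string sample =
            "<Header><Version>1</Version></Header>" +
            "<Entry><?xml version=\"1.0\"?><Detail>x</Detail></Entry>";

        // Fragment mode allows several top-level elements in one stream.
        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment
        };

        using (var reader = XmlReader.Create(new StringReader(sample), settings))
        {
            try
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0}: {1}", reader.NodeType, reader.Name);
                }
            }
            catch (XmlException ex)
            {
                // Raised when the reader reaches the embedded XML declaration.
                Console.WriteLine(ex.Message);
            }
        }
    }
}
```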

The XML declarations might not be the same for every entry.

What options do I have to skip over those XML declarations, besides heavy string manipulations?

Since the fragments can easily exceed 100 megabytes apiece, I’d rather not load everything into memory at once. But if that is what it takes, I am open to it.
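One direction that avoids loading the whole file is to put a filtering TextReader between the file and the XmlReader, so the declarations are dropped while streaming. The sketch below is only an illustration under several assumptions: XmlDeclarationSkippingReader is a hypothetical helper (not part of .NET), the declarations never occur inside CDATA sections or comments, and all entries share one text encoding (any encoding named in a skipped declaration is ignored). It is still character-level filtering, just done on the fly instead of on an in-memory string.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper (not part of .NET): passes text through unchanged,
// except that every "<?xml ...?>" declaration is dropped.
sealed class XmlDeclarationSkippingReader : TextReader
{
    private const string Marker = "?xml";
    private const int None = -2;                 // means "nothing pushed back"
    private readonly TextReader inner;
    private readonly Queue<int> pending = new Queue<int>();
    private int pushedBack = None;

    public XmlDeclarationSkippingReader(TextReader inner) { this.inner = inner; }

    public override int Peek()
    {
        if (pending.Count == 0) Fill();
        return pending.Peek();
    }

    // The inherited Read(char[], int, int) calls this once per character.
    public override int Read()
    {
        if (pending.Count == 0) Fill();
        return pending.Dequeue();
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) inner.Dispose();
        base.Dispose(disposing);
    }

    private int ReadInner()
    {
        if (pushedBack == None) return inner.Read();
        int c = pushedBack;
        pushedBack = None;
        return c;
    }

    // Queues at least one item: ordinary text, or -1 at the end of the input.
    private void Fill()
    {
        while (true)
        {
            int c = ReadInner();
            if (c != '<') { pending.Enqueue(c); return; }

            var seen = new List<int> { c };
            int matched = 0;
            int next = ReadInner();
            while (matched < Marker.Length && next == Marker[matched])
            {
                seen.Add(next);
                matched++;
                next = ReadInner();
            }

            // "<?xml" followed by white space starts a declaration; anything
            // else (for example "<?xml-stylesheet") is passed through.
            if (matched == Marker.Length && next != -1 && char.IsWhiteSpace((char)next))
            {
                SkipPastDeclarationEnd();
                continue;
            }

            pushedBack = next;                   // re-examine this character later
            foreach (int ch in seen) pending.Enqueue(ch);
            return;
        }
    }

    private void SkipPastDeclarationEnd()
    {
        int previous = -1, current;
        while ((current = inner.Read()) != -1)
        {
            if (previous == '?' && current == '>') return;
            previous = current;
        }
    }
}

static class Program
{
    static void Main()
    {
        // Hypothetical sample; a real run would wrap a StreamReader over the file.
        const string sample =
            "<Header><Version>1</Version></Header>" +
            "<Entry><?xml version=\"1.0\"?><Detail>a</Detail></Entry>" +
            "<Entry><?xml version=\"1.0\" encoding=\"utf-8\"?><Detail>b</Detail></Entry>";

        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        using (var text = new XmlDeclarationSkippingReader(new StringReader(sample)))
        using (var reader = XmlReader.Create(text, settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element) Console.WriteLine(reader.Name);
            }
        }
    }
}
```

Only a handful of characters are buffered at any time, so memory use does not grow with the file size. The per-character Read() calls are slow for large inputs; overriding Read(char[], int, int) to filter in blocks would be a likely refinement. Line and position numbers reported by the XmlReader will refer to the filtered text, not the original file.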

Example of the exceptions I get:

System.Xml.XmlException: Unexpected XML declaration.
The XML declaration must be the first node in the document, and no white space characters are allowed to appear before it.
Line ##, position ##.

–jeroen

via: c# – parsing almost well formed XML fragments: how to skip over multiple XML headers – Stack Overflow.
