The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

  • My badges

  • Twitter Updates

  • My Flickr Stream

  • Pages

  • All categories

  • Enter your email address to subscribe to this blog and receive notifications of new posts by email.

    Join 1,862 other subscribers

Archive for the ‘Encoding’ Category

Delphi and C# compiler oddities

Posted by jpluimers on 2013/01/08

When developing in multiple languages, it sometimes is funny to see how they differ in compiler oddities.

Below are a few on const examples.

Basically, in C# you cannot go from a char const to a string const, and chars are a special kind of int.

In Delphi you cannot go from a string to a char. Read the rest of this entry »

Posted in .NET, ASCII, C#, C# 1.0, C# 2.0, C# 3.0, C# 4.0, C# 5.0, Delphi, Delphi 2009, Delphi 2010, Delphi XE, Delphi XE2, Development, Encoding, Software Development, Unicode | Leave a Comment »

If you think CSV is easy; think again!

Posted by jpluimers on 2012/12/05

Lots of people think CSV is easy: it’s just a bunch of values separated with commas. But in practice it is not. Various reasons can make CSV very hard, especially since “CSV” is not a single, well-defined format. As always importing is always harder than exporting. A few reasons that make it hard:

A few links that helped me a lot getting input and output of CSV right in C#:

Thanks to Jabulaza:

–jeroen

via: Comma-separated values – Wikipedia, the free encyclopedia.

Posted in Conference Topics, Conferences, CSV, Development, EBCDIC, Encoding, Event, Software Development | 4 Comments »

.NET/C# duh moment of the day: “A char can be implicitly converted to ushort, int, uint, long, ulong, float, double, or decimal (not the other way around; implicit != implicit)”

Posted by jpluimers on 2012/11/20

A while ago I had a “duh” moment while calling a method that had many overloads, and one of the overloads was using int, not the char I’d expect.

The result was that a default value for that char was used, and my parameter was interpreted as a (very small) buffer size. I only found out something went wrong when writing unit tests around my code.

The culprit is this C# char feature (other implicit type conversions nicely summarized by Muhammad Javed):

A char can be implicitly converted to ushort, int, uint, long, ulong, float, double, or decimal. However, there are no implicit conversions from other types to the char type.

Switching between various development environments, I totally forgot this is the case in languages based on C and Java ancestry. But not in VB and Delphi ancestry  (C/C++ do numeric promotions of char to int and Java widens 2-byte char to 4-byte int; Delphi and VB.net don’t).

I’m not the only one who was confused, so Eric Lippert wrote a nice blog post on it in 2009: Why does char convert implicitly to ushort but not vice versa? – Fabulous Adventures In Coding – Site Home – MSDN Blogs.

Basically, it is the C ancestry: a char is an integral type always known to contain an integer value representing a Unicode character. The opposite is not true: an integer type is not always representing a Unicode character.

Lesson learned: if you have a large number of overloads (either writing them or using them) watch for mixing char and int parameters.

Note that overload resolution can be diffucult enough (C# 3 had breaking changes and C# 4 had breaking changes too, and those are only for C#), so don’t make it more difficult than it should be (:

Below a few examples in C# and VB and their IL disassemblies to illustrate their differnces based on asterisk (*) and space ( ) that also show that not all implicits are created equal: Decimal is done at run-time, the rest at compile time.

Note that the order of the methods is alphabetic, but the calls are in order of the type and size of the numeric types (integral types, then floating point types, then decimal).

A few interesting observations:

  • The C# compiler implicitly converts char with all calls except for decimal, where an implicit conversion at run time is used:
    L_004c: call valuetype [mscorlib]System.Decimal [mscorlib]System.Decimal::op_Implicit(char)
    L_0051: call void CharIntCompatibilityCSharp.Program::writeLineDecimal(valuetype [mscorlib]System.Decimal)
  • Same for implicit conversion of byte to the other types, though here the C# and VB.NET compilers generate slightly different code for run-time conversion.
    C# uses an implicit conversion:
    L_00af: ldloc.1
    L_00b0: call valuetype [mscorlib]System.Decimal [mscorlib]System.Decimal::op_Implicit(uint8)
    L_00b5: call void CharIntCompatibilityCSharp.Program::writeLineDecimal(valuetype [mscorlib]System.Decimal)
    VB.NET calls a constructor:
    L_006e: ldloc.1
    L_006f: newobj instance void [mscorlib]System.Decimal::.ctor(int32)
    L_0075: call void CharIntCompatibilityVB.Program::writeLineDecimal(valuetype [mscorlib]System.Decimal)

Here is the example code: Read the rest of this entry »

Posted in .NET, Agile, Algorithms, C#, C# 1.0, C# 2.0, C# 3.0, C# 4.0, C# 5.0, C++, Delphi, Development, Encoding, Floating point handling, Java, Software Development, Unicode, Unit Testing, VB.NET | 1 Comment »

XML and HTML escapes

Posted by jpluimers on 2012/07/26

While reviewing some client’s code, I noticed they were generating and parsing XML and HTML by hand (do not ever do that yourself!).

Before refactoring this into something that uses libraries that properly understand XML and HTML, I needed to assess some of the risks.

A major risk is to get the escaping (and unescaping) of XML and HTML right.

Time to finally organize some of my links on escaping HTML and XML that I had in my favourites list.

The starting point is the List of XML and HTML character entity references on Wikipedia. It is readable, complete and lists both kinds of escapes.

XML escapes

The official W3C text that describes XML escaping is hard to read.

There are only 5 predefined XML entities for characters that can (some must) be escaped. This table is derived from the Wikipedia article.

Name Character Unicode code point
(decimal)
Standard When to escape (from the XML 1.0 standard) Description
quot U+0022 (34) XML 1.0 To allow attribute values to contain both single and double quotes double quotation mark
amp & U+0026 (38) XML 1.0 Outside  comment, a processing instruction, or a CDATA section ampersand
apos U+0027 (39) XML 1.0 To allow attribute values to contain both single and double quotes apostrophe (= apostrophe-quote)
lt < U+003C (60) XML 1.0 Outside  comment, a processing instruction, or a CDATA section less-than sign
gt > U+003E (62) XML 1.0 in content, when that string is not marking the end of a CDATA section greater-than sign

HTML escapes

Read the rest of this entry »

Posted in " quot, & amp, > gt, < lt, ' apos, ASCII, Development, Encoding, HTML, Power User, SocialMedia, Software Development, Unicode, Web Development, WordPress, XML, XML escapes, XML/XSD | 1 Comment »

Searchable UTF8 Unicode Characters

Posted by jpluimers on 2012/05/31

Brilliant: site where you can Search UTF8 Unicode Characters.

If you know part of the name of a Unicode character, you can now try and find it, then copy/paste it from that site.

Edit: 20160403: that site disappeared, but this one works: Unicode Character Search and Shapecatcher.com: Unicode Character Recognition still works.

–jeroen

Posted in Development, Encoding, Power User, Software Development, Unicode, UTF-8, UTF8 | Leave a Comment »

SQL Server 2000 (and probably later) other reason for System.Data.SqlClient.SqlException: A severe error occurred on the current command. (via SQL Server Forums)

Posted by jpluimers on 2012/04/24

While transitioning from SQL Server 2000 to 2008, I recently had the “A severe error occurred on the current command. The results, if any, should be discarded.”  occurring on SQL Server 2000 in the form as shown at the bottom of this message.

Many of the search results point you into the area of atabase corruption, or in using NVARCAR parameters with SQL Server 2000 or SQL Server 2005 (the app didn’t use NVARCAR, nor did it use large VARCHAR parameters).

The cool thing on the SQL Server Forums – System.Data.SqlClient.SqlException: A severe error occurred on the current command post was that it summed up causes, and asked for more:

Posted – 06/17/2004 :  15:05:20

Rashid writes “Hi: Gurus I am getting these errors when I try to execute my application. According to MS knowledge base (http://support.microsoft.com/default.aspx?scid=kb;en-us;827366) these errors happen due to following resons

  1. You use a SqlClient class in a Finalize method or in a C# destructor.
  2. You do not specify an explicit SQLDbType enumeration when you create a SqlParameter object. When you do not specify an explicit SQLDbType, the Microsoft .NET Framework Data Provider for SQL Server (SqlClient) tries to select the correct SQLDbType based on the data that is passed. SqlClient is not successful.
  3. The size of the parameter that you explicitly specify in the .NET Framework code is more than the maximum size that you can use for the data type in Microsoft SQL Server.

None of these are true in my case. Are there any other reasons that can cause these problems..

There is one more: sending huge SQL Statements to your SQL Server is always a bad idea and gives this error too. Read the rest of this entry »

Posted in .NET, C#, C# 2.0, C# 3.0, C# 4.0, Database Development, Development, Encoding, Software Development, SQL Server, SQL Server 2000, SQL Server 2008 R2, Unicode | Leave a Comment »

Some words on Unicode in Windows (Delphi, .NET, APIs, etc)

Posted by jpluimers on 2012/04/05

O'Reilly book "Unicode Explained: Internationalize Documents, Programs, and Web Sites"

O'Reilly book "Unicode Explained: Internationalize Documents, Programs, and Web Sites"

Withe the growing integration between systems, and the mismatch between those that support Unicode and that do not, I find that a lot of organisations lack basic Unicode knowledge.

So lets put down a few things, that helps as a primer and gets some confusion out of the way.

Please read the article on Unicode by Joel on Software, and the book Unicode Explained. The book is from 1996, and still very valid.

Unicode

Unicode started in the late 80s of last century as a 16-bit character model.

Somehow lots of people still thing Unicode is a 16-bit double-byte character set. It is not. It uses a variable width encoding for storage.

All encodings except the 32-bit ones are variable width. The UTF-16 encoding is a variable width encoding where each code point (not character!, see below why) takes one or more 16-bit words.

This is because – as of Unicode version 2.0 in 1996 – a surrogate character mechanism was introduced to be able to have more than 64k code points.

The architecture of Unicode is completely different than traditional single-byte character sets or double-byte character sets.

In Unicode, there is a distinction between code points (the mapping of the character to an actual IDs), storage/encoding (in Windows now uses UTF-16LE which includes the past used UCS-2) and leaves visual representation (glyphs/renderings) to fonts.

Unicode has over a million code points, logically divided into 17 planes, of which the Basic Multi-lingual Plane has code points that can be encoded into one 16-bit word.

There is no font that can display all Unicode code points. By original aim, the first 256 Unicode code points are identical to the ISO 8859-1 character set (which is Windows-29591, not Windows-1252!) for which most fonts can display most characters.

I entity Unicode (Windows version)

By now, you probably grasp that Unicode is not an easy thing to get right. And that can be hard, hence people love and hate Unicode at the same time. Maybe I should get the T-Shirt :).

One thing that complexes things, is that Unicode allows for both composite characters and ready made composites. This is one form where different sequences can be equivalent, so there can be Unicode equivalence for which you need some knowledge on Unicode Normalization (be sure to read this StackOverflow question and this article by Michael Kaplan on Unicode Normalization).

There are many Unicode encodings, of which UTF-8 and UTF-16 are the most widely used (and are variable length). UTF-32 is fixed length. All 16-bit and 32-bit encodings can have big-endian and little-endian storage and can use a Byte Order Mark (BOM) to indicate their endinaness. Not all software uses BOMs, and there are BOMs for UTF-8 and other encodings as well (for UTF-8 it is not recommended to include a BOM).

When only parts your development environment supports Unicode strings, you need to be aware of which do and which don’t. For any interface boundary between those, you need to be aware of potential data loss, and need to decide how to cope with that.

For instance, does your database use Unicode or not for character storage? (For Microsoft SQL Server: do you use CHAR/VARCHAR or NCHAR/NVARCHARyou should aim for NVARCHAR, yes you really should, do not use text, ntext and image). What do you do while transferring Unicode and non-Unicode text to it? Ask the same questions for Web Services, configuration files, binary storage, message queueing and various other interfaces to the outside world.

The Windows API is almost exclusively Unicode (see this StackOverflow question for more details)

Delphi and Unicode

Let’s focus a bit on Delphi now, as that the migration towards Unicode at clients raised a few questions over the last couple of months.

One of the key questions is why there are no conversion tools that help you migrate your existing source code to fully embrace Unicode.

The short answer is: because you can’t automate the detection of intent in your codebase.

The longer answer starts with that there are tools that detect parts of your Delphi source that potentially has problems: the compiler hints, warnings and errors that brings your attention to spots that are fishy, are likely to fail, or are plain wrong.

Delphi uses the standard Windows storage format for Unicode text: UTF-16LE.

Next to that, Delphi supports conversion to and from UTF-8 en UTF-32 (in their various forms endianness).

External storage of text is best done as UTF-8 because it doesn’t have endianness, and because of easier exchange of text in ISO-8859-1.

Marco Cantu wrote a very nice whitepaper about Delphi and Unicode, and I did a Delphi Unicode talk at CodeRage 4 and posted a lot of Delphi Unicode links at StackOverflow.

A few extra notes on Delphi and Unicode:

With Delphi string types, stick to the UnicodeString (default string as of Delphi 2009) and AnsiString (default string until Delphi 2007) as their memory management is done by Delphi. WideString management is done by COM, so only use that when you really need to. Also avoid ShortString.

For any interfaces to the external world, you need to decide which ones to keep to generic string, Char, PChar and which ones to fix to AnsiChar/PAnsiChar/AnsiString(+ accompanying codepage) or fix at UnicodeChar/PUnicodeChar/UnicodeString.

Of course remnants from the past will catch up with you: if you have Technical Debt on the past where characters were bytes, and you abused Char/PChar/array-of-char/etc you need to fix that, and use the Byte/PByte/TByteArray/PByteArray. It can be costly to pay the accrued debt on that.

–jeroen

PS:

Posted in .NET, C#, Delphi, Development, EBCDIC, Encoding, ISO-8859, Software Development, Technical Debt, Unicode, UTF-8 | 2 Comments »

Shapecatcher.com: Unicode Character Recognition

Posted by jpluimers on 2012/03/26

There are so many Unicode code points, that finding back one based on how a glyph looks like is similar to finding the needle in a haystack.

ShapeCatcher.com to the rescue: it has a drawing box where you can draw the glyph, and a recognition engine that presents you with Unicode code points that have a similar glyph.

Draw something in the left box!

And let shapecatcher help you to find the most similar unicode characters!

–jeroen

via: Shapecatcher.com: Unicode Character Recognition.

Posted in Development, Encoding, LifeHacker, Power User, Software Development, Unicode | 1 Comment »

P/Invoke: usually you need CharSet.Auto (via: .NET Column: Calling Win32 DLLs in C# with P/Invoke)

Posted by jpluimers on 2012/02/28

I don’t do P/Invoke often, and somehow I have trouble remembering the value of CharSet to pass with DllImport.

In short, pass CharSet.Auto unless you P/Invoke a function that is specific to CharSet.Ansi or CharSet.Unicode. The default is CharSet.Ansi, which you usually don’t want:

when Char or String data is part of the equation, set the CharSet property to CharSet.Auto. This causes the CLR to use the appropriate character set based on the host OS. If you don’t explicitly set the CharSet property, then its default is CharSet.Ansi. This default is unfortunate because it negatively affects the performance of text parameter marshaling for interop calls made on Windows 2000, Windows XP, and Windows NT®.

The only time you should explicitly select a CharSet value of CharSet.Ansi or CharSet.Unicode, rather than going with CharSet.Auto, is when you are explicitly naming an exported function that is specific to one or the other of the two flavors of Win32 OS. An example of this is the ReadDirectoryChangesW API function, which exists only in Windows NT-based operating systems and supports Unicode only; in this case you should use CharSet.Unicode explicitly.

–jeroen

via: .NET Column: Calling Win32 DLLs in C# with P/Invoke.

Posted in .NET, Ansi, C#, Delphi, Development, Encoding, Prism, Software Development, Unicode | 3 Comments »

Registry Search-Replace tools – correcting the havoc after a data migration

Posted by jpluimers on 2011/10/07

A while ago, I found myself in the situation where at a corporate client the user profiles had moved on the LAN. Very understandable: it was one of the migrations towards DFS. They notified this in advance, so I made backups of everything (home drive and user profile) just to make sure.

The move indeed caused all sorts of havoc, because the data was moved, but the registry was only slightly modified.

Some of the errors I got were like these:

[Internet Explorer - Search Provider Default]
A program on your computer has corrupted your default search provider setting for Internet Explorer.

Internet Explorer has reset this setting to your original search provider, Live Search (search.live.com).

Internet Explorer will now open Search Settings, where you can change this setting or install more search providers.
[OK]

and

[Desktop]
\\old-server\old-share\user-id\Desktop refers to a location that is unavailable. It could be on a hard drive on this computer, or on a network. Check to make sure that the disk is properly inserted, or that you are connected to the Internet or your network, and then try again. If it still cannot be located, the information might have been moved to a different location.
[OK]

Below some of the ramblings on what I did to get everything working again, including registry searches when you are not allowed to run RegEdit, searching through text, and the places in the registry that had to change. Read the rest of this entry »

Posted in .NET, C#, Delphi, Development, Encoding, Power User, Software Development, Unicode | Leave a Comment »