The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

  • My badges

  • Twitter Updates

  • My Flickr Stream

  • Pages

  • All categories

  • Enter your email address to subscribe to this blog and receive notifications of new posts by email.

    Join 1,861 other subscribers

Archive for the ‘Unicode’ Category

Some words on Unicode in Windows (Delphi, .NET, APIs, etc)

Posted by jpluimers on 2012/04/05

O'Reilly book "Unicode Explained: Internationalize Documents, Programs, and Web Sites"

O'Reilly book "Unicode Explained: Internationalize Documents, Programs, and Web Sites"

Withe the growing integration between systems, and the mismatch between those that support Unicode and that do not, I find that a lot of organisations lack basic Unicode knowledge.

So lets put down a few things, that helps as a primer and gets some confusion out of the way.

Please read the article on Unicode by Joel on Software, and the book Unicode Explained. The book is from 1996, and still very valid.

Unicode

Unicode started in the late 80s of last century as a 16-bit character model.

Somehow lots of people still thing Unicode is a 16-bit double-byte character set. It is not. It uses a variable width encoding for storage.

All encodings except the 32-bit ones are variable width. The UTF-16 encoding is a variable width encoding where each code point (not character!, see below why) takes one or more 16-bit words.

This is because – as of Unicode version 2.0 in 1996 – a surrogate character mechanism was introduced to be able to have more than 64k code points.

The architecture of Unicode is completely different than traditional single-byte character sets or double-byte character sets.

In Unicode, there is a distinction between code points (the mapping of the character to an actual IDs), storage/encoding (in Windows now uses UTF-16LE which includes the past used UCS-2) and leaves visual representation (glyphs/renderings) to fonts.

Unicode has over a million code points, logically divided into 17 planes, of which the Basic Multi-lingual Plane has code points that can be encoded into one 16-bit word.

There is no font that can display all Unicode code points. By original aim, the first 256 Unicode code points are identical to the ISO 8859-1 character set (which is Windows-29591, not Windows-1252!) for which most fonts can display most characters.

I entity Unicode (Windows version)

By now, you probably grasp that Unicode is not an easy thing to get right. And that can be hard, hence people love and hate Unicode at the same time. Maybe I should get the T-Shirt :).

One thing that complexes things, is that Unicode allows for both composite characters and ready made composites. This is one form where different sequences can be equivalent, so there can be Unicode equivalence for which you need some knowledge on Unicode Normalization (be sure to read this StackOverflow question and this article by Michael Kaplan on Unicode Normalization).

There are many Unicode encodings, of which UTF-8 and UTF-16 are the most widely used (and are variable length). UTF-32 is fixed length. All 16-bit and 32-bit encodings can have big-endian and little-endian storage and can use a Byte Order Mark (BOM) to indicate their endinaness. Not all software uses BOMs, and there are BOMs for UTF-8 and other encodings as well (for UTF-8 it is not recommended to include a BOM).

When only parts your development environment supports Unicode strings, you need to be aware of which do and which don’t. For any interface boundary between those, you need to be aware of potential data loss, and need to decide how to cope with that.

For instance, does your database use Unicode or not for character storage? (For Microsoft SQL Server: do you use CHAR/VARCHAR or NCHAR/NVARCHARyou should aim for NVARCHAR, yes you really should, do not use text, ntext and image). What do you do while transferring Unicode and non-Unicode text to it? Ask the same questions for Web Services, configuration files, binary storage, message queueing and various other interfaces to the outside world.

The Windows API is almost exclusively Unicode (see this StackOverflow question for more details)

Delphi and Unicode

Let’s focus a bit on Delphi now, as that the migration towards Unicode at clients raised a few questions over the last couple of months.

One of the key questions is why there are no conversion tools that help you migrate your existing source code to fully embrace Unicode.

The short answer is: because you can’t automate the detection of intent in your codebase.

The longer answer starts with that there are tools that detect parts of your Delphi source that potentially has problems: the compiler hints, warnings and errors that brings your attention to spots that are fishy, are likely to fail, or are plain wrong.

Delphi uses the standard Windows storage format for Unicode text: UTF-16LE.

Next to that, Delphi supports conversion to and from UTF-8 en UTF-32 (in their various forms endianness).

External storage of text is best done as UTF-8 because it doesn’t have endianness, and because of easier exchange of text in ISO-8859-1.

Marco Cantu wrote a very nice whitepaper about Delphi and Unicode, and I did a Delphi Unicode talk at CodeRage 4 and posted a lot of Delphi Unicode links at StackOverflow.

A few extra notes on Delphi and Unicode:

With Delphi string types, stick to the UnicodeString (default string as of Delphi 2009) and AnsiString (default string until Delphi 2007) as their memory management is done by Delphi. WideString management is done by COM, so only use that when you really need to. Also avoid ShortString.

For any interfaces to the external world, you need to decide which ones to keep to generic string, Char, PChar and which ones to fix to AnsiChar/PAnsiChar/AnsiString(+ accompanying codepage) or fix at UnicodeChar/PUnicodeChar/UnicodeString.

Of course remnants from the past will catch up with you: if you have Technical Debt on the past where characters were bytes, and you abused Char/PChar/array-of-char/etc you need to fix that, and use the Byte/PByte/TByteArray/PByteArray. It can be costly to pay the accrued debt on that.

–jeroen

PS:

Posted in .NET, C#, Delphi, Development, EBCDIC, Encoding, ISO-8859, Software Development, Technical Debt, Unicode, UTF-8 | 2 Comments »

Shapecatcher.com: Unicode Character Recognition

Posted by jpluimers on 2012/03/26

There are so many Unicode code points, that finding back one based on how a glyph looks like is similar to finding the needle in a haystack.

ShapeCatcher.com to the rescue: it has a drawing box where you can draw the glyph, and a recognition engine that presents you with Unicode code points that have a similar glyph.

Draw something in the left box!

And let shapecatcher help you to find the most similar unicode characters!

–jeroen

via: Shapecatcher.com: Unicode Character Recognition.

Posted in Development, Encoding, LifeHacker, Power User, Software Development, Unicode | 1 Comment »

P/Invoke: usually you need CharSet.Auto (via: .NET Column: Calling Win32 DLLs in C# with P/Invoke)

Posted by jpluimers on 2012/02/28

I don’t do P/Invoke often, and somehow I have trouble remembering the value of CharSet to pass with DllImport.

In short, pass CharSet.Auto unless you P/Invoke a function that is specific to CharSet.Ansi or CharSet.Unicode. The default is CharSet.Ansi, which you usually don’t want:

when Char or String data is part of the equation, set the CharSet property to CharSet.Auto. This causes the CLR to use the appropriate character set based on the host OS. If you don’t explicitly set the CharSet property, then its default is CharSet.Ansi. This default is unfortunate because it negatively affects the performance of text parameter marshaling for interop calls made on Windows 2000, Windows XP, and Windows NT®.

The only time you should explicitly select a CharSet value of CharSet.Ansi or CharSet.Unicode, rather than going with CharSet.Auto, is when you are explicitly naming an exported function that is specific to one or the other of the two flavors of Win32 OS. An example of this is the ReadDirectoryChangesW API function, which exists only in Windows NT-based operating systems and supports Unicode only; in this case you should use CharSet.Unicode explicitly.

–jeroen

via: .NET Column: Calling Win32 DLLs in C# with P/Invoke.

Posted in .NET, Ansi, C#, Delphi, Development, Encoding, Prism, Software Development, Unicode | 3 Comments »

Registry Search-Replace tools – correcting the havoc after a data migration

Posted by jpluimers on 2011/10/07

A while ago, I found myself in the situation where at a corporate client the user profiles had moved on the LAN. Very understandable: it was one of the migrations towards DFS. They notified this in advance, so I made backups of everything (home drive and user profile) just to make sure.

The move indeed caused all sorts of havoc, because the data was moved, but the registry was only slightly modified.

Some of the errors I got were like these:

[Internet Explorer - Search Provider Default]
A program on your computer has corrupted your default search provider setting for Internet Explorer.

Internet Explorer has reset this setting to your original search provider, Live Search (search.live.com).

Internet Explorer will now open Search Settings, where you can change this setting or install more search providers.
[OK]

and

[Desktop]
\\old-server\old-share\user-id\Desktop refers to a location that is unavailable. It could be on a hard drive on this computer, or on a network. Check to make sure that the disk is properly inserted, or that you are connected to the Internet or your network, and then try again. If it still cannot be located, the information might have been moved to a different location.
[OK]

Below some of the ramblings on what I did to get everything working again, including registry searches when you are not allowed to run RegEdit, searching through text, and the places in the registry that had to change. Read the rest of this entry »

Posted in .NET, C#, Delphi, Development, Encoding, Power User, Software Development, Unicode | Leave a Comment »

Long pathname support: Watch for MAX_PATH (was: Windows pathname max length problem « Dropbox Forums)

Posted by jpluimers on 2010/06/16

When you want to support long pathnames on Windows, you need to watch for the MAX_PATH limitation.
Some tools, like RoboCopy (developed in C++), Beyond Compare (developed in Delphi) and others get it right.
Getting it right does not depend in your development environment: it is all about calling the right API’s with the right parameters.

Let me take a tool – in this case DropBox, though other tools suffer from the same problem – and investigate how they should do it.

Even though DropBox is cross platform, the Windows version of DropBox limits itself to synchronizing files having less than 260 characters in their path.
This is a big drawback: it is so 20th century having a limitation like this. Read the rest of this entry »

Posted in .NET, Delphi, Development, Encoding, Opinions, Software Development, Unicode | 6 Comments »

Why SizeOf for character arrays is evil: stackoverflow question “Why does this code fail in D2010, but not D7?”

Posted by jpluimers on 2010/05/11

This Why does this code fail in D2010, but not D7 question on stackoverflow once again shows that SizeOf on character arrays usualy is evil.

My point in this posting is that you should always try to write code that is semantically correct.

By writing semantically correct code, you have a much better chance of surviving a major change like a Unicode port.

The code below is semantically wrong: it worked in Delphi 7 by accident, not by design:
Like many Windows API functions, GetTempPath expects the first parameter (called nBufferLength) number of characters, not the number of bytes. Read the rest of this entry »

Posted in Delphi, Delphi 2005, Delphi 2006, Delphi 2007, Delphi 2009, Delphi 2010, Delphi 3, Delphi 4, Delphi 5, Delphi 6, Delphi 7, Delphi XE, Delphi XE2, Delphi XE3, Development, Encoding, ISO-8859, Software Development, Unicode | Leave a Comment »

.NET/C# – TEE filter that also runs on Windows (XP) Embedded – update

Posted by jpluimers on 2010/04/15

Last week, I posted a C# implementation of the tee filter from Sterling W. “Chip” Camden.

Since then I have modified it slightly.
Not because the implementation is bad, but because some pieces of software play dirty when saving their redirected output.

One of those applications is SubInAcl, otherwise a great tool for showing and modifying ACL and Ownership information of Windows NT objects (files, registry entries, etc).

However, when redirecing output or piping it, it writes a zero byte after each byte of text.
I’m not sure why: it might try to face some Unicode output, or just be buggy.

The new sourcecode is below. You can also download the project and binary as tee.C#.7z (you need the freeware 7zip compression tool to decompress this).
Read the rest of this entry »

Posted in .NET, ASCII, C#, C# 2.0, CommandLine, Development, Encoding, Software Development, Unicode, Visual Studio and tools | Leave a Comment »

*nix – Mastering the VI editor

Posted by jpluimers on 2010/04/13

Every once in a while I need to do some text editing in a *nix environment that has a minimum toolset installed.

Which means: use VI.

VI is a versatile text editor from the early *nix days, but it is not straight forward to use.
Since I don’t use VI often enough, I tend to forget some of the commands.

Time to share my favourite VI link: Mastering the VI editor edit: link rot, now it is at Mastering the VI editor.

The link points to the basic stuff, but the page contains most of what you ever want to know about VI.

–jeroen

PS:
If possible, I install the JOE text editor on systems where I am admin.
JOE uses WordStar like key bindings, and supports UTF-8. Talking about “something old, something new” :-)

Posted in *nix, CommandLine, Development, Encoding, Power User, Software Development, Unicode, UTF-8, UTF8, vi | 1 Comment »

.NET/C# – TEE filter that also runs on Windows (XP) Embedded

Posted by jpluimers on 2010/04/07

The usage of tee - image courtesey of Wikipedia

The tee command stems from a *nix background.
It is a command-line filter that allows you to deviate a stream from the regular stdout/stdin redirected pipeline into a file.

Recently, I needed this in a Windows Embedded Stadard (a.k.a. WES) system for logging purposes.
This way, a post-install-script (similar to the Windows Post-Install Wizard, but command-line based) could log to both the console and a log-file at the same time.

WES is the successor Windows XP Embedded (a.k.a. XPe), which is a modularized version of Windows XP.
Se WES usually means that you don’t have the luxury of everything that Windows XP has.
This in turn means that you need to be careful when selecting external tools: a lot of stuff that works on plain Windows XP won’t work.

There are various Win32 ports of tee available.
This time however, I needed a Unicode implementation, so I searched for a .NET based implementation.

Windows PowerShell 2.0 does contain a tee implementation, but:

  1. We don’t have the luxury of having PowerShell in our WES image
  2. PowerShell tee first writes the contents to e temporary file, which interferes with how we build this WES image.

Luckily Sterling W. “Chip” Camden started with such a .NET implementation of tee – in Visual C++ – back in 2005.
Though his TEE page indicates it is based on .NET 1.1, his current implementation is done in Visual Studio 2008 using C++.

Now that is a problem for the targeted WES image: that image is based on .NET 2.0.
But when using Visual C++ in .NET, you need additional run-time libraries (for instance the ones for Visual C++ 2005, or the ones for Visual C++ 2008).

If you don’t have these installed, tee.exe does not start, and you get error messages like this on the command-line:

K:\Post-Install-Scripts>tee
The system cannot execute the specified program.

and entries like this in the Eventlog:

Event Type: Error
Event Source: SideBySide
Event Category: None
Event ID: 59
Date: 01/04/2010
Time: 19:09:22
User: N/A
Computer: MYMACHINE
Description:
Generate Activation Context failed for K:\Post-Install-Scripts\tee.exe. Reference error message: The operation completed successfully.

The odd thing in this error message is “The operation completed successfully”: it didn’t :-)

Anyway: translating the underlying C++ code to C# is pretty straightforward, so:

The C# implementation

I did change a few things, none of them major:

  • replaced some for statements with foreach
  • renamed a few variables to make them more readable
  • added using statements for stdin and stdout
  • added try…finally for cleaning up the binary writers
  • moved the logic for duplicate filenames into a separate method, and moved the moment of checking to the point of adding the filename to the filenames
  • moved the help into a separate method
  • added support for the -h (same behaviour as –help or /?) command-line argument

The implementation is pretty straightforward:

  • Perform parameter parsing
  • Catch all input bytes from the stdin stream
  • Copy those bytes to both the stdout stream, and the files specified on the command-line
  • Send errors to the stderr stream
  • Do the proper initialization and cleanup

This is the C# code:

using System;
using System.IO;
using System.Collections.Generic;
// Sends standard input to standard output and to all files in command line.
// C# implementation april 4th, 2010 by Jeroen Wiert Pluimers (https://wiert.wordpress.com),
// based on tee Chip Camden, Camden Software Consulting, November 2005
// 	... and Anonymous Cowards everywhere!
//
// TEE [-a | --append] [-i | --ignore] [--help | /?] [-f] [file1] [...]
//    Example:
// 	tee --append file0.txt -f --help file2.txt
//    will append to file0.txt, --help, and file2.txt
//
// -a | --append	Appends files instead of overwriting
// 			  (setting is per tee instance)
// -i | --ignore	Ignore cancel Ctrl+C keypress: see UnixUtils tee
// /? | --help		Displays this message and immediately quits
// -f			Stop recognizing flags, force all following filenames literally
//
// Duplicate filenames are quietly ignored.
// Press Ctrl+Z (End of File character) then Enter to abort.
namespace tee
{
    class Program
    {
        static void help()
        {
            Console.Error.WriteLine("Sends standard input to standard output and to all files in command line.");
            Console.Error.WriteLine("C# implementation april 4th, 2010 by Jeroen Wiert Pluimers (https://wiert.wordpress.com),");
            Console.Error.WriteLine("Chip Camden, Camden Software Consulting, November 2005");
            Console.Error.WriteLine("	... and Anonymous Cowards everywhere!");
            Console.Error.WriteLine("http://www.camdensoftware.com");
            Console.Error.WriteLine("http://chipstips.com/?tag=cpptee");
            Console.Error.WriteLine("");
            Console.Error.WriteLine("tee [-a | --append] [-i | --ignore] [--help | /?] [-f] [file1] [...]");
            Console.Error.WriteLine("   Example:");
            Console.Error.WriteLine(" tee --append file0.txt -f --help file2.txt");
            Console.Error.WriteLine("   will append to file0.txt, --help, and file2.txt");
            Console.Error.WriteLine("");
            Console.Error.WriteLine("-a | --append    Appends files instead of overwriting");
            Console.Error.WriteLine("                 (setting is per tee instance)");
            Console.Error.WriteLine("-i | --ignore    Ignore cancel Ctrl+C keypress: see UnixUtils tee");
            Console.Error.WriteLine("/? | --help      Displays this message and immediately quits");
            Console.Error.WriteLine("-f               Stop recognizing flags, force all following filenames literally");
            Console.Error.WriteLine("");
            Console.Error.WriteLine("Duplicate filenames are quietly ignored.");
            Console.Error.WriteLine("Press Ctrl+Z (End of File character) then Enter to abort.");
        }

        static void OnCancelKeyPressed(Object sender, ConsoleCancelEventArgs args)
        {
            // Set the Cancel property to true to prevent the process from
            // terminating.
            args.Cancel = true;
        }

        static List<String> filenames = new List<String>();

        static void addFilename(string value)
        {
            if (-1 == filenames.IndexOf(value))
                filenames.Add(value);
        }

        static int Main(string[] args)
        {
            try
            {
                bool appendToFiles = false;
                bool stopInterpretingFlags = false;
                bool ignoreCtrlC = false;

                foreach (string arg in args)
                {
                    //Since we're already parsing.... might as well check for flags:
                    if (stopInterpretingFlags)  //Stop interpreting flags, assume is filename
                    {
                        addFilename(arg);
                    }
                    else if (arg.Equals("/?") || arg.Equals("-h") || arg.Equals("--help"))
                    {
                        help();
                        return 1; //Quit immediately
                    }
                    else if (arg.Equals("-a") || arg.Equals("--append"))
                    {
                        appendToFiles = true;
                    }
                    else if (arg.Equals("-i") || arg.Equals("--ignore"))
                    {
                        ignoreCtrlC = true;
                    }
                    else if (arg.Equals("-f"))
                    {
                        stopInterpretingFlags = true;
                    }
                    else
                    {	//If it isn't any of the above, it's a filename
                        addFilename(arg);
                    }
                    //Add more flags as necessary, just remember to SKIP adding them to the file processing stream!
                }

                if (ignoreCtrlC) //Implement the Ctrl+C fix selectively (mirror UnixUtils tee behavior)
                    Console.CancelKeyPress += new ConsoleCancelEventHandler(OnCancelKeyPressed);

                List<BinaryWriter> binaryWriters = new List<BinaryWriter>(filenames.Count); //Add only as many streams as there are distinct files
                try
                {
                    foreach (String filename in filenames)
                    {
                        binaryWriters.Add(new BinaryWriter(appendToFiles ?
                            File.AppendText(filename).BaseStream :
                            File.Create(filename)));  // Open the files specified as arguments
                    }
                    using (BinaryReader stdin = new BinaryReader(Console.OpenStandardInput()))
                    {
                        using (BinaryWriter stdout = new BinaryWriter(Console.OpenStandardOutput()))
                        {
                            Byte b;
                            while (true)
                            {
                                try
                                {
                                    b = stdin.ReadByte();  // Read standard in
                                }
                                catch (EndOfStreamException)
                                {
                                    break;
                                }
                                // The actual tee:
                                stdout.Write(b); // Write standard out
                                foreach (BinaryWriter binaryWriter in binaryWriters)
                                {
                                    binaryWriter.Write(b); // Write to each file
                                }
                            }
                        }
                    }
                }
                finally
                {
                    foreach (BinaryWriter binaryWriter in binaryWriters)
                    {
                        binaryWriter.Flush();  // Flush and close each file
                        binaryWriter.Close();
                    }
                }
            }
            catch (Exception ex)
            {
                Console.Error.WriteLine(String.Concat("tee: ", ex.Message));  // Send error messages to stderr
            }

            return 0;
        }
    }
}

Some alternatives that might (or might not) support unicode:

http://www.commandline.co.uk/mtee/
http://unxutils.sourceforge.net/ (cannot be downloaded any more – pitty, as they were pretty good)

–jeroen

Update: 201009041030 – Syntax highlighting didn’t work, so changed
sourcecode language=”C#
into
sourcecode language=”csharp

Posted in .NET, C#, C# 2.0, CommandLine, Development, Encoding, Power User, Software Development, Unicode, UTF-8, Visual Studio and tools, XP-embedded | 13 Comments »

Web means Unicode

Posted by jpluimers on 2010/02/12

Google published an interesting graph generated from their internal data based on their indexed web pages.Encodings on the web

A quick summary of popular encodings based on the graph:

  1. Unicode – almost 50% and rapidly rising
  2. ASCII20% and falling
  3. Western European* – 20% and falling
  4. Rest – 10% and falling

Conclusion: if you do something with the web, make sure you support Unicode.

When you are using Delphi, and need help with transitioning to Unicode: contact me.

–jeroen

* Western European encodings: Windows-1252, ISO-8859-1 and ISO-8859-15.

Reference: Official Google Blog: Unicode nearing 50% of the web.

Edit: 20100212T1500

Some people mentioned (either in the comments or otherwise) that a some sites pretend they emit Unicode, but in fact they don’t.
This doesn’t relieve you from making sure you support Unicode: Don’t pretend you support Unicode, but do it properly!

Examples of bad support for Unicode are not limited to the visible web, but also applications talking to the web, and to webservices (one of my own experiences is explained in StUF – receiving data from a provider where UTF-8 is in fact ISO-8859: it shows an example where a vendor does Unicode support really wrong).

So: when you support Unicode, support it properly.

–jeroen

Posted in .NET, ASP.NET, C#, Database Development, Delphi, Development, Encoding, Firebird, IIS, InterBase, ISO-8859, ISO8859, Prism, SOAP/WebServices, Software Development, SQL Server, Unicode, UTF-8, UTF8, Visual Studio and tools, Web Development | 7 Comments »