
Archive for the ‘Unicode’ Category

.NET/C#: from Unicode to ASCII (yes, this is one-way): converting Diacritics to “regular” ASCII characters.

Posted by jpluimers on 2013/06/11

A while ago, I needed to export pure ASCII text from a .NET app.

An important step there is to convert the diacritics to “normal” ASCII characters. That turned out to be enough for this case.

This is the code I used; it is based on Extension Methods and this trick from Blair Conrad:

The approach uses String.Normalize to split the input string into constituent glyphs (basically separating the “base” characters from the diacritics) and then scans the result and retains only the base characters. It’s just a little complicated, but really you’re looking at a complicated problem.

Example code:

using System;
using System.Text;
using System.Globalization;

namespace StringToAsciiConsoleApplication
{
    class Program
    {
        static void Main(string[] args)
        {
            string unicode = "áìôüç";
            string ascii = unicode.ToAscii();
            Console.WriteLine("Unicode\t{0}", unicode);
            Console.WriteLine("ASCII\t{0}", ascii);
        }
    }

    public static class StringExtensions
    {
        public static string ToAscii(this string value)
        {
            return RemoveDiacritics(value);
        }

        // http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net
        private static string RemoveDiacritics(this string value)
        {
            // FormD decomposes each accented character into its base character
            // followed by one or more combining marks.
            string valueFormD = value.Normalize(NormalizationForm.FormD);
            StringBuilder stringBuilder = new StringBuilder();

            foreach (char item in valueFormD)
            {
                // Keep everything except the combining (non-spacing) marks.
                UnicodeCategory unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(item);
                if (unicodeCategory != UnicodeCategory.NonSpacingMark)
                {
                    stringBuilder.Append(item);
                }
            }

            // Recompose whatever is left; characters without a decomposition stay as they are.
            return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
        }
    }
}
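Removing the combining marks was enough for my case, but it does not guarantee pure ASCII: characters such as Ø or ß have no canonical decomposition, so they pass through unchanged. Below is a minimal sketch of a stricter variant, assuming that replacing every remaining non-ASCII character with a placeholder is acceptable. The class name AsciiFolding, the method name ToAsciiStrict and the '?' default are made up for this illustration and are not part of the code above.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class AsciiFolding
{
    // Hypothetical helper (not from the original post): strips combining marks
    // the same way RemoveDiacritics does, then replaces anything still outside ASCII.
    public static string ToAsciiStrict(string value, char replacement = '?')
    {
        if (value == null) throw new ArgumentNullException(nameof(value));

        var builder = new StringBuilder(value.Length);
        foreach (char item in value.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(item) == UnicodeCategory.NonSpacingMark)
            {
                continue; // drop the combining accent, keep the base character
            }

            // Characters without a canonical decomposition survive FormD unchanged,
            // so they need an explicit fallback to stay within 7-bit ASCII.
            builder.Append(item <= '\u007F' ? item : replacement);
        }

        return builder.ToString();
    }
}
```

Whether a placeholder, a transliteration table (for instance ß to ss) or an exception is the right fallback depends on where the exported text ends up.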

–jeroen

Posted in .NET, .NET 3.5, .NET 4.0, .NET 4.5, ASCII, C#, C# 3.0, C# 4.0, C# 5.0, Development, Encoding, Software Development, Unicode | Leave a Comment »

What programmers font (monospaced!) do you like best?

Posted by jpluimers on 2013/05/08

Lucida Console Sample (thanks Wikimedia!)

I’m looking for a better programmer’s font than the monospaced Lucida Console, mainly for use in Visual Studio, Delphi, the Windows console, Xcode and Eclipse. What I love about the Lucida Console design is the relatively large x-height combined with a small leading (often called “line height”). This combination gives very readable text and a lot of code lines in view. Lucida Console has two small drawbacks (see the second image at the right):

The capital O and digit 0 (zero) are very similar.
Some uppercase/lowercase character pairs look alike (because of the large x-height).

But since the font hasn’t been updated for a very long time, lots of Unicode code points that current fonts include are missing from Lucida Console (unless you buy the [Wayback] most recent version that has 666 characters from Fonts.com). Well, there are dozens of monospaced fonts around, so I wonder: which ones do you like? In the meantime, I’m going to do some experimenting with fonts mentioned in these lists:

A few fonts I’m considering (I only want scalable fonts, so raster .fon files are out):

[Wayback] Anonymous Pro
[Wayback] Crystal
[Wayback/Archive.is] Envy Code R preview #7 (scalable coding font)
[Wayback] Hyperfont
[Wayback] Inconsolata-dz
[Wayback] ProFontWindows
[Wayback] DejaVu

I tried Adobe Source Code Pro about half a year ago. That didn’t cut it: problems with italics in Delphi, and not enough lines per screen. [Wayback] New Open Source monospaced font from Adobe: Source Code Pro.

–jeroen

Posted in .NET, Adobe Source Code Pro, Apple, Delphi, Delphi 2007, Delphi XE3, Development, Encoding, Font, Lucida Console, Mac, Mac OS X / OS X / MacOS, Power User, Programmers Font, Software Development, Typography, Unicode, Visual Studio 11, Visual Studio 2005, Visual Studio 2008, Visual Studio 2010, Visual Studio and tools, Windows, Windows 7, Windows 8, Windows Server 2008 R2, Windows XP, xCode/Mac/iPad/iPhone/iOS/cocoa | 43 Comments »

Wrong error message @heldenvannu (registration: Pakketten | HELDEN VAN . NU)

Posted by jpluimers on 2013/03/29

If you enter postal code “1060 NP” under connection and “1170 AB” under billing, you get these incorrect error messages:

Organisatie postcode is ongeldig (Organization postal code is invalid)
Facturatie gegevens postcode is ongeldig (Billing details postal code is invalid)

A bit odd, because ever since postal codes were introduced in the Netherlands in 1978, a postal code contains 1 space, and there are 2 spaces between the postal code and the city.
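A validator that accepts the space is not hard to write. This is a minimal sketch, assuming the usual shape of a Dutch postal code: four digits without a leading zero, one optional space, and two letters. The class and method names are invented for this example, and I have no idea what the site's actual validation code looks like.

```csharp
using System.Text.RegularExpressions;

public static class DutchPostalCode
{
    // Assumed format: four digits (no leading zero), one optional space, two letters.
    private static readonly Regex Pattern =
        new Regex(@"^[1-9][0-9]{3} ?[A-Za-z]{2}$", RegexOptions.CultureInvariant);

    public static bool IsValid(string input)
    {
        return input != null && Pattern.IsMatch(input.Trim());
    }

    // Returns the code in the "1060 NP" form with exactly one space,
    // or null when the input does not look like a postal code.
    public static string Normalize(string input)
    {
        if (!IsValid(input)) return null;

        string compact = input.Trim().Replace(" ", "").ToUpperInvariant();
        return compact.Substring(0, 4) + " " + compact.Substring(4);
    }
}
```

Accepting both “1060 NP” and “1060NP” on input and storing one normalized form avoids blaming the user for typing the code the way it is printed on every envelope.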

Diaereses go wrong too: postal code 1060 NP belongs to the street Pyreneeën in Amsterdam, but at Heldenvan.nu it becomes this Mojibake:

Read the rest of this entry »
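A common way for this kind of Mojibake to arise is that text is written as UTF-8 and read back as a single-byte encoding. The sketch below reproduces that mechanism under the assumption that ISO-8859-1 is the wrong decoder; which encodings the site actually mixed up is a guess on my part, and the class and method names are made up for this illustration.

```csharp
using System.Text;

public static class MojibakeDemo
{
    // Encodes the text as UTF-8, then (wrongly) decodes those bytes as ISO-8859-1.
    public static string MisdecodeUtf8AsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    }

    // Reverses the mistake; this only works while no bytes were lost along the way.
    public static string Repair(string mojibake)
    {
        byte[] rawBytes = Encoding.GetEncoding("iso-8859-1").GetBytes(mojibake);
        return Encoding.UTF8.GetString(rawBytes);
    }
}
```

With this assumption, the ë in Pyreneeën (two bytes in UTF-8, C3 AB) comes out as the two characters Ã and «.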

Posted in Development, Encoding, Mojibake, Opinions, Power User, Software Development, Unicode | Leave a Comment »

Delphi “Variant Records”, a few notes

Posted by jpluimers on 2013/03/14

Variant Records are a feature that has been in the Pascal language since Standard Pascal.

A cool page for historic perspective is R3R: Pascal Features in Popular Compilers. Hopefully someone will update it to more modern versions of the mentioned compilers.

There is not much official documentation on the Delphi side on this, so below are some parts of a case I used for a project that started in 1997 and is still in use today. Read the rest of this entry »

Posted in APPC, AS/400 / iSeries / System i, ASCII, COBOL, Communications Development, Conference Topics, Conferences, CPI-C, Delphi, Delphi 1, Delphi 2005, Delphi 2006, Delphi 2007, Delphi 2009, Delphi 2010, Delphi 3, Delphi 4, Delphi 5, Delphi 6, Delphi 7, Delphi 8, Delphi XE, Delphi XE2, Delphi XE3, Development, Encoding, Event, HIS Host Integration Services, Internet protocol suite, MQ Message Queueing/Queuing, SNA, Software Development, TCP, Unicode, UTF-8, WebSphere MQ | 9 Comments »

Delphi “type types”: similar types but not the same type identity, some examples.

Posted by jpluimers on 2013/03/12

Few people know about a Delphi language feature that has been present since Delphi 1: prepending the type definition with a type keyword to give the type a new identity.

Each time I use it, I have to do some browsing for the consequences, and this time I wrote down some notes and created a small example program (source is also below).

This time I needed it when writing class wrappers on top of the Delphi bindings for WebSphere MQ.

WebSphere MQ has Queues where you can put and get messages. It also has Queue Managers to which you connect, and that provide queuing services and manage queues.

Both Queues and Queue Managers have names that can be up to 48 (single byte) characters long.
Those names mean totally different things, so though they have similar data types, they have a different identity.

The same holds for 20 byte character arrays (they can be used as names for ChannelName, ShortConnectionName and MCAName). The 264 byte character array is so far used for ConnectionName only.

Distinguishing those types: That’s what “type types” in Delphi are all about. Read the rest of this entry »
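For readers coming from the C# side: C# has no direct counterpart to this Delphi feature, but a rough analogue of the idea is to wrap the same underlying data in distinct types. The sketch below is only that analogue, not the Delphi code from this post. All names are hypothetical, and the length check counts .NET chars as a stand-in for the 48 single-byte characters.

```csharp
using System;

// Hypothetical wrapper types: same underlying data, different identity.
public readonly struct QueueName
{
    public const int MaxLength = 48;
    public string Value { get; }

    public QueueName(string value)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));
        if (value.Length > MaxLength)
            throw new ArgumentException("Name is longer than 48 characters.", nameof(value));
        Value = value;
    }

    public override string ToString() => Value;
}

public readonly struct QueueManagerName
{
    public const int MaxLength = 48;
    public string Value { get; }

    public QueueManagerName(string value)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));
        if (value.Length > MaxLength)
            throw new ArgumentException("Name is longer than 48 characters.", nameof(value));
        Value = value;
    }

    public override string ToString() => Value;
}

public static class MqSketch
{
    // Swapping the two arguments is a compile-time error, because the
    // types are not assignment compatible even though both wrap a string.
    public static string Describe(QueueManagerName manager, QueueName queue)
    {
        return queue.Value + "@" + manager.Value;
    }
}
```

The point is the same as with the Delphi feature: the compiler, not the programmer, keeps queue names and queue manager names apart.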

Posted in CP437/OEM 437/PC-8, Delphi, Delphi 1, Delphi 2005, Delphi 2006, Delphi 2007, Delphi 2009, Delphi 2010, Delphi 3, Delphi 4, Delphi 5, Delphi 6, Delphi 7, Delphi 8, Delphi x64, Delphi XE, Delphi XE2, Delphi XE3, Development, Encoding, Shift JIS, Software Development, Unicode, UTF-8, UTF8 | 1 Comment »

Link clearance: fonts, localization, languages, internationalization, PostScript, and more

Posted by jpluimers on 2013/03/01

A few links I came across recently:

internationalization – Country codes list – C# – Stack Overflow.
c# – Converting country codes in .NET – Stack Overflow.
CLDR – Unicode Common Locale Data Repository.
Common Locale Data Repository – Wikipedia, the free encyclopedia.
localization – C#: get letters of alphabet for scandinavian language? – Stack Overflow.
– The exemplar characters in CLDR, type=”index” are what you are looking for. unicode.org/reports/tr35/#Character_Elements – Steven R. Loomis May 14 ’10 at 18:15
– There’s a good set of reference documents at the Evertype website
Evertype: The Alphabets of Europe.
TZ4Net library. (TimeZone 4.NET: uses Unicode CLDR v.2.1 based mapping between Win32 Id and Olson name)
Evertype.
Software I used in the 90s for digitizing fonts to produce PostScript and Type 1 fonts
– Fontographer – Wikipedia, the free encyclopedia
– Ikarus (typography software) – Wikipedia, the free encyclopedia.
Visualogik Technology & Design / Hans van Leeuwen.

–jeroen

Posted in About, Development, Encoding, EPS/PostScript, Font, internatiolanization (i18n) and localization (l10), Personal, Power User, Programmers Font, Software Development, Unicode | Leave a Comment »

START: Start a program, even if it is not on the PATH ideal to start various versions of apps from DOS

Posted by jpluimers on 2013/01/29

A while ago, I had to adapt a DOS app that used one specific version of Excel to do some batch processing so it would support multiple versions of Excel on multiple versions of Windows.

One of the big drawbacks of DOS applications is that the command lines you can use are even shorter than those of Windows applications. Depending on how you call an application, the limits are:

32767 characters when you call CreateProcess (the limit is UNICODE_STRING structure)
8192 characters when you use Cmd.exe
2048+32 characters when you use ShellExecute or ShellExecuteEx (the limit is INTERNET_MAX_URL_LENGTH)
260 characters for the Windows 95 family of products (the limit is MAX_PATH)
127 characters for DOS (the upper limit of a signed byte), often excluding the length of “cmd.exe” or “command.com”

This is how the DOS app written in Clipper (those were the days, it was even linked with Blinker :) started Excel:

c:\progra~1\micros~2\office11\excel.exe parameters
01234567890123456789012345678901234567890
          1         2         3         4

The above depends on 8.3 short file names, which in turn depend on the order in which similarly named files and directories have been created.

The trick around this, and around different locations/versions of an application, is to use START to find the right version of Excel.

The reason it works is because in addition to PATH, it checks the App Paths portions in the registry in this order to find an executable: Read the rest of this entry »

Posted in Batch-Files, Development, Encoding, Power User, Scripting, Software Development, Unicode, Windows, Windows 7, Windows 8, Windows Server 2000, Windows Server 2003, Windows Server 2003 R2, Windows Server 2008, Windows Server 2008 R2, Windows Vista, Windows XP | Leave a Comment »

Delphi and C# compiler oddities

Posted by jpluimers on 2013/01/08

When developing in multiple languages, it sometimes is funny to see how they differ in compiler oddities.

Below are a few examples on const.

Basically, in C# you cannot go from a char const to a string const, and chars are a special kind of int.

In Delphi you cannot go from a string to a char. Read the rest of this entry »
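To give an idea of the C# half of that statement, here is a minimal sketch that is not the code from the full post; the class and constant names are invented for this illustration.

```csharp
public static class ConstOddities
{
    public const char Star = '*';

    // Allowed: a constant char converts implicitly to int (chars behave like small integers).
    public const int StarCode = Star;

    // Not allowed: there is no implicit conversion from char to string.
    // public const string StarText = Star;

    // A run-time workaround gives up on const and uses static readonly instead.
    public static readonly string StarText = Star.ToString();
}
```

So the char constant happily turns into the number 42, but getting a string out of it needs a method call, which a const cannot contain.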

Posted in .NET, ASCII, C#, C# 1.0, C# 2.0, C# 3.0, C# 4.0, C# 5.0, Delphi, Delphi 2009, Delphi 2010, Delphi XE, Delphi XE2, Development, Encoding, Software Development, Unicode | Leave a Comment »

.NET/C# duh moment of the day: “A char can be implicitly converted to ushort, int, uint, long, ulong, float, double, or decimal (not the other way around; implicit != implicit)”

Posted by jpluimers on 2012/11/20

A while ago I had a “duh” moment while calling a method that had many overloads, and one of the overloads was using int, not the char I’d expect.

The result was that a default value for that char was used, and my parameter was interpreted as a (very small) buffer size. I only found out something went wrong when writing unit tests around my code.

The culprit is this C# char feature (other implicit type conversions nicely summarized by Muhammad Javed):

A char can be implicitly converted to ushort, int, uint, long, ulong, float, double, or decimal. However, there are no implicit conversions from other types to the char type.

Switching between various development environments, I totally forgot this is the case in languages with C and Java ancestry, but not in languages with VB and Delphi ancestry (C/C++ do numeric promotions of char to int and Java widens 2-byte char to 4-byte int; Delphi and VB.NET don’t).

I’m not the only one who was confused, so Eric Lippert wrote a nice blog post on it in 2009: Why does char convert implicitly to ushort but not vice versa? – Fabulous Adventures In Coding – Site Home – MSDN Blogs.

Basically, it is the C ancestry: a char is an integral type always known to contain an integer value representing a Unicode character. The opposite is not true: an integer type is not always representing a Unicode character.

Lesson learned: if you have a large number of overloads (either writing them or using them) watch for mixing char and int parameters.
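The pitfall is easy to reproduce. The sketch below uses two hypothetical Describe overloads plus StringBuilder, which has a constructor taking an int capacity but none taking a char. It is an illustration of the mechanism, not the method I was actually calling.

```csharp
using System.Text;

public static class CharOverloadPitfall
{
    // Two hypothetical overloads: there is deliberately no Describe(char).
    public static string Describe(int bufferSize)
    {
        return "int:" + bufferSize;
    }

    public static string Describe(string text)
    {
        return "string:" + text;
    }

    public static string CallWithChar()
    {
        // Compiles without a warning: '*' (U+002A) is implicitly widened to the int 42.
        return Describe('*');
    }

    public static int BuilderLength()
    {
        // Picks StringBuilder(int capacity): an empty builder with room for 42 chars,
        // not a builder that contains an asterisk.
        var builder = new StringBuilder('*');
        return builder.Length;
    }
}
```

The other direction needs an explicit cast such as (char)42, which is why only one side of this mix-up ever gets caught by the compiler.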

Note that overload resolution can be difficult enough (C# 3 had breaking changes and C# 4 had breaking changes too, and those are only for C#), so don’t make it more difficult than it should be (:

Below are a few examples in C# and VB.NET, with their IL disassemblies, to illustrate their differences based on asterisk (*) and space ( ). They also show that not all implicit conversions are created equal: the conversion to Decimal is done at run time, the rest at compile time.

Note that the order of the methods is alphabetic, but the calls are in order of the type and size of the numeric types (integral types, then floating point types, then decimal).

A few interesting observations:

The C# compiler implicitly converts char with all calls except for decimal, where an implicit conversion at run time is used:
L_004c: call valuetype [mscorlib]System.Decimal [mscorlib]System.Decimal::op_Implicit(char)
L_0051: call void CharIntCompatibilityCSharp.Program::writeLineDecimal(valuetype [mscorlib]System.Decimal)
Same for implicit conversion of byte to the other types, though here the C# and VB.NET compilers generate slightly different code for run-time conversion.
C# uses an implicit conversion:
L_00af: ldloc.1
L_00b0: call valuetype [mscorlib]System.Decimal [mscorlib]System.Decimal::op_Implicit(uint8)
L_00b5: call void CharIntCompatibilityCSharp.Program::writeLineDecimal(valuetype [mscorlib]System.Decimal)
VB.NET calls a constructor:
L_006e: ldloc.1
L_006f: newobj instance void [mscorlib]System.Decimal::.ctor(int32)
L_0075: call void CharIntCompatibilityVB.Program::writeLineDecimal(valuetype [mscorlib]System.Decimal)

Here is the example code: Read the rest of this entry »

Posted in .NET, Agile, Algorithms, C#, C# 1.0, C# 2.0, C# 3.0, C# 4.0, C# 5.0, C++, Delphi, Development, Encoding, Floating point handling, Java, Software Development, Unicode, Unit Testing, VB.NET | 1 Comment »

XML and HTML escapes

Posted by jpluimers on 2012/07/26

While reviewing some client’s code, I noticed they were generating and parsing XML and HTML by hand (do not ever do that yourself!).

Before refactoring this into something that uses libraries that properly understand XML and HTML, I needed to assess some of the risks.

A major risk is to get the escaping (and unescaping) of XML and HTML right.

Time to finally organize some of my links on escaping HTML and XML that I had in my favourites list.

The starting point is the List of XML and HTML character entity references on Wikipedia. It is readable, complete and lists both kinds of escapes.

XML escapes

The official W3C text that describes XML escaping is hard to read.

There are only 5 predefined XML entities for characters that can (some must) be escaped. This table is derived from the Wikipedia article.

Name	Character	Unicode code point (decimal)	Standard	When to escape (from the XML 1.0 standard)	Description
quot	"	U+0022 (34)	XML 1.0	To allow attribute values to contain both single and double quotes	double quotation mark
amp	&	U+0026 (38)	XML 1.0	Outside comment, a processing instruction, or a CDATA section	ampersand
apos	'	U+0027 (39)	XML 1.0	To allow attribute values to contain both single and double quotes	apostrophe (= apostrophe-quote)
lt	<	U+003C (60)	XML 1.0	Outside comment, a processing instruction, or a CDATA section	less-than sign
gt	>	U+003E (62)	XML 1.0	In content, when that string is not marking the end of a CDATA section	greater-than sign
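To make the table concrete, this is a minimal sketch that escapes exactly those five characters. It is meant for assessing what hand-written code should have done, not as a replacement for a proper XML library; the class name XmlEscapeSketch is made up for this example.

```csharp
using System.Text;

public static class XmlEscapeSketch
{
    // Replaces the five characters that have predefined XML entities.
    // Working character by character avoids double-escaping the '&'
    // that the entity references themselves start with.
    public static string Escape(string text)
    {
        if (text == null) return null;

        var builder = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            switch (c)
            {
                case '"': builder.Append("&quot;"); break;
                case '&': builder.Append("&amp;"); break;
                case '\'': builder.Append("&apos;"); break;
                case '<': builder.Append("&lt;"); break;
                case '>': builder.Append("&gt;"); break;
                default: builder.Append(c); break;
            }
        }

        return builder.ToString();
    }
}
```

In .NET, System.Security.SecurityElement.Escape covers the same five characters, and XmlWriter or LINQ to XML handle escaping for you when you build the document through them.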

HTML escapes

Read the rest of this entry »

Posted in " quot, & amp, > gt, < lt, ' apos, ASCII, Development, Encoding, HTML, Power User, SocialMedia, Software Development, Unicode, Web Development, WordPress, XML, XML escapes, XML/XSD | 1 Comment »


The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests
