Archive for the ‘Unicode’ Category
Posted by jpluimers on 2013/06/11
A while ago, I needed to export pure ASCII text from a .NET app.
An important step there is to convert the diacritics to “normal” ASCII characters. That turned out to be enough for this case.
This is the code I used which is based on Extension Methods and this trick from Blair Conrad:
The approach uses String.Normalize to split the input string into constituent glyphs (basically separating the “base” characters from the diacritics) and then scans the result and retains only the base characters. It’s just a little complicated, but really you’re looking at a complicated problem.
Example code:
using System;
using System.Text;
using System.Globalization;
namespace StringToAsciiConsoleApplication
{
class Program
{
static void Main(string[] args)
{
string unicode = "áìôüç";
string ascii = unicode.ToAscii();
Console.WriteLine("Unicode\t{0}", unicode);
Console.WriteLine("ASCII\t{0}", ascii);
}
}
public static class StringExtensions
{
public static string ToAscii(this string value)
{
return RemoveDiacritics(value);
}
// http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net
private static string RemoveDiacritics(this string value)
{
string valueFormD = value.Normalize(NormalizationForm.FormD);
StringBuilder stringBuilder = new StringBuilder();
foreach (System.Char item in valueFormD)
{
UnicodeCategory unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(item);
if (unicodeCategory != UnicodeCategory.NonSpacingMark)
{
stringBuilder.Append(item);
}
}
return (stringBuilder.ToString().Normalize(NormalizationForm.FormC));
}
}
}
–jeroen
Posted in .NET, .NET 3.5, .NET 4.0, .NET 4.5, ASCII, C#, C# 3.0, C# 4.0, C# 5.0, Development, Encoding, Software Development, Unicode | Leave a Comment »
Posted by jpluimers on 2013/05/08

Lucida Console Sample (thanks Wikimedia!)
I’m in search to see if there is a better programmers font than the monospaced Lucida Console mainly to be used in Visual Studio, Delphi, the Windows console, Xcode and Eclipse. What I love about Lucida Console design is the relatively large x-height combined with a small leading (often called “line height”). This combines very readable text, and a lot of code lines in view. Lucida has two small drawbacks, see the second image at the right:
- The captial O and digit 0 (zero) are very similar.
- Some uppercase/lowercase character pairs are alike (because of the large x-height)
But, since the font hasn’t been updated for a very long time, lots of Unicode code points that are now in current fonts, are missing from Lucida Console (unless you buy the [Wayback] most recent version that has 666 characters from Fonts.com) Well, there are dozens of monospaced fonts around, so I wonder: which ones do you like? In the mean while, I’m going to do some experimenting with fonts mentioned in these lists:
A few fonts I’m considering (I only want scalable fonts, so raster .fon files are out):
I have tried Adobe Source Code Pro about half a year ago. That didn’t cut it: problem with italics in Delphi, and not enough lines per screen. [Wayback] New Open Source monospaced font from Adobe: Source Code Pro.
–jeroen
Posted in .NET, Adobe Source Code Pro, Apple, Delphi, Delphi 2007, Delphi XE3, Development, Encoding, Font, Lucida Console, Mac, Mac OS X / OS X / MacOS, Power User, Programmers Font, Software Development, Typography, Unicode, Visual Studio 11, Visual Studio 2005, Visual Studio 2008, Visual Studio 2010, Visual Studio and tools, Windows, Windows 7, Windows 8, Windows Server 2008 R2, Windows XP, xCode/Mac/iPad/iPhone/iOS/cocoa | 43 Comments »
Posted by jpluimers on 2013/03/29
Als je postcode “1060 NP” invult bij aansluiting en “1170 AB” bij facturatie, dan krijg je deze onterechte foutmeldingen:
- Organisatie postcode is ongeldig
- Facturatie gegevens postcode is ongeldig
Beetje vreemd, want al sinds de introductie van postcodes in Nederland in 1978 zit in een postcode 1 spatie, en tussen de postcode en de woonplaats 2 spaties.
Ook trema‘s gaan mis: bij postcode 1060 NP hoort de straat Pyreneeën in Amsterdam, maar bij Heldenvan.nu wordt het deze Mojibake:
Read the rest of this entry »
Posted in Development, Encoding, Mojibake, Opinions, Power User, Software Development, Unicode | Leave a Comment »
Posted by jpluimers on 2013/03/14
Variant Records are a feature that has been in the Pascal language since Standard Pascal.
A cool page for historic perspective is R3R: Pascal Features in Popular Compilers, hopefully someone will update it to more modern versions of the mentioned compilers.
There is not much official documentation on the Delphi side on this, so below some parts of a case I used for a project that started in 1997 and is still in use to day. Read the rest of this entry »
Posted in APPC, AS/400 / iSeries / System i, ASCII, COBOL, Communications Development, Conference Topics, Conferences, CPI-C, Delphi, Delphi 1, Delphi 2005, Delphi 2006, Delphi 2007, Delphi 2009, Delphi 2010, Delphi 3, Delphi 4, Delphi 5, Delphi 6, Delphi 7, Delphi 8, Delphi XE, Delphi XE2, Delphi XE3, Development, Encoding, Event, HIS Host Integration Services, Internet protocol suite, MQ Message Queueing/Queuing, SNA, Software Development, TCP, Unicode, UTF-8, WebSphere MQ | 9 Comments »
Posted by jpluimers on 2013/03/12
Few people know about a Delphi language feature that has been present since Delphi 1: prepending the type definition with a type keyword to make the type getting a new identity.
Each time I use it, I have to do some browsing for the consequences, and this time I wrote down some notes and created a small example program (source is also below).
This time I needed it when writing class wrappers on top of the Delphi bindings for WebSphere MQ.
WebSphere MQ has Queues where you can put and get messages. It also has Queue Managers to which you connect, and that provide queuing services and manages queues.
Both Queues and Queue Managers have names that can be up to 48 (single byte) characters long.
Those names mean totally different things, so though the have similar data types, they have a different identity.
The same holds for 20 byte character arrays (they can be used as names for ChannelName, ShortConnectionName and MCAName). The 264 byte character array is so far used for ConnectionName only.
Distinguishing those types: That’s what “type types” in Delphi are all about. Read the rest of this entry »
Posted in CP437/OEM 437/PC-8, Delphi, Delphi 1, Delphi 2005, Delphi 2006, Delphi 2007, Delphi 2009, Delphi 2010, Delphi 3, Delphi 4, Delphi 5, Delphi 6, Delphi 7, Delphi 8, Delphi x64, Delphi XE, Delphi XE2, Delphi XE3, Development, Encoding, Shift JIS, Software Development, Unicode, UTF-8, UTF8 | 1 Comment »
Posted by jpluimers on 2013/03/01
A few links I came across recently:
–jeroen
Posted in About, Development, Encoding, EPS/PostScript, Font, internatiolanization (i18n) and localization (l10), Personal, Power User, Programmers Font, Software Development, Unicode | Leave a Comment »
Posted by jpluimers on 2013/01/29
A while ago, I had to adapt a DOS app that used one specific version of Excel to do some batch processing so it would support multiple versions of Excel on multiple versions of Windows.
One of the big drawbacks of DOS applications is that the command lines you can use are even shorter than Windows applications, which depending you how you call an application are:
This is how the DOS app written in Clipper (those were the days, it was even linked with Blinker :) started Excel:
c:\progra~1\micros~2\office11\excel.exe parameters
01234567890123456789012345678901234567890
1 2 3 4
The above depends on 8.3 short file names that in turn depend on the order in which similar named files and directories have been created.
The trick around this, and around different locations/versions of an application, is to use START to find the right version of Excel.
The reason it works is because in addition to PATH, it checks the App Paths portions in the registry in this order to find an executable: Read the rest of this entry »
Posted in Batch-Files, Development, Encoding, Power User, Scripting, Software Development, Unicode, Windows, Windows 7, Windows 8, Windows Server 2000, Windows Server 2003, Windows Server 2003 R2, Windows Server 2008, Windows Server 2008 R2, Windows Vista, Windows XP | Leave a Comment »
Posted by jpluimers on 2013/01/08
When developing in multiple languages, it sometimes is funny to see how they differ in compiler oddities.
Below are a few on const examples.
Basically, in C# you cannot go from a char const to a string const, and chars are a special kind of int.
In Delphi you cannot go from a string to a char. Read the rest of this entry »
Posted in .NET, ASCII, C#, C# 1.0, C# 2.0, C# 3.0, C# 4.0, C# 5.0, Delphi, Delphi 2009, Delphi 2010, Delphi XE, Delphi XE2, Development, Encoding, Software Development, Unicode | Leave a Comment »
Posted by jpluimers on 2012/11/20
A while ago I had a “duh” moment while calling a method that had many overloads, and one of the overloads was using int, not the char I’d expect.
The result was that a default value for that char was used, and my parameter was interpreted as a (very small) buffer size. I only found out something went wrong when writing unit tests around my code.
The culprit is this C# char feature (other implicit type conversions nicely summarized by Muhammad Javed):
A char can be implicitly converted to ushort, int, uint, long, ulong, float, double, or decimal. However, there are no implicit conversions from other types to the char type.
Switching between various development environments, I totally forgot this is the case in languages based on C and Java ancestry. But not in VB and Delphi ancestry (C/C++ do numeric promotions of char to int and Java widens 2-byte char to 4-byte int; Delphi and VB.net don’t).
I’m not the only one who was confused, so Eric Lippert wrote a nice blog post on it in 2009: Why does char convert implicitly to ushort but not vice versa? – Fabulous Adventures In Coding – Site Home – MSDN Blogs.
Basically, it is the C ancestry: a char is an integral type always known to contain an integer value representing a Unicode character. The opposite is not true: an integer type is not always representing a Unicode character.
Lesson learned: if you have a large number of overloads (either writing them or using them) watch for mixing char and int parameters.
Note that overload resolution can be diffucult enough (C# 3 had breaking changes and C# 4 had breaking changes too, and those are only for C#), so don’t make it more difficult than it should be (:
Below a few examples in C# and VB and their IL disassemblies to illustrate their differnces based on asterisk (*) and space ( ) that also show that not all implicits are created equal: Decimal is done at run-time, the rest at compile time.
Note that the order of the methods is alphabetic, but the calls are in order of the type and size of the numeric types (integral types, then floating point types, then decimal).
A few interesting observations:
- The C# compiler implicitly converts char with all calls except for decimal, where an implicit conversion at run time is used:
L_004c: call valuetype [mscorlib]System.Decimal [mscorlib]System.Decimal::op_Implicit(char)
L_0051: call void CharIntCompatibilityCSharp.Program::writeLineDecimal(valuetype [mscorlib]System.Decimal)
- Same for implicit conversion of byte to the other types, though here the C# and VB.NET compilers generate slightly different code for run-time conversion.
C# uses an implicit conversion:
L_00af: ldloc.1
L_00b0: call valuetype [mscorlib]System.Decimal [mscorlib]System.Decimal::op_Implicit(uint8)
L_00b5: call void CharIntCompatibilityCSharp.Program::writeLineDecimal(valuetype [mscorlib]System.Decimal)
VB.NET calls a constructor:
L_006e: ldloc.1
L_006f: newobj instance void [mscorlib]System.Decimal::.ctor(int32)
L_0075: call void CharIntCompatibilityVB.Program::writeLineDecimal(valuetype [mscorlib]System.Decimal)
Here is the example code: Read the rest of this entry »
Posted in .NET, Agile, Algorithms, C#, C# 1.0, C# 2.0, C# 3.0, C# 4.0, C# 5.0, C++, Delphi, Development, Encoding, Floating point handling, Java, Software Development, Unicode, Unit Testing, VB.NET | 1 Comment »
Posted by jpluimers on 2012/07/26
While reviewing some client’s code, I noticed they were generating and parsing XML and HTML by hand (do not ever do that yourself!).
Before refactoring this into something that uses libraries that properly understand XML and HTML, I needed to assess some of the risks.
A major risk is to get the escaping (and unescaping) of XML and HTML right.
Time to finally organize some of my links on escaping HTML and XML that I had in my favourites list.
The starting point is the List of XML and HTML character entity references on Wikipedia. It is readable, complete and lists both kinds of escapes.
XML escapes
The official W3C text that describes XML escaping is hard to read.
There are only 5 predefined XML entities for characters that can (some must) be escaped. This table is derived from the Wikipedia article.
HTML escapes
Read the rest of this entry »
Posted in " quot, & amp, > gt, < lt, ' apos, ASCII, Development, Encoding, HTML, Power User, SocialMedia, Software Development, Unicode, Web Development, WordPress, XML, XML escapes, XML/XSD | 1 Comment »