The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

  • My badges

  • Twitter Updates

  • My Flickr Stream

  • Pages

  • All categories

  • Enter your email address to subscribe to this blog and receive notifications of new posts by email.

    Join 1,641 other followers

.NET/C#: from Unicode to ASCII (yes, this is one-way): converting Diacritics to “regular” ASCII characters.

Posted by jpluimers on 2013/06/11

A while ago, I needed to export pure ASCII text from a .NET app.

An important step there is to convert the diacritics to “normal” ASCII characters. That turned out to be enough for this case.

This is the code I used which is based on Extension Methods and this trick from Blair Conrad:

The approach uses String.Normalize to split the input string into constituent glyphs (basically separating the “base” characters from the diacritics) and then scans the result and retains only the base characters. It’s just a little complicated, but really you’re looking at a complicated problem.

Example code:

using System;
using System.Text;
using System.Globalization;

namespace StringToAsciiConsoleApplication
{
    class Program
    {
        static void Main(string[] args)
        {
            string unicode = "áìôüç";
            string ascii = unicode.ToAscii();
            Console.WriteLine("Unicode\t{0}", unicode);
            Console.WriteLine("ASCII\t{0}", ascii);
        }
    }

    public static class StringExtensions
    {
        public static string ToAscii(this string value)
        {
            return RemoveDiacritics(value);
        }

        // http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net
        private static string RemoveDiacritics(this string value)
        {
            string valueFormD = value.Normalize(NormalizationForm.FormD);
            StringBuilder stringBuilder = new StringBuilder();

            foreach (System.Char item in valueFormD)
            {
                UnicodeCategory unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(item);
                if (unicodeCategory != UnicodeCategory.NonSpacingMark)
                {
                    stringBuilder.Append(item);
                }
            }

            return (stringBuilder.ToString().Normalize(NormalizationForm.FormC));
        }
    }
}

–jeroen

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

 
%d bloggers like this: