The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests


.NET/C#: from Unicode to ASCII (yes, this is one-way): converting Diacritics to “regular” ASCII characters.

Posted by jpluimers on 2013/06/11

A while ago, I needed to export pure ASCII text from a .NET app.

An important step is converting characters with diacritics to their “normal” ASCII base characters. For this case, that step alone turned out to be enough.

This is the code I used, which is based on extension methods and this trick from Blair Conrad:

The approach uses String.Normalize to split the input string into constituent glyphs (basically separating the “base” characters from the diacritics) and then scans the result and retains only the base characters. It’s just a little complicated, but really you’re looking at a complicated problem.
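To make the decomposition step concrete, here is a minimal sketch (not part of the code I used; the names `NormalizationDemo` and `Describe` are made up for illustration). It lists the UTF-16 code units a string consists of after `Normalize(NormalizationForm.FormD)`, together with the Unicode category of each one:

```csharp
using System;
using System.Globalization;
using System.Text;

public static class NormalizationDemo
{
    // Returns one entry per UTF-16 code unit of the FormD-decomposed input,
    // formatted as "U+<hex code> <UnicodeCategory name>".
    public static string[] Describe(string value)
    {
        string decomposed = value.Normalize(NormalizationForm.FormD);
        string[] lines = new string[decomposed.Length];

        for (int i = 0; i < decomposed.Length; i++)
        {
            char c = decomposed[i];
            lines[i] = string.Format("U+{0:X4} {1}", (int)c, CharUnicodeInfo.GetUnicodeCategory(c));
        }

        return lines;
    }
}
```

For "é" (U+00E9), `Describe` returns `U+0065 LowercaseLetter` followed by `U+0301 NonSpacingMark`. The second entry is the combining accent, which is exactly the kind of character the filter drops.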

Example code:

using System;
using System.Text;
using System.Globalization;

namespace StringToAsciiConsoleApplication
{
    class Program
    {
        static void Main(string[] args)
        {
            string unicode = "áìôüç";
            string ascii = unicode.ToAscii();
            Console.WriteLine("Unicode\t{0}", unicode);
            Console.WriteLine("ASCII\t{0}", ascii);
        }
    }

    public static class StringExtensions
    {
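        // Note: this only strips diacritics. Characters without a canonical
        // decomposition (for example "ø" or "ß") pass through unchanged.
        public static string ToAscii(this string value)
        {
            return RemoveDiacritics(value);
        }

        // http://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net
        private static string RemoveDiacritics(this string value)
        {
            if (value == null)
            {
                throw new ArgumentNullException("value");
            }

            // FormD (canonical decomposition) splits "á" into "a" plus U+0301 (combining acute accent).
            string valueFormD = value.Normalize(NormalizationForm.FormD);
            StringBuilder stringBuilder = new StringBuilder(valueFormD.Length);

            foreach (char item in valueFormD)
            {
                // Combining accents are in the NonSpacingMark category: skip those, keep everything else.
                UnicodeCategory unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(item);
                if (unicodeCategory != UnicodeCategory.NonSpacingMark)
                {
                    stringBuilder.Append(item);
                }
            }

            // Recompose whatever is left into the composed form (FormC).
            return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
        }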
    }
}
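Stripping diacritics was enough for my case, but it does not guarantee pure ASCII output: characters such as "ø" (U+00F8) have no canonical decomposition, so they survive the normalization step. A minimal sketch of how you could check for that, or force the issue, is below. The names `AsciiCheck`, `IsPureAscii` and `ReplaceNonAscii` are hypothetical and not part of the code I used. It relies on `Encoding.ASCII` replacing characters it cannot encode with "?".

```csharp
using System;
using System.Text;

public static class AsciiCheck
{
    // True when every UTF-16 code unit is in the 7-bit ASCII range (U+0000..U+007F).
    public static bool IsPureAscii(string value)
    {
        foreach (char c in value)
        {
            if (c > '\u007F')
            {
                return false;
            }
        }
        return true;
    }

    // Round-trips the string through Encoding.ASCII, which substitutes "?"
    // for anything it cannot represent. This is lossy on purpose.
    public static string ReplaceNonAscii(string value)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(value);
        return Encoding.ASCII.GetString(bytes);
    }
}
```

With "Smørrebrød" as input, `IsPureAscii` returns `false` and `ReplaceNonAscii` returns "Sm?rrebr?d". Whether a "?" is acceptable, or whether you would rather map "ø" to "o" with your own lookup table, depends on what the export is for.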

–jeroen
