The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests


Archive for the ‘Encoding’ Category

How can I get the default code page for a locale? – The Old New Thing

Posted by jpluimers on 2017/06/20

Ask GetLocaleInfo (example function GetAnsiCodePageForLocale included): How can I get the default code page for a locale? – The Old New Thing

Posted in Development, Encoding, i18n internationalization and L10 Localization, Software Development, Windows-1252 | 2 Comments »

Some notes on stripping NULL characters and BOMs from files

Posted by jpluimers on 2017/05/31

A while ago I bumped into applications that write alternating UTF-16 and UTF-8 to files without checking what type of encoding the files were using.

So here are some notes to at least save some of the contents.

TODO: figure out how to strip the BOM.
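A minimal sketch of one way to do the stripping in Python (the language is an assumption, and `strip_boms_and_nuls` is a hypothetical helper name). It removes UTF-8 and UTF-16 BOM byte sequences wherever they occur, then drops the NUL bytes that UTF-16 inserts next to ASCII characters. This is lossy salvage: only text in the ASCII range survives cleanly, and a BOM-like byte pair that happens to occur inside real UTF-16 data would be removed as well.

```python
import codecs

# Byte order marks that can end up in the middle of a file when writers
# alternate between UTF-8 and UTF-16.
BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def strip_boms_and_nuls(data: bytes) -> bytes:
    """Remove BOM sequences and NUL bytes from raw file contents."""
    for bom in BOMS:
        data = data.replace(bom, b"")
    # For ASCII text, UTF-16 stores each character as the ASCII byte plus
    # a NUL byte, so dropping the NULs leaves the ASCII bytes behind.
    return data.replace(b"\x00", b"")


# A file that mixes a UTF-16 LE chunk with a UTF-8 chunk.
mixed = (
    codecs.BOM_UTF16_LE + "Hi\n".encode("utf-16-le")
    + codecs.BOM_UTF8 + b"ok\n"
)
cleaned = strip_boms_and_nuls(mixed)
```

To use it on a file, read the file in binary mode, pass the bytes through the helper, and write the result to a new file rather than overwriting the original.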

–jeroen

Posted in Development, Encoding, Software Development, UTF-16, UTF-8, UTF16, UTF8 | Leave a Comment »

Dark corners of Unicode / fuzzy notepad

Posted by jpluimers on 2017/04/20

You think you know Unicode? Think again, then read Dark corners of Unicode / fuzzy notepad.

On basics, sorting, comparison, decomposition, composition, width, whitespace, encoding, emoji, interesting code planes and dark corners. Lots of dark corners.

–jeroen

via: Kristian Köhntopp

Posted in Development, Encoding, Software Development, Unicode | Leave a Comment »

Encoding is hard… so how did the single quote become a circumflexed a followed by Euro sign and trade mark?

Posted by jpluimers on 2016/10/04

A while ago (more than a year, in fact), I posted Encoding is hard… to G+ with the picture below.

ftfy (fixes text for you) fixes it, but:

How did the single quote become “â€™”?

Because of a common “beautification” in many Office suites (Microsoft and Open alike), the single quote was a special one: the Unicode character ‘RIGHT SINGLE QUOTATION MARK’ (U+2019), which UTF-8 encodes as 0xE2 0x80 0x99. When those three bytes are read as Windows-1252, each byte becomes its own character: â, € and ™.
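The round trip can be reproduced with a few lines of Python (a sketch; the variable names are only illustrative). It encodes U+2019 as UTF-8, decodes the resulting bytes as Windows-1252 to get the garbled text, and then reverses the mistake.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, the "smart" quote Office suites insert.
quote = "\u2019"

# Correct UTF-8 encoding: three bytes.
utf8_bytes = quote.encode("utf-8")

# Misreading those bytes as Windows-1252 yields three separate characters.
garbled = utf8_bytes.decode("cp1252")

# Undo the mistake: back to the original bytes, then decode them as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")

print(utf8_bytes.hex(" "), garbled, repaired)
```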


Posted in Development, Encoding, ISO-8859, ISO8859, Software Development, Unicode, UTF-8, UTF8, Windows-1252 | Leave a Comment »

installing the UTF-8 encoding ftfy (fixes text for you) – via version 3.0 | Luminoso Blog

Posted by jpluimers on 2016/09/06

Simple if you know it:

pip install ftfy

That installs it as a command, which is a lot easier than using it from GitHub at https://github.com/LuminosoInsight/python-ftfy

It knows how to solve the encoding issues in ÃƒÆ’ƒâ€šÃ‚ the future of publishing at W3C.

It didn’t solve my non-Unicode encoding issue: “v3/43/4r” -> “v¾¾r” -> “vóór”.

That was caused by an infamous Western Latin character set confusion, in this case ISO-8859/Windows- versus CP850/CP858 (so no Unicode involved at all, nor CP437, as it doesn’t have ¾).
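A small Python sketch of that confusion, assuming the text was written as Windows-1252 and read back as CP850 (the exact pair involved is an assumption; ISO-8859-1 and CP858 give the same result for these characters). In Windows-1252 the letter ó is byte 0xF3, and CP850 maps that same byte to ¾.

```python
original = "vóór"

# Written with a Windows/ISO-8859 style code page: ó becomes byte 0xF3.
raw = original.encode("cp1252")

# Read back with the DOS code page CP850: byte 0xF3 is shown as ¾.
garbled = raw.decode("cp850")

# Reverse the mix-up to recover the original text.
repaired = garbled.encode("cp850").decode("cp1252")

# CP437 cannot be the culprit: it has no ¾ character at all.
try:
    "¾".encode("cp437")
    cp437_has_three_quarters = True
except UnicodeEncodeError:
    cp437_has_three_quarters = False

print(garbled, repaired, cp437_has_three_quarters)
```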

So I put in a suggestion for ftfy to support finding the above.

–jeroen


Posted in Development, Encoding, Software Development, Unicode, UTF-8, UTF8 | 4 Comments »

 