The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

Archive for the ‘Encoding’ Category

When someone writes UTF-8 and UTF-16 strings to the same file in binary format without converting between them…

Posted by jpluimers on 2017/06/21

A while ago, I had to fix some stuff in an application that would write – using a binary mechanism – UTF-8 and UTF-16 strings (part of it XML in various flavours) to the same byte stream without converting between the two encodings.

Some links that helped me investigate what was wrong, choose what encoding to use for storage and fix it:

–jeroen

Posted in Delphi, Delphi 10 Seattle, Delphi 10.1 Berlin (BigBen), Delphi XE8, Development, Encoding, Software Development, UTF-16, UTF-8, UTF16, UTF8, XML, XML/XSD | 3 Comments »

How can I get the default code page for a locale? – The Old New Thing

Posted by jpluimers on 2017/06/20

Ask GetLocaleInfo (example function GetAnsiCodePageForLocale included): [WayBack] How can I get the default code page for a locale? – The Old New Thing

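UINT GetAnsiCodePageForLocale(LCID lcid)
{
  UINT acp = 0; // 0 also means "no ANSI code page" for Unicode-only locales
  // With LOCALE_RETURN_NUMBER the result is written as a number,
  // but the buffer size is still expressed in TCHARs.
  int sizeInChars = sizeof(acp) / sizeof(TCHAR);
  if (GetLocaleInfo(lcid,
                    LOCALE_IDEFAULTANSICODEPAGE |
                    LOCALE_RETURN_NUMBER,
                    reinterpret_cast<LPTSTR>(&acp),
                    sizeInChars) != sizeInChars) {
    // Oops - something went wrong
    acp = 0; // do not return a partially written value
  }
  return acp;
}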

And even though you didn’t ask, you can use LOCALE_IDEFAULTCODEPAGE to get the OEM code page for a locale.

Bonus gotcha: There are a number of locales that are Unicode-only. If you call the GetLocaleInfo function and ask for their ANSI and OEM code pages, the answer is “Um, I don’t have one.” (You get zero back.)

Related:

–jeroen

Posted in Development, Encoding, internatiolanization (i18n) and localization (l10), Software Development, The Old New Thing, Windows Development, Windows-1252 | 2 Comments »

Some notes on stripping NULL characters and BOMs from files

Posted by jpluimers on 2017/05/31

A while ago I bumped into applications that write alternating UTF-16 and UTF-8 to files without checking what type of encoding the files were using.

So here are some notes to at least save some of the contents.

TODO: figure out how to strip the BOM.
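
One possible way to approach that TODO, as a minimal Python sketch rather than the tooling used at the time: remove the UTF-8 and UTF-16 byte order marks and all NUL bytes from the raw data. The helper name strip_boms_and_nuls is hypothetical, and the approach assumes the content is essentially ASCII: any UTF-16 character outside the ASCII range gets damaged, so this is a salvage operation, not a conversion.

```python
import codecs

# Byte order marks to look for (UTF-8, UTF-16 LE, UTF-16 BE).
BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def strip_boms_and_nuls(data: bytes) -> bytes:
    """Hypothetical helper: drop BOM byte sequences and NUL bytes.

    Assumption: the text is ASCII-only, so the zero high bytes of
    UTF-16 encoded characters can be thrown away safely.
    """
    for bom in BOMS:
        data = data.replace(bom, b"")
    return data.replace(b"\x00", b"")


# A file where a UTF-16 LE chunk is followed by a UTF-8 chunk.
mixed = (
    codecs.BOM_UTF16_LE + "abc".encode("utf-16-le")
    + codecs.BOM_UTF8 + b"def"
)
print(strip_boms_and_nuls(mixed))
```

Because the BOM byte sequences are removed wherever they occur, this should only be used on files that are known to contain text.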

–jeroen

Posted in Development, Encoding, Software Development, UTF-16, UTF-8, UTF16, UTF8 | Leave a Comment »

git encoding trouble: recursively removing a directory where git prints out a different name than it accepts

Posted by jpluimers on 2017/05/11

The story so far:

A few years back I put all my conference material in a GitHub repository: https://github.com/jpluimers/Conferences/. There were a lot of directories and files, so I didn’t pay much attention to the initial check-in list. The files had been part of copy.com syncing between Windows and Mac machines.

Often git on a Mac is a bit easier than on Windows (on a Mac you can install it with the xcode-select --install trick, which installs only the Command Line Tools without having to install the full Xcode [WayBack]).

I chose a Mac because it is closer to a Linux machine than Windows, so I expected no encoding trouble (as git has a Linux origin: it “was created by Linus Torvalds in 2005 for development of the Linux kernel”).

Boy, was I wrong:

Recently I cloned the repository in a different place and found out a few strange things:

  1. Directories with accented characters had been duplicated, for instance in https://github.com/jpluimers/Conferences/tree/master/2011
    1. …/EKON15-Renaissance-Hotel-D%FCsseldorf
    2. …/EKON15-Renaissance-Hotel-Düsseldorf
  2. Beyond Compare would show the same content
  3. After a check-out git would not understand the %FC encoded directory name (%FC is the ISO 8859-1 encoding for ü and \374 is the octal representation of 0xFC [WayBack]) and a git status would show stuff like this:
    • Untracked files:
        (use "git add ..." to include in what will be committed)
      
          EKON15-Renaissance-Hotel-D%FCsseldorf/

      or

      deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-Debugging/BO-EKON15-Delphi-XE2-Debugging.pdf"
  4. A git rm -r --cached call [WayBack] would not work, as both these would fail:
    • $ git rm -r --cached EKON15-Renaissance-Hotel-D%FCsseldorf
      fatal: pathspec 'EKON15-Renaissance-Hotel-D%FCsseldorf' did not match any files
      

      and

      $ git rm -r --cached "EKON15-Renaissance-Hotel-D\374sseldorf"
      fatal: pathspec 'EKON15-Renaissance-Hotel-D\374sseldorf' did not match any files
      

So git could:

  • detect the directories and files
  • display the names of the detected directories and files
  • not translate back the specified names into directories and files

All of this was with:

$ git --version
git version 1.9.5 (Apple Git-50.3)

This is how I fixed it

First I created an alias:

alias git-config="echo global: ; git config --list --global ; echo local: ; git config --list --local ; echo system: ; git config --list --system"

That allowed me to view the git settings on various levels in my system.

It revealed I didn’t have the core.precomposeunicode setting at all (valid values are true or false). I also read various stories about one or both being the correct value: osx – Git and the Umlaut problem on Mac OS X – Stack Overflow [WayBack].
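
To see why one directory name can show up in several byte forms, here is a small Python sketch. It compares the ISO 8859-1 bytes of “Düsseldorf” with its UTF-8 bytes in precomposed (NFC) and decomposed (NFD) form, the latter being what core.precomposeunicode is about on a Mac. The helper git_style_quote is hypothetical: it only imitates the octal escapes seen in the git status output and does not cover all of git’s quoting rules.

```python
import unicodedata

name = "D\u00fcsseldorf"  # "Düsseldorf" with a precomposed ü (U+00FC)

latin1 = name.encode("latin-1")                           # one byte for ü: 0xFC
nfc = unicodedata.normalize("NFC", name).encode("utf-8")  # ü as 0xC3 0xBC
nfd = unicodedata.normalize("NFD", name).encode("utf-8")  # u + combining diaeresis


def git_style_quote(raw: bytes) -> str:
    """Hypothetical helper: show non-ASCII bytes as octal escapes."""
    return "".join(
        chr(b) if 0x20 <= b < 0x7F else f"\\{b:03o}" for b in raw
    )


for label, raw in (("latin-1", latin1), ("utf-8 NFC", nfc), ("utf-8 NFD", nfd)):
    print(f"{label:10} {git_style_quote(raw)}")
```

The first form matches the \374 in the status output, which suggests the stored name is the single ISO 8859-1 byte rather than any UTF-8 form.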

 

 

–jeroen

Result of git status:


$ git status .
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
(use "git add/rm <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-Debugging/BO-EKON15-Delphi-XE2-Debugging.pdf"
deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-Unit-Testing/BO-EKON15-Delphi-XE2-Unit-Testing.pdf"
deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-Workshop/BO-EKON15-2011-XE2-Wokshop-0-sample-code.txt"
deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-Workshop/BO-EKON15-2011-XE2-Wokshop-1-Delphi-64bit.pdf"
deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-Workshop/BO-EKON15-2011-XE2-Wokshop-2-LiveBindings-DataBinding.pdf"
deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-Workshop/BO-EKON15-2011-XE2-Wokshop-3-Delphi-VCL Styles.pdf"
deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-Workshop/BO-EKON15-2011-XE2-Wokshop-4-Delphi-FireMonkey.pdf"
deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-Workshop/BO-EKON15-2011-XE2-Wokshop-5-Delphi-FireMonkey-xPlatform.pdf"
deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-and-XML/BO-EKON15-2011-Delphi-XE2-and-XML.pdf"
deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/XSL-transforming-XML/BO-EKON15-2011-XSL-transforming-XML.pdf"

 

Posted in Development, DVCS - Distributed Version Control, Encoding, git, ISO-8859, Software Development, Source Code Management | Leave a Comment »

Applications that scale badly on High-DPI Displays: How to Stop the Madness – via: SQLServerCentral

Posted by jpluimers on 2017/05/10

Many applications still scale badly on High-DPI displays: dialogs way too small, icons you need a microscope for, etc.

SSMS in High-DPI Displays: How to Stop the Madness – SQLServerCentral explains a great trick that works for many applications, for instance:

The trick comes down to enabling the PreferExternalManifest registry setting and then creating a manual manifest for the application that forces it to use “bitmap scaling”, basically by telling Windows that the application does not support “XP style DPI scaling”.

You name the manifest file after the exe and store it in the same directory as the exe.

After that, you also have to rename the exe to a temporary name and then back in order to refresh the cache.

A quote from the trick:

In Windows Vista, you had two possible ways of scaling applications: with the first one (the default) applications were instructed to scale their objects using the scaling factor imposed by the operating system. The results, depending on the quality of the application and the Windows version, could vary a lot. Some scaled correctly, some other look very similar to what we are seeing in SSMS, with some weird-looking GUIs. In Vista, this option was called “XP style DPI scaling”.

The second option, which you could activate by unchecking the “XP style” checkbox, involved drawing the graphical components of the GUI to an off-screen buffer and then drawing them back to the display, scaling the whole thing up to the screen resolution. This option is called “bitmap scaling” and the result is a perfectly laid out GUI.

In order to enable this option in Windows 10, you need to merge this key to your registry:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\SideBySide]
"PreferExternalManifest"=dword:00000001

Then, the application has to be decorated with a manifest file that instructs Windows to disable DPI scaling and enable bitmap scaling, by declaring the application as DPI unaware. The manifest file has to be saved in the same folder as the executable (ssms.exe) and its name must be ssms.exe.manifest. In this case, for SSMS 2014, the file path is “C:\Program Files (x86)\Microsoft SQL Server\120\Tools\Binn\ManagementStudio\Ssms.exe.manifest”.

Paste this text inside the manifest file and save it in UTF8 encoding:


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0" xmlns:asmv3="urn:schemas-microsoft-com:asm.v3">
<dependency>
<dependentAssembly>
<assemblyIdentity type="win32" name="Microsoft.Windows.Common-Controls" version="6.0.0.0" processorArchitecture="*" publicKeyToken="6595b64144ccf1df" language="*">
</assemblyIdentity>
</dependentAssembly>
</dependency>
<dependency>
<dependentAssembly>
<assemblyIdentity type="win32" name="Microsoft.VC90.CRT" version="9.0.21022.8" processorArchitecture="amd64" publicKeyToken="1fc8b3b9a1e18e3b">
</assemblyIdentity>
</dependentAssembly>
</dependency>
<trustInfo xmlns="urn:schemas-microsoft-com:asm.v3">
<security>
<requestedPrivileges>
<requestedExecutionLevel level="asInvoker" uiAccess="false"/>
</requestedPrivileges>
</security>
</trustInfo>
<asmv3:application>
<asmv3:windowsSettings xmlns="http://schemas.microsoft.com/SMI/2005/WindowsSettings">
<ms_windowsSettings:dpiAware xmlns:ms_windowsSettings="http://schemas.microsoft.com/SMI/2005/WindowsSettings">false</ms_windowsSettings:dpiAware>
</asmv3:windowsSettings>
</asmv3:application>
</assembly>

This “Vista style” bitmap scaling is very similar to what Apple is doing on his Retina displays, except that Apple uses a different font rendering algorithm that looks better when scaled up. If you use this technique in Windows, ClearType rendering is performed on the off-screen buffer before upscaling, so the final result might look a bit blurry. The amount of blurriness you will see depends on the scale factor you set in the control panel or in the settings app in Windows 10. Needless to say that exact pixel scaling looks better, so prefer 200% over 225% or 250% scale factors, because there is no such thing as “half pixel”.

–jeroen

Source: SSMS in High-DPI Displays: How to Stop the Madness – SQLServerCentral

Posted in Database Development, Delphi, Development, Eclipse IDE, Encoding, Java, Java Platform, Software Development, SQL, SQL Server, SSMS SQL Server Management Studio, UTF-8, UTF8 | 4 Comments »

ext3 – How to tell the language encoding of a filename on Linux? – Server Fault

Posted by jpluimers on 2017/05/08

From ext3 – How to tell the language encoding of a filename on Linux? – Server Fault [WayBack] I learned a few things:

  • filename encoding on Linux is undetermined: the file system just stores each name as an array of bytes and does not record which encoding was used
  • FTP and SFTP suffer from this as well (SFTP is based on SSH which now prefers UTF-8 [WayBack])

A good default is UTF-8, but it’s never guaranteed.

Two tools can help to determine the encoding of a filename:

  • convmv [WayBack] converts filenames from one encoding to another
  • chardet (Python) The Universal Character Encoding Detector
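
As a rough first check before reaching for those tools, the raw bytes of each name can be inspected with the Python standard library alone. This is only a sketch: classify_name and classify_directory are hypothetical helpers, and a name that happens to decode as UTF-8 is a strong hint, not proof.

```python
import os


def classify_name(raw: bytes) -> str:
    """Hypothetical helper: make an educated guess about a raw filename."""
    if raw.isascii():
        return "ascii"
    try:
        raw.decode("utf-8")
        return "valid utf-8"
    except UnicodeDecodeError:
        return "not utf-8 (legacy encoding?)"


def classify_directory(directory: str = "."):
    """Return (raw_name, verdict) pairs; passing bytes to listdir yields bytes."""
    return [
        (raw, classify_name(raw))
        for raw in os.listdir(os.fsencode(directory))
    ]


for raw, verdict in classify_directory("."):
    print(raw, verdict)
```

Names flagged as not being UTF-8 are the candidates to investigate further with chardet or to convert with convmv.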

–jeroen

Posted in *nix, *nix-tools, Development, Encoding, Power User, Software Development, UTF-8, UTF8 | Leave a Comment »

Dark corners of Unicode / fuzzy notepad

Posted by jpluimers on 2017/04/20

You think you know Unicode? Think again, then read [Wayback] Dark corners of Unicode / fuzzy notepad.

On basics, sorting, comparison, decomposition, composition, width, whitespace, encoding, emoji, interesting code planes and dark corners. Lots of dark corners.

The examples are in Python, but hold for almost any programming language.

–jeroen

via: Kristian Köhntopp

Posted in Conference Topics, Conferences, Development, Encoding, Event, Software Development, Unicode | Leave a Comment »

HashLib4Pascal is a Delphi/FPC compatible library that provides an easy to use interface for computing hashes and checksums of strings (with a specified encoding), files, streams, byte arrays and untyped data to mention but a few.

Posted by jpluimers on 2017/02/15

One day I will need lots of hashing in Delphi: Xor-el/HashLib4Pascal: HashLib4Pascal is a Delphi/FPC compatible library that provides an easy to use interface for computing hashes and checksums of strings (with a specified encoding), files, streams, byte arrays and untyped data to mention but a few. [WayBack]

via: Hello all,I made a port of “HashLib” (http://hashlib.codeplex.com/) “with some fixes, additions and modifications” for Delphi (2010 ( I hope ) and above)… – Ugochukwu Mmaduekwe – Google+ [WayBack]

It’s a port of the C# HashLib – Home [WayBack]

Another fork is at https://github.com/bonesoul/HashLib

–jeroen

Posted in .NET, C#, Delphi, Delphi 10 Seattle, Delphi 10.1 Berlin (BigBen), Delphi 2010, Delphi XE, Delphi XE2, Delphi XE3, Delphi XE4, Delphi XE5, Delphi XE6, Delphi XE7, Delphi XE8, Development, Encoding, FreePascal, Pascal, Software Development | Leave a Comment »

Coping with UTF-16 / UCS-2 little endian in Batch files: numbers from WMIC

Posted by jpluimers on 2016/11/22

A while ago, I needed to get the various date, time and week values from WMIC to environment variables with pre-padded zeros. I thought: easy job, just write a batch file.

Tough luck: I couldn’t get the values to expand properly. In the end this was caused by WMIC emitting UTF-16 while the command interpreter does not expect double-byte characters, which messed up my original batch file.

What I wanted           What I got
wmic_Day=21             Day=21
wmic_DayOfWeek=04       wmic_DayOfWeek=4
wmic_Hour=15            wmic_Hour=15
wmic_Milliseconds=00    wmic_Milliseconds=
wmic_Minute=02          wmic_Minute=4
wmic_Month=05           wmic_Month=5
wmic_Quarter=02         wmic_Quarter=2
wmic_Second=22          wmic_Second=22
wmic_WeekInMonth=04     wmic_WeekInMonth=4
wmic_Year=2015          wmic_Year=2015

WMIC uses this encoding because the Wide versions of Windows API calls use UTF-16 (sometimes called UCS-2 as that is where UTF-16 evolved from).

As Windows uses little-endian encoding by default, the low byte of each UTF-16 code unit comes first, so every ASCII character is followed by its high byte, which is zero. Those zero bytes mess up the command interpreter.
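
The byte layout is easy to check with a few lines of Python. This is just an illustration of UTF-16 little endian with a byte order mark; it assumes nothing about WMIC beyond the encoding named above.

```python
import codecs

text = "Day=21"
raw = text.encode("utf-16-le")         # each ASCII character becomes two bytes
with_bom = codecs.BOM_UTF16_LE + raw   # a BOM (0xFF 0xFE) in front, as in a UTF-16 file

print(raw)        # the character byte comes first, the zero byte second

# Reading the bytes one character per byte leaves the BOM and NULs in the text.
naive = with_bom.decode("latin-1")
print(repr(naive))

# Decoding as UTF-16 consumes the BOM and restores the text.
decoded = with_bom.decode("utf-16")
print(decoded)
```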

Luckily rojo was of great help solving this.

His solution is centered around set /A, which:

  • handles integer numbers and calls them “numeric” (hinting floating point, but those are truncated to integer; one of the tricks rojo uses)
  • and (be careful with this as 08 and 09 are not octal numbers) uses these prefixes:
    • 0 for octal
    • 0x for hexadecimal
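
The octal trap can be illustrated outside of batch files with a short Python sketch. The helpers parse_like_set_a and pad2 are hypothetical: the first only imitates the prefix rules listed above, the second shows the plain decimal handling you actually want for zero-padded values. Neither is rojo’s solution.

```python
def parse_like_set_a(token: str) -> int:
    """Hypothetical helper imitating the prefix rules listed above."""
    lowered = token.lower()
    if lowered.startswith("0x"):
        return int(lowered[2:], 16)   # 0x prefix: hexadecimal
    if len(token) > 1 and token.startswith("0"):
        return int(token, 8)          # leading 0: octal, so "08" is invalid
    return int(token, 10)


def pad2(token: str) -> str:
    """Hypothetical helper: treat the token as decimal and left-pad with a zero."""
    return f"{int(token, 10):02d}"


for token in ("07", "08", "010", "0x1F"):
    try:
        print(token, "->", parse_like_set_a(token))
    except ValueError:
        print(token, "-> not a valid octal number")

print(pad2("4"), pad2("08"))
```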

Enjoy and shiver with the online help extract:
Read the rest of this entry »

Posted in Algorithms, Batch-Files, Development, Encoding, Floating point handling, Scripting, Software Development, UCS-2, UTF-16, UTF16 | Leave a Comment »

Encoding is hard… so how did the single quote become a circumflexed a followed by Euro sign and trade mark?

Posted by jpluimers on 2016/10/04

A while ago (in fact more than a year), I posted Encoding is hard… to G+ with the below picture.

[Wayback] ftfy (“fixes text for you”, a parody on “fixed that for you”) [Wayback] fixes it, but:

How did the single quote become “â€™”?

Actually, because of a common “beautification” of many Office suites (Microsoft and Open alike), the single quote was a special one: a Unicode Character ‘RIGHT SINGLE QUOTATION MARK’ (U+2019) which in UTF-8 is encoded as 0xE2 0x80 0x99.
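
The round trip can be reproduced in a few lines of Python. The sketch assumes the three UTF-8 bytes were misread as Windows-1252, which matches the circumflexed a, Euro sign and trade mark from the title; it uses only the standard library, not ftfy.

```python
quote = "\u2019"                     # RIGHT SINGLE QUOTATION MARK

utf8 = quote.encode("utf-8")         # the three bytes 0xE2 0x80 0x99
mojibake = utf8.decode("cp1252")     # each byte misread as one Windows-1252 character

print(utf8)
print(mojibake)

# Reversing the wrong step repairs the text again.
repaired = mojibake.encode("cp1252").decode("utf-8")
print(repaired == quote)
```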

Read the rest of this entry »

Posted in Development, Encoding, ftfy, ISO-8859, ISO8859, Mojibake, Software Development, Unicode, UTF-8, UTF8, Windows-1252 | Leave a Comment »