March 2026
M	T	W	T	F	S	S
	1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30	31

Archive for the ‘Encoding’ Category

When someone writes UTF-8 and UTF-16 strings to the same file in binary format without converting between them…

Posted by jpluimers on 2017/06/21

A while ago, I had to fix some stuff in an application that would write – using a binary mechanism – UTF-8 and UTF-16 strings (part of it XML in various flavours) to the same byte stream without converting between the two encodings.

Some links that helped me investigate what was wrong, choose what encoding to use for storage and fix it:

unicode – How to Convert Ansi to UTF 8 with TXMLDocument in Delphi – Stack Overflow
delphi – What should I use? UTF8 or UTF16? – Stack Overflow
IXMLDocument.SaveToStream does not always use UTF-16 encoding | Marc Durdin’s Blog
delphi – Length() vs Sizeof() on Unicode strings – Stack Overflow
TEncoding.UTF8.GetBytes: utf 8 – String to byte array in UTF-8? – Stack Overflow
SysUtils.ByteLength() is not the nicest function, so use only for a quick-fix Delphi Unicode String Length in Bytes – Stack Overflow
AnsiString, UnicodeString et al: String Types (Delphi) – RAD Studio
Question about type identity – delphi
On the type compatibility in Delphi | The Programming Works
Joe White’s Blog » Blog Archive » Grammar details of Delphi’s “type type” feature
Type Compatibility and Identity (Delphi) – RAD Studio
Delphi “type types”: similar types but not the same type identity, some examples.
Converting in various Delphi versions (pre-Unicode and post-Unicode versions): Delphi: Encoding Strings as Python do – Stack Overflow
UTF-8 Conversion Routines – RAD Studio

–jeroen

Posted in Delphi, Delphi 10 Seattle, Delphi 10.1 Berlin (BigBen), Delphi XE8, Development, Encoding, Software Development, UTF-16, UTF-8, UTF16, UTF8, XML, XML/XSD | 3 Comments »

How can I get the default code page for a locale? – The Old New Thing

Posted by jpluimers on 2017/06/20

Ask GetLocaleInfo (example function GetAnsiCodePageForLocale included): [WayBack] How can I get the default code page for a locale? – The Old New Thing

UINT GetAnsiCodePageForLocale(LCID lcid)
{
  UINT acp;
  int sizeInChars = sizeof(acp) / sizeof(TCHAR);
  if (GetLocaleInfo(lcid,
                    LOCALE_IDEFAULTANSICODEPAGE |
                    LOCALE_RETURN_NUMBER,
                    reinterpret_cast<LPTSTR>(&acp),
                    sizeInChars) != sizeInChars) {
    // Oops - something went wrong
  }
  return acp;
}
And even though you didn’t ask, you can use LOCALE_IDEFAULTCODEPAGE to get the OEM code page for a locale.

Bonus gotcha: There are a number of locales that are Unicode-only. If you ask the GetLocaleInfo function and ask for their ANSI and OEM code pages, the answer is “Um, I don’t have one.” (You get zero back.)

–jeroen

Posted in Development, Encoding, internatiolanization (i18n) and localization (l10), Software Development, The Old New Thing, Windows Development, Windows-1252 | 2 Comments »

Some notes on stripping NULL characters and BOMs from files

Posted by jpluimers on 2017/05/31

A while ago I bumped into applications that write alternating UTF-16 and UTF-8 to files without checking what type of encoding the files were using.

So here are some notes to at least save some of the contents.

Powershell: about_Special_Characters (including NULL)
sql server – How to remove NULL char (0x00) from object within PowerShell – Stack Overflow (actually a Powershell line that does the NULL stripping)
regular expressions – Powershell 2: How to strip a specific character from a body of ASCII text – Server Fault
tr: HOWTO remove null characters from a file « Remi Bergsma’s blog

TODO: figure out how to strip the BOM.

–jeroen

Posted in Development, Encoding, Software Development, UTF-16, UTF-8, UTF16, UTF8 | Leave a Comment »

git encoding trouble: recursively removing a directory where git prints out a different name than it accepts

Posted by jpluimers on 2017/05/11

The story so far:

A few years back I put all my conferences material in a GitHub repository https://github.com/jpluimers/Conferences/. There were a lot directories and files so I didn’t pay much attention to the initial check-in list. The files had been part of copy.com syncing between Windows and Mac machines.

Often git on a Mac is a bit easier than on Windows (on a Mac you can install them with the xcode-select --install trick which installs only the Command Line Tools without having to install the full Xcode [WayBack]).

I choose a Mac because it is closer to a Linux machine than Widows so I expected no encoding trouble (as git has a Linux origin: it “was created by Linus Torvalds in 2005 for development of the Linux kernel“).

Boy I was wrong:

Recently I cloned the repository in a different place and found out a few strange things:

Directories with accented characters had been duplicated, for instance in https://github.com/jpluimers/Conferences/tree/master/2011
1. …/EKON15-Renaissance-Hotel-D%FCsseldorf
2. …/EKON15-Renaissance-Hotel-Düsseldorf
Beyond Compare would show the same content

After a check-out git would not understand the %FC encoded directory name (%FC is IEC_8859-1 encoding for ü and \374 is the octal representation of 0xFC [WayBack]) and a git status would show stuff like this:

Untracked files:
  (use "git add ..." to include in what will be committed)

    EKON15-Renaissance-Hotel-D%FCsseldorf/

deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-Debugging/BO-EKON15-Delphi-XE2-Debugging.pdf"

A git rm -r --cached call [WayBack] would not work, as both these would fail:

$ git rm -r --cached EKON15-Renaissance-Hotel-D%FCsseldorf
fatal: pathspec 'EKON15-Renaissance-Hotel-D%FCsseldorf' did not match any files

and

$ git rm -r --cached "EKON15-Renaissance-Hotel-D\374sseldorf"
fatal: pathspec 'EKON15-Renaissance-Hotel-D\374sseldorf' did not match any files

So git could:

detect the directories and files
display the names of the detected directories and files
not translate back the specified names into directories and files

All if this was with:

$ git --version git version 1.9.5 (Apple Git-50.3)

This is how I fixed it

First I created an alias:

alias git-config="echo global: ; git config --list --global ; echo local: ; git config --lis --local ; echo system: ; git config --list --system"

That allowed me to view the git settings on various levels in my system.

It revealed I didn’t have the core.precomposeunicode setting at all (valid values are true or false). I also read various stories about one or both being the correct value: osx – Git and the Umlaut problem on Mac OS X – Stack Overflow [WayBack].

–jeroen

Result of git status:

	$ git status .
	On branch master
	Your branch is up-to-date with 'origin/master'.

	Changes not staged for commit:
	(use "git add/rm <file>…" to update what will be committed)
	(use "git checkout — <file>…" to discard changes in working directory)

	deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-Debugging/BO-EKON15-Delphi-XE2-Debugging.pdf"
	deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-Unit-Testing/BO-EKON15-Delphi-XE2-Unit-Testing.pdf"
	deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-Workshop/BO-EKON15-2011-XE2-Wokshop-0-sample-code.txt"
	deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-Workshop/BO-EKON15-2011-XE2-Wokshop-1-Delphi-64bit.pdf"
	deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-Workshop/BO-EKON15-2011-XE2-Wokshop-2-LiveBindings-DataBinding.pdf"
	deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-Workshop/BO-EKON15-2011-XE2-Wokshop-3-Delphi-VCL Styles.pdf"
	deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-Workshop/BO-EKON15-2011-XE2-Wokshop-4-Delphi-FireMonkey.pdf"
	deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-Workshop/BO-EKON15-2011-XE2-Wokshop-5-Delphi-FireMonkey-xPlatform.pdf"
	deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/Delphi-XE2-and-XML/BO-EKON15-2011-Delphi-XE2-and-XML.pdf"
	deleted: "EKON15-Renaissance-Hotel-D\374sseldorf/XSL-transforming-XML/BO-EKON15-2011-XSL-transforming-XML.pdf"

view raw

git-status-does-not-like-accented-characters-on-os-x.txt

hosted with ❤ by GitHub

Posted in Development, DVCS - Distributed Version Control, Encoding, git, ISO-8859, Software Development, Source Code Management | Leave a Comment »

Applications that scale badely on High-DPI Displays: How to Stop the Madness – via: SQLServerCentral

Posted by jpluimers on 2017/05/10

Many applications still scale badly on High-DPI displays: dialogs way too small, icons you need a microscope for, etc.

SSMS in High-DPI Displays: How to Stop the Madness – SQLServerCentral explains a great trick that works for many applications, for intance:

SQL Server Management Studio (SSMS)
Eclipse IDE
Various Adobe applications (Fireworks, Photoshop, DreamWeaver, Illustrator
Probably: Delphi

The trick comes down to enabling the PreferExternalManifest registry setting and then create a manual manifest for the application that forces the application to use “bitmap scaling” by basically telling it does not support “XP style DPI scaling”.

You name manifest file named after the exe and stored it in the same directory as the exe.

After that, you also have to rename the exe to a temporary name and then back in order to refresh the cache.

A quote from the trick:

In Windows Vista, you had two possible ways of scaling applications: with the first one (the default) applications were instructed to scale their objects using the scaling factor imposed by the operating system. The results, depending on the quality of the application and the Windows version, could vary a lot. Some scaled correctly, some other look very similar to what we are seeing in SSMS, with some weird-looking GUIs. In Vista, this option was called “XP style DPI scaling”.

The second option, which you could activate by unchecking the “XP style” checkbox, involved drawing the graphical components of the GUI to an off-screen buffer and then drawing them back to the display, scaling the whole thing up to the screen resolution. This option is called “bitmap scaling” and the result is a perfectly laid out GUI.

In order to enable this option in Windows 10, you need to merge this key to your registry:
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\SideBySide]
"PreferExternalManifest"=dword:00000001
Then, the application has to be decorated with a manifest file that instructs Windows to disable DPI scaling and enable bitmap scaling, by declaring the application as DPI unaware. The manifest file has to be saved in the same folder as the executable (ssms.exe) and its name must be ssms.exe.manifest. In this case, for SSMS 2014, the file path is “C:\Program Files (x86)\Microsoft SQL Server\120\Tools\Binn\ManagementStudio\Ssms.exe.manifest”.

Paste this text inside the manifest file and save it in UTF8 encoding:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0" xmlns:asmv3="urn:schemas-microsoft-com:asm.v3">

<dependency>

<dependentAssembly>

<assemblyIdentity type="win32" name="Microsoft.Windows.Common-Controls" version="6.0.0.0" processorArchitecture="*" publicKeyToken="6595b64144ccf1df" language="*">

</assemblyIdentity>

</dependentAssembly>

</dependency>

<dependency>

<dependentAssembly>

<assemblyIdentity type="win32" name="Microsoft.VC90.CRT" version="9.0.21022.8" processorArchitecture="amd64" publicKeyToken="1fc8b3b9a1e18e3b">

</assemblyIdentity>

</dependentAssembly>

</dependency>

<trustInfo xmlns="urn:schemas-microsoft-com:asm.v3">

<security>

<requestedPrivileges>

<requestedExecutionLevel level="asInvoker" uiAccess="false"/>

</requestedPrivileges>

</security>

</trustInfo>

<asmv3:application>

<asmv3:windowsSettings xmlns="http://schemas.microsoft.com/SMI/2005/WindowsSettings">

<ms_windowsSettings:dpiAware xmlns:ms_windowsSettings="http://schemas.microsoft.com/SMI/2005/WindowsSettings">false</ms_windowsSettings:dpiAware>

</asmv3:windowsSettings>

</asmv3:application>

</assembly>

view raw

ssms.exe.manifest

hosted with ❤ by GitHub

This “Vista style” bitmap scaling is very similar to what Apple is doing on his Retina displays, except that Apple uses a different font rendering algorithm that looks better when scaled up. If you use this technique in Windows, ClearType rendering is performed on the off-screen buffer before upscaling, so the final result might look a bit blurry.The amount of blurriness you will see depends on the scale factor you set in the control panel or in the settings app in Windows 10. Needless to say that exact pixel scaling looks better, so prefer 200% over 225% or 250% scale factors, because there is no such thing as “half pixel”.

–jeroen

Source: SSMS in High-DPI Displays: How to Stop the Madness – SQLServerCentral

Posted in Database Development, Delphi, Development, Eclipse IDE, Encoding, Java, Java Platform, Software Development, SQL, SQL Server, SSMS SQL Server Management Studio, UTF-8, UTF8 | 4 Comments »

ext3 – How to tell the language encoding of a filename on Linux? – Server Fault

Posted by jpluimers on 2017/05/08

From ext3 – How to tell the language encoding of a filename on Linux? – Server Fault [WayBack] I learned a few things:

filename encoding on Linux is undetermined – the file system just assumes a byte array of characters
FTP and SFTP suffer from this as well (SFTP is based on SSH which now prefers UTF-8 [WayBack])

A good default is UTF-8, but it’s never guaranteed.

Two tools can help to determine the encoding of a filename:

convmv [WayBack] converts filenames from one encoding to another
chardet (Python) The Universal Character Encoding Detector

–jeroen

Posted in *nix, *nix-tools, Development, Encoding, Power User, Software Development, UTF-8, UTF8 | Leave a Comment »

Dark corners of Unicode / fuzzy notepad

Posted by jpluimers on 2017/04/20

You think you know Unicode? Think again, then read [Wayback] Dark corners of Unicode / fuzzy notepad.

On basics, sorting, comparison, decomposition, composition, width, whitespace, encoding, emoji, interesting code planes and dark corners. Lots of dark corners.

The examples are in Python, but hold for almost any programming language

–jeroen

via: Kristian Köhntopp

Posted in Conference Topics, Conferences, Development, Encoding, Event, Software Development, Unicode | Leave a Comment »

HashLib4Pascal is a Delphi/FPC compatible library that provides an easy to use interface for computing hashes and checksums of strings (with a specified encoding), files, streams, byte arrays and untyped data to mention but a few.

Posted by jpluimers on 2017/02/15

One day I will need lots of hashing in Delphi: Xor-el/HashLib4Pascal: HashLib4Pascal is a Delphi/FPC compatible library that provides an easy to use interface for computing hashes and checksums of strings (with a specified encoding), files, streams, byte arrays and untyped data to mention but a few. [WayBack]

via: Hello all,I made a port of “HashLib” (http://hashlib.codeplex.com/) “with some fixes, additions and modifications” for Delphi (2010 ( I hope ) and above)… – Ugochukwu Mmaduekwe – Google+ [WayBack]

It’s a port of the C# HashLib – Home [WayBack]

Another fork is at https://github.com/bonesoul/HashLib

–jeroen

Posted in .NET, C#, Delphi, Delphi 10 Seattle, Delphi 10.1 Berlin (BigBen), Delphi 2010, Delphi XE, Delphi XE2, Delphi XE3, Delphi XE4, Delphi XE5, Delphi XE6, Delphi XE7, Delphi XE8, Development, Encoding, FreePascal, Pascal, Software Development | Leave a Comment »

Coping with UTF-16 / UCS-2 little endian in Batch files: numbers from WMIC

Posted by jpluimers on 2016/11/22

A while ago, I needed to get the various date, time and week values from WMIC to environment variables with pre-padded zeros. I thought: easy job, just write a batch file.

Tough luck: I couldn’t get the values to expand properly. Which in the end was caused by WMIC emitting UTF-16 and the command-interpreter not expecting double-byte character sets which messed up my original batch file.

What I wanted What I got

What I wanted	What I got
`wmic_Day=21 wmic_DayOfWeek=04 wmic_Hour=15 wmic_Milliseconds=00 wmic_Minute=02 wmic_Month=05 wmic_Quarter=02 wmic_Second=22 wmic_WeekInMonth=04 wmic_Year=2015`	`Day=21 wmic_DayOfWeek=4 wmic_Hour=15 wmic_Milliseconds= wmic_Minute=4 wmic_Month=5 wmic_Quarter=2 wmic_Second=22 wmic_WeekInMonth=4 wmic_Year=2015`

wmic_Day=21
wmic_DayOfWeek=04
wmic_Hour=15
wmic_Milliseconds=00
wmic_Minute=02
wmic_Month=05
wmic_Quarter=02
wmic_Second=22
wmic_WeekInMonth=04
wmic_Year=2015

Day=21
wmic_DayOfWeek=4
wmic_Hour=15
wmic_Milliseconds=
wmic_Minute=4
wmic_Month=5
wmic_Quarter=2
wmic_Second=22
wmic_WeekInMonth=4
wmic_Year=2015

WMIC uses this encoding because the Wide versions of Windows API calls use UTF-16 (sometimes called UCS-2 as that is where UTF-16 evolved from).

As Windows uses little-endian encoding by default, the high byte (which is zero) of a UTF-16 code point with ASCII characters comes first. That messes up the command interpreter.

Lucikly rojo was of great help solving this.

His solution is centered around set /A, which:

handles integer numbers and calls them “numeric” (hinting floating point, but those are truncated to integer; one of the tricks rojo uses)
and (be careful with this as 08 and 09 are not octal numbers) uses these prefixes:
- 0 for Octal
- 0x for hexadecimal

Enjoy and shiver with the online help extract:
Read the rest of this entry »

Posted in Algorithms, Batch-Files, Development, Encoding, Floating point handling, Scripting, Software Development, UCS-2, UTF-16, UTF16 | Leave a Comment »

Encoding is hard… so how did the single quote become a circumflexed a followed by Euro sign and trade mark?

Posted by jpluimers on 2016/10/04

A while ago (in fact more than a year), I posted Encoding is hard… go G+ with the below picture.

[Wayback] ftfy (“fixes text for you”, a parody on “fixed that for you”) [Wayback] fixes it, but:

How did the single quote become “â€™“?

Actually, because of a a common “beautification” of many Office suites (Microsoft and Open alike), the single quote was a special one: a Unicode Character ‘RIGHT SINGLE QUOTATION MARK’ (U+2019) which in UTF-8 is encoded as 0xE2 0x80 0x99.

Read the rest of this entry »

Posted in Development, Encoding, ftfy, ISO-8859, ISO8859, Mojibake, Software Development, Unicode, UTF-8, UTF8, Windows-1252 | Leave a Comment »

« Previous Entries

Next Entries »

	Attila Kovacs on Crowbarring Windows 95 into Wi…
	Jeroen Wiert Pluimer… on Does Odido (the old T-Mobile N…
	Lars Fosdal on Security alarm provider Woonve…
	Thomas Mueller on Question got closed in May 202…
	Thaddy de Koning on Formulier voor bewindvoerders…

The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

Subscribe

Archives

Recent Comments

Recent Posts

Blog Stats

Meta title

Tag Cloud Title

Top Clicks

Top Posts

My badges

Twitter Updates

My Flickr Stream

Pages

All categories

Email Subscription

Archive for the ‘Encoding’ Category

When someone writes UTF-8 and UTF-16 strings to the same file in binary format without converting between them…

How can I get the default code page for a locale? – The Old New Thing

Some notes on stripping NULL characters and BOMs from files

git encoding trouble: recursively removing a directory where git prints out a different name than it accepts

This is how I fixed it

Applications that scale badely on High-DPI Displays: How to Stop the Madness – via: SQLServerCentral

ext3 – How to tell the language encoding of a filename on Linux? – Server Fault

Dark corners of Unicode / fuzzy notepad

HashLib4Pascal is a Delphi/FPC compatible library that provides an easy to use interface for computing hashes and checksums of strings (with a specified encoding), files, streams, byte arrays and untyped data to mention but a few.

Coping with UTF-16 / UCS-2 little endian in Batch files: numbers from WMIC

Encoding is hard… so how did the single quote become a circumflexed a followed by Euro sign and trade mark?

	<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

	<assembly xmlns="urn:schemas-microsoft-com:asm.v1" manifestVersion="1.0" xmlns:asmv3="urn:schemas-microsoft-com:asm.v3">

	<dependency>
	<dependentAssembly>
	<assemblyIdentity type="win32" name="Microsoft.Windows.Common-Controls" version="6.0.0.0" processorArchitecture="" publicKeyToken="6595b64144ccf1df" language="">
	</assemblyIdentity>
	</dependentAssembly>
	</dependency>

	<dependency>
	<dependentAssembly>
	<assemblyIdentity type="win32" name="Microsoft.VC90.CRT" version="9.0.21022.8" processorArchitecture="amd64" publicKeyToken="1fc8b3b9a1e18e3b">
	</assemblyIdentity>
	</dependentAssembly>
	</dependency>

	<trustInfo xmlns="urn:schemas-microsoft-com:asm.v3">
	<security>
	<requestedPrivileges>
	<requestedExecutionLevel level="asInvoker" uiAccess="false"/>
	</requestedPrivileges>
	</security>
	</trustInfo>

	<asmv3:application>
	<asmv3:windowsSettings xmlns="http://schemas.microsoft.com/SMI/2005/WindowsSettings">
	<ms_windowsSettings:dpiAware xmlns:ms_windowsSettings="http://schemas.microsoft.com/SMI/2005/WindowsSettings">false</ms_windowsSettings:dpiAware>
	</asmv3:windowsSettings>
	</asmv3:application>

	</assembly>

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

Subscribe

Archives

Recent Comments

Recent Posts

Blog Stats

Meta title

Tag Cloud Title

Top Clicks

Top Posts

My badges

My Flickr Stream

Pages

All categories

Email Subscription

Archive for the ‘Encoding’ Category

Rate this:

Share this:

Rate this:

Share this:

Rate this:

Share this:

This is how I fixed it

Rate this:

Share this:

Rate this:

Share this:

Rate this:

Share this:

Rate this:

Share this:

Rate this:

Share this:

Rate this:

Share this:

Rate this:

Share this: