Archive for the ‘EBCDIC’ Category

UTF-8, Explained Simply – YouTube

Posted by jpluimers on 2026/03/04

Cool, interesting video: [Wayback/Archive] UTF-8, Explained Simply – YouTube

It covers the history from the late-1800s Baudot code (also known as ITA1), via 1930s ITA2 and the FIELDATA / EBCDIC era of the 1950s and 1960s, through 7-bit ASCII in the 1970s and the incompatible UCS-2 (now superseded by UTF-16) of the 1990s, to the current day and age of UTF-8 (which actually started out on a placemat in 1992).

Though mentioning 8-bit encoding, it skips details of extended ASCII encodings like ISO/IEC 8859 and Windows-1252.

It goes to quite some length on decoding UTF-8 and showing how forgiving the UTF-8 standard is. Yes, it is a self-synchronising code thanks to the venerable Ken Thompson.
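To illustrate the self-synchronising property, here is a minimal Python sketch (not taken from the video; the helper names `is_continuation` and `resync` are made up for this example). It relies only on the fact that UTF-8 continuation bytes always have the bit pattern `10xxxxxx`, so a decoder that lands in the middle of a character can skip forward to the next character start.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which always look like 0b10xxxxxx."""
    return (byte & 0xC0) == 0x80


def resync(data: bytes, offset: int) -> int:
    """Return the first index >= offset that starts a new encoded character."""
    while offset < len(data) and is_continuation(data[offset]):
        offset += 1
    return offset


encoded = "a€b".encode("utf-8")   # 'a' is 1 byte, the euro sign is 3 bytes, 'b' is 1 byte
start = resync(encoded, 2)        # offset 2 lands inside the euro sign
tail = encoded[start:].decode("utf-8")
print(start, tail)
```

Starting in the middle of the euro sign, the sketch skips the two remaining continuation bytes and resumes decoding at `b`.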

Definitely worth watching, as it also covers the zero-width joiner. Many people nowadays know it from combining emoji, but it was in fact introduced to support various scripts such as the Arabic script and the Indic scripts.
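As a small illustration of the emoji use of the zero-width joiner (U+200D), this Python sketch builds a family emoji from three separate emoji code points. The variable names are mine; whether the result renders as one glyph depends on the font and platform.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

# man + ZWJ + woman + ZWJ + girl: fonts that support it render a single family glyph
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"

code_points = [f"U+{ord(ch):04X}" for ch in family]
print(len(family), code_points)
print(len(family.encode("utf-8")), "bytes in UTF-8")
```

What looks like one character on screen is five code points, which is one more reason not to equate "character" with "code point".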

Oh, the placemat story: Read the rest of this entry »

Posted in ASCII, Development, EBCDIC, Encoding, ISO-8859, Software Development, UCS-2, Unicode, UTF-16, UTF-8, Windows-1252 | Leave a Comment »

Some interesting encoding/Unicode/text articles on kunststube and links for test files of various encodings

Posted by jpluimers on 2016/08/17

After yesterday’s post on Testing and static methods don’t go well together, I read around on Source (kunststube [WayBack]) a bit more and found these very nice articles on encoding, Unicode and text:

Related to those, some other nice readings:

Is there a set of “Lorem ipsums” files for testing character encoding issues? – Stack Overflow [WayBack]
International Components for Unicode: ICU User Guide [WayBack]
- International Components for Unicode ː Repository Browser: repos: icu/data/trunk/charset/data/ucm

ftp://ftp.unicode.org/Public/MAPPINGS [WayBack]

Notes on contents of the MAPPING directory:
EASTASIA:
    This directory is obsolete.
ETSI:
    ETSI GSM 03.38 7-bit default alphabet mapping.
ISO8859:
    These are the mapping tables of the ISO 8859 series (1 - 16).
OBSOLETE:
    Obsolete and unsupported mapping tables for historical
    and archival purposes only.
VENDORS:
    Miscellaneous mapping tables for small codesets, typically provided
    by vendors. The majority of current, useful tables are here.

–jeroen

Posted in Ansi, ASCII, CP437/OEM 437/PC-8, Development, EBCDIC, Encoding, ISO-8859, ISO8859, Shift JIS, Software Development, Unicode, UTF-16, UTF-8, UTF16, UTF8, Windows-1252 | Leave a Comment »

If you think CSV is easy; think again!

Posted by jpluimers on 2012/12/05

Lots of people think CSV is easy: it’s just a bunch of values separated with commas. But in practice it is not. Various reasons can make CSV very hard, especially since “CSV” is not a single, well-defined format. As always, importing is harder than exporting. A few reasons that make it hard:

Comma is often not the separator
The separator can be inside a field as well, so you need some form of quoting the separator
If quotes are in the fields, you need to have some form of escaping these quotes (usually done by doubling the double quotes or doubling the single quotes or quoting commas and newlines)
What kind of quotes do you want (especially when you want to embed the CSV into an XML or HTML attribute)?
Do you allow for newlines (and if yes: what kind of newline representation: CRLF, LF, CR, LFCR?) Some solve this by replacing newline representations with spaces, but that is not always a good idea.
Encoding? What encoding: everything is EBCDIC, right?
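The separator, quoting and newline points above can be shown with a short sketch. It uses Python's standard `csv` module rather than the C# libraries linked below, purely for illustration; the sample row is made up.

```python
import csv
import io

# One field contains the separator, one contains quotes, one contains a newline.
row = ["1,5", 'He said "hi"', "line one\nline two"]

# newline="" keeps the csv module in control of line endings.
buffer = io.StringIO(newline="")
csv.writer(buffer, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL).writerow(row)
text = buffer.getvalue()
print(repr(text))

# Reading it back with the same dialect restores the original fields.
parsed = next(csv.reader(io.StringIO(text, newline="")))

# With a semicolon separator the first field no longer needs quoting.
buffer2 = io.StringIO(newline="")
csv.writer(buffer2, delimiter=";").writerow(row)
text2 = buffer2.getvalue()
```

The round trip only works because writer and reader agree on separator, quote character and escaping; a reader that guesses a different dialect will split the same bytes differently.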

A few links that helped me a lot getting input and output of CSV right in C#:

c# – Creating a DataTable from CSV File – Stack Overflow.
JoshClose/CsvHelper · GitHub.
CSV TextfieldParser parsing through lines in CSV File with Single and Double Quotes, while skipping lines of unwanted data.
Though be careful with TextFieldParser problem with Double Quotes with Quotes.
c# – Creating a comma separated list from IList or IEnumerable – Stack Overflow.
(simple solution when you know no strange characters are involved).
c# – How to split csv whose columns may contain , – Stack Overflow.
A Fast CSV Reader – CodeProject.

Thanks to Jabulaza:

–jeroen

via: Comma-separated values – Wikipedia, the free encyclopedia.

Posted in Conference Topics, Conferences, CSV, Development, EBCDIC, Encoding, Event, Software Development | 4 Comments »

Some words on Unicode in Windows (Delphi, .NET, APIs, etc)

Posted by jpluimers on 2012/04/05

O'Reilly book "Unicode Explained: Internationalize Documents, Programs, and Web Sites"

With the growing integration between systems, and the mismatch between those that support Unicode and those that do not, I find that a lot of organisations lack basic Unicode knowledge.

So let’s put down a few things that help as a primer and get some confusion out of the way.

Please read the article on Unicode by Joel on Software, and the book Unicode Explained. The book is from 2006, and still very valid.

Unicode

Unicode started in the late 80s of last century as a 16-bit character model.

Somehow lots of people still think Unicode is a 16-bit double-byte character set. It is not. It uses a variable width encoding for storage.

All encodings except the 32-bit ones are variable width. The UTF-16 encoding is a variable width encoding where each code point (not character! see below why) takes one or two 16-bit words.

This is because, as of Unicode version 2.0 in 1996, a surrogate pair mechanism was introduced to be able to have more than 64k code points.
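The surrogate mechanism is plain arithmetic. A minimal Python sketch (the function name `to_surrogate_pair` is made up for this example) splits a code point above U+FFFF into a high and a low surrogate and checks the result against Python's own UTF-16 encoder:

```python
def to_surrogate_pair(code_point: int):
    """Split a supplementary-plane code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)     # U+1F600 GRINNING FACE
utf16 = "\U0001F600".encode("utf-16-be")
print(f"{high:04X} {low:04X}", utf16.hex())
```

Two 16-bit words for one code point is exactly why UTF-16 is variable width.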

The architecture of Unicode is completely different than traditional single-byte character sets or double-byte character sets.

In Unicode, there is a distinction between code points (the mapping of characters to numeric IDs) and storage/encoding (Windows now uses UTF-16LE, which is a superset of the UCS-2 it used in the past); visual representation (glyphs/renderings) is left to fonts.

Unicode has over a million code points, logically divided into 17 planes. Only the code points in the Basic Multilingual Plane can be encoded into one 16-bit word.

There is no font that can display all Unicode code points. By original aim, the first 256 Unicode code points are identical to the ISO 8859-1 character set (which is Windows code page 28591, not Windows-1252!) for which most fonts can display most characters.
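The difference between ISO 8859-1 and Windows-1252 is easy to check. This sketch assumes Python's `iso-8859-1` codec, which maps all 256 byte values (including the C0/C1 control ranges) one-to-one onto the first 256 code points:

```python
all_bytes = bytes(range(256))

# Every byte value decodes to the code point with the same number.
latin1_text = all_bytes.decode("iso-8859-1")
same_as_unicode = all(ord(ch) == i for i, ch in enumerate(latin1_text))

# Windows-1252 reuses 0x80..0x9F for printable characters such as the euro sign.
euro = b"\x80".decode("cp1252")
control = b"\x80".decode("iso-8859-1")
print(same_as_unicode, euro, f"U+{ord(control):04X}")
```

So a file labelled ISO 8859-1 that contains a euro sign at byte 0x80 is really Windows-1252.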

By now, you probably grasp that Unicode is not an easy thing to get right. And that can be hard, hence people love and hate Unicode at the same time. Maybe I should get the T-Shirt :).

One thing that complicates things is that Unicode allows for both composite characters and ready-made composites. This is one case where different sequences can be equivalent, so there is Unicode equivalence, for which you need some knowledge of Unicode Normalization (be sure to read this StackOverflow question and this article by Michael Kaplan on Unicode Normalization).
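A short Python sketch of that equivalence, using the standard `unicodedata` module: the precomposed é and the sequence e + combining acute accent look identical but only compare equal after normalization.

```python
import unicodedata

precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)  # the raw strings differ

nfc = unicodedata.normalize("NFC", decomposed)   # compose where possible
nfd = unicodedata.normalize("NFD", precomposed)  # decompose fully
print(nfc == precomposed, nfd == decomposed)
```

Normalizing both sides to the same form (NFC is a common choice) before comparing or storing avoids this class of mismatch.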

There are many Unicode encodings, of which UTF-8 and UTF-16 are the most widely used (and are variable length). UTF-32 is fixed length. All 16-bit and 32-bit encodings can have big-endian and little-endian storage and can use a Byte Order Mark (BOM) to indicate their endianness. Not all software uses BOMs, and there are BOMs for UTF-8 and other encodings as well (for UTF-8 it is not recommended to include a BOM).
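BOM detection can be sketched in a few lines of Python using the constants from the standard `codecs` module. The function name `sniff_bom` is made up, and this is only a heuristic: data without a BOM yields no answer, and UTF-16LE text that starts with U+0000 right after its BOM would be mistaken for UTF-32LE.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32LE BOM starts with the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) when no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = sniff_bom(sample)
print(encoding, sample[skip:].decode(encoding))
```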

When only parts of your development environment support Unicode strings, you need to be aware of which do and which don’t. For any interface boundary between those, you need to be aware of potential data loss, and need to decide how to cope with that.

For instance, does your database use Unicode or not for character storage? (For Microsoft SQL Server: do you use CHAR/VARCHAR or NCHAR/NVARCHAR; you should aim for NVARCHAR, yes you really should, do not use text, ntext and image). What do you do while transferring Unicode and non-Unicode text to it? Ask the same questions for Web Services, configuration files, binary storage, message queueing and various other interfaces to the outside world.
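The data loss at such a boundary can be made visible with a small Python sketch. It is a stand-in for any Unicode to non-Unicode transfer (for example into a single-byte VARCHAR column); the sample text and the choice of ISO 8859-1 as the target are assumptions for illustration.

```python
text = "Łódź costs 5 €"

# Strict conversion: fails as soon as a character has no slot in the target encoding.
try:
    text.encode("iso-8859-1")
    lossless = True
except UnicodeEncodeError:
    lossless = False

# Lenient conversion: silently replaces what does not fit with '?'.
lossy = text.encode("iso-8859-1", errors="replace")
print(lossless, lossy.decode("iso-8859-1"))
```

Failing loudly or replacing silently is the decision each interface boundary forces on you; the second option is the one that corrupts data without anyone noticing.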

The Windows API is almost exclusively Unicode (see this StackOverflow question for more details)

Delphi and Unicode

Let’s focus a bit on Delphi now, as the migration towards Unicode at clients has raised a few questions over the last couple of months.

One of the key questions is why there are no conversion tools that help you migrate your existing source code to fully embrace Unicode.

The short answer is: because you can’t automate the detection of intent in your codebase.

The longer answer starts with the fact that there are tools that detect parts of your Delphi source that potentially have problems: the compiler hints, warnings and errors bring your attention to spots that are fishy, are likely to fail, or are plain wrong.

Delphi uses the standard Windows storage format for Unicode text: UTF-16LE.

Next to that, Delphi supports conversion to and from UTF-8 and UTF-32 (in their various endianness forms).

External storage of text is best done as UTF-8 because it doesn’t have endianness, and because plain ASCII text is already valid UTF-8, which eases exchange with systems that expect single-byte encodings such as ISO-8859-1.

Marco Cantu wrote a very nice whitepaper about Delphi and Unicode, and I did a Delphi Unicode talk at CodeRage 4 and posted a lot of Delphi Unicode links at StackOverflow.

A few extra notes on Delphi and Unicode:

With Delphi string types, stick to the UnicodeString (default string as of Delphi 2009) and AnsiString (default string until Delphi 2007) as their memory management is done by Delphi. WideString management is done by COM, so only use that when you really need to. Also avoid ShortString.

For any interfaces to the external world, you need to decide which ones to keep to generic string, Char, PChar and which ones to fix to AnsiChar/PAnsiChar/AnsiString (+ accompanying codepage) or fix at WideChar/PWideChar/UnicodeString.

Of course remnants from the past will catch up with you: if you have Technical Debt from the days when characters were bytes, and you abused Char/PChar/array-of-char/etc. to hold binary data, you need to fix that and use Byte/PByte/TByteArray/PByteArray instead. It can be costly to pay the accrued debt on that.

–jeroen

PS:

There is even more confusion on character set, code page, etc, which Mihai tries to set straight at the Why is the default console codepage called “OEM”? episode of “The Old New Thing“
Getting your character set (ANSI, Windows-1252, ISO 8859-1) right is a problem of the same order of magnitude as Ben Hutchings shows.
Notepad supports three kinds of text formats

Posted in .NET, C#, Delphi, Development, EBCDIC, Encoding, ISO-8859, Software Development, Technical Debt, Unicode, UTF-8 | 2 Comments »

Comparisons for EBCDIC CCSID 37, 500 and 1047

Posted by jpluimers on 2011/09/20

The referenced article explains the difference in code points between EBCDIC CCSID 37 and EBCDIC CCSID 500, and the difference in code points between EBCDIC CCSID 37 and EBCDIC CCSID 1047:
IBM CCSID Comparisons – United States.

Basically, these are the codepoints that are sensitive:

4A, 4F, 5A, 5F, AD, B0, BA, BB and BD.
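Part of this list can be reproduced with a small Python sketch. It assumes the standard library codecs `cp037` and `cp500` correspond to CCSID 37 and CCSID 500; as far as I know Python does not ship a codec for CCSID 1047, so only the first comparison is covered here.

```python
# Bytes that decode to different characters in the two EBCDIC code pages.
differing = [
    b for b in range(256)
    if bytes([b]).decode("cp037") != bytes([b]).decode("cp500")
]

for b in differing:
    print(f"{b:02X}: cp037={bytes([b]).decode('cp037')!r} cp500={bytes([b]).decode('cp500')!r}")

# Example of the practical effect: '!' sits at a different byte in each code page.
bang_037 = "!".encode("cp037")
bang_500 = "!".encode("cp500")
```

The bytes printed by this comparison should all come from the list above; AD and BD are expected to show up only in the comparison with CCSID 1047.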

–jeroen

Posted in Development, EBCDIC, Encoding, MQ Message Queueing/Queuing, Software Development, WebSphere MQ | Leave a Comment »

The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests
