InterBase 2007: UPPER and collations (and a trick to specify the character set for string literals)

December 2009
M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Posted by jpluimers on 2009/12/01

One of the applications that I’m currently involved with will in the future be used in different European countries.

Since it is developed in Delphi 2007, and uses InterBase 2007 we have chosen to use the ISO8859_15 character set: it is a single byte character set very similar to ISO8859_1, but supports the euro sign (€) and some other characters (Š, š, Ž, ž, Œ, œ and Ÿ).

When testing, we found out that UPPER would not always function as we expected.
Hence below some explanation about how UPPER behaves depending on the character sets and collation sequences you specify.

Note: most of this also holds for other versions of InterBase and for FireBird versions, but I have not checked everything with all versions yet, FireBird might be different as this piece of documentation shows.
If anyone does have the results for other InterBase or FireBird versions, please let me know the results.

In addition to having ISO8859_15 as the character set, we also had ISO8859_15 as collation sequence.
However, when testing with diacritics, we found out that the UPPER function in InterBase only works when you have a specific collation.
For generic collations (like NONE or a collation with the same name as the character set) we did not get the results we expected.

For showing you what we observed, lets start with an example table and a record of test data:

CREATE TABLE UPPER_TEST
(
  MIXED_ISO8859_15_ISO8859_15              VARCHAR(    20) CHARACTER SET ISO8859_1  COLLATE ISO8859_15,
  MIXED_ISO8859_15_DU_NL9                  VARCHAR(    20) CHARACTER SET ISO8859_1  COLLATE DU_NL,
  MIXED_ISO8859_15                         VARCHAR(    20) CHARACTER SET ISO8859_1
)
;

INSERT INTO UPPER_TEST
(
  MIXED_ISO8859_15_ISO8859_15,
  MIXED_ISO8859_15_DU_NL9,
  MIXED_ISO8859_15
)
VALUES
(
  'abc ABC ábç ÁBÇ',
  'abc ABC ábç ÁBÇ',
  'abc ABC ábç ÁBÇ'
);

So, all three columns have the same value.

You see that we used the character set ISO8859_15, and collation sequences ISO8859_15, DU_NL (aka DU_NL9, see below), and none at all (which becomes the default collation for the character set used, see the sample query at the very end).

First a bit more about character sets and collaction sequences.
There are quite a few on-line articles on this topic (including this great archived article by Peter Gulutzan – of MySQL and SQL Performance Tuning fame – that used to be on dbazine.com), so I’ll keep this brief.

InterBase 2007 (like many other DBMS systems) has both character sets and collation sequences.
Character sets define how characters are stored: single byte (and if so: which characters), multi-byte (and how) just like character encoding in generic computing.
Collation sequences define much more: sorting order, comparison, expansion, upper casing, etc.

An InterBase database can have a default character set (when omitted it is NONE), but the default collation sequence is implicit: if you do not define a collation sequence it has the same name as the character set.

(Sidenote: you can define your own collation drivers for InterBase).

For each character set, you can only use the collation sequences that belong to it.
This shows the list of character sets and collation sequences defined in your database:

select
  chars.rdb$character_set_name,
  cols.*
from
  rdb$collations cols
left join
  rdb$character_sets chars
on
  chars.rdb$character_set_id = cols.rdb$character_set_id
order by
  cols.rdb$character_set_id, rdb$collation_id

Now lets look at some ways of getting ‘the’ uppercase version of these strings.

select
  upper(_ISO8859_15 'abc ABC ábç ÁBÇ') collate DU_NL9,

  upper(u.MIXED_ISO8859_15),
  upper(u.MIXED_ISO8859_15_DU_NL9),
  upper(u.MIXED_ISO8859_15_ISO8859_15),

  upper(u.MIXED_ISO8859_15)  collate DU_NL9,
  upper(u.MIXED_ISO8859_15_DU_NL9) collate DU_NL,
  upper(u.MIXED_ISO8859_15_ISO8859_15) collate DU_NL9,

  u.MIXED_ISO8859_15,
  u.MIXED_ISO8859_15_DU_NL9,
  u.MIXED_ISO8859_15_ISO8859_15
from
  UPPER_TEST u

These are the results for each line in the query, with a star for each line that has the expected UPPER result:

*  upper(_ISO8859_15 'abc ABC ábç ÁBÇ') collate DU_NL9  ABC ABC ÁBÇ ÁBÇ
   upper(_ISO8859_15 'abc ABC ábç ÁBÇ')                 ABC ABC áBç ÁBÇ
   upper('abc ABC ábç ÁBÇ') collate DU_NL9,             ABC ABC áBç ÁBÇ
   upper('abc ABC ábç ÁBÇ'),                            ABC ABC áBç ÁBÇ
   upper(u.MIXED_ISO8859_15)                            ABC ABC áBç ÁBÇ
*  upper(u.MIXED_ISO8859_15_DU_NL9)                     ABC ABC ÁBÇ ÁBÇ
   upper(u.MIXED_ISO8859_15_ISO8859_15)                 ABC ABC áBç ÁBÇ
   upper(u.MIXED_ISO8859_15)  collate DU_NL9            ABC ABC áBç ÁBÇ
*  upper(u.MIXED_ISO8859_15_DU_NL9) collate DU_NL       ABC ABC ÁBÇ ÁBÇ
   upper(u.MIXED_ISO8859_15_ISO8859_15) collate DU_NL9  ABC ABC áBç ÁBÇ
   u.MIXED_ISO8859_15                                   abc ABC ábç ÁBÇ
   u.MIXED_ISO8859_15_DU_NL9                            abc ABC ábç ÁBÇ
   u.MIXED_ISO8859_15_ISO8859_15                        abc ABC ábç ÁBÇ

First of all, lets explain the character set trick for the string literal on the first line.
It referenced in this newsgroup thread by Craig Stunz (of well known InterBase fame). He tells it is explained in “Understanding data retrieval with SELECT” section of the InterBase “Embedded SQL Guide”.
A PDF of that ships with each InterBase installation, but there are also a couple of them on-line (for instance the InterBase 7.5 version of this guide and an InterBase 2009 version of this guide).
However, these guides only specify the use of the COLLATE keyword, not the character set trick for string literals.
But: it is a bit described in the Interbase 2009 Language Reference on pages 2-18, 2-59, 2-107, 2-120 and 2-134. It talks about charsetname, but never explains that the charsetname needs to be prepended by an underscore character.

When selecting a column, and that column comes from a table, then the column definition in that table determines the character set which is used.
When selecting a string literal, the character set would always be the default one (of either the database or the connection).
But the trick with the underscore-prefixed character set name allows you to specify the character set used for the literal.

Now lets look at the columns that give the correct UPPER results:

*  upper(_ISO8859_15 'abc ABC ábç ÁBÇ') collate DU_NL9  ABC ABC ÁBÇ ÁBÇ
*  upper(u.MIXED_ISO8859_15_DU_NL9)                     ABC ABC ÁBÇ ÁBÇ
*  upper(u.MIXED_ISO8859_15_DU_NL9) collate DU_NL       ABC ABC ÁBÇ ÁBÇ

UPPER gives the correct result only when:

it is a string literal that has both a character set specification and a specific collation sequence
it is a column that was defined with both both a character set specification and a specific collation sequence

So what we learned is that in order to get UPPER to work correctly, you will need a specific collation sequence (i.e. not the default aka generic one).
This particular database is going to be deployed in The Netherlands at first, so we wanted Dutch collation sequence.

This query shows all character sets having a Dutch collation sequence:

select
  chars.rdb$character_set_name,
  cols.*
from
  rdb$collations cols
left join
  rdb$character_sets chars
on
  chars.rdb$character_set_id = cols.rdb$character_set_id
where
  cols.rdb$collation_name like 'DU_NL%'
order by
  cols.rdb$character_set_id, rdb$collation_id

And this is the result: character sets ISO8859_1 and ISO8859_15 can support the Dutch collation sequence.
Note it is named DU_NL or DU_NL9, and that in SQL statements sometimes you need to use the former ot the latter (even if the name does not match the one you’d expect with the character set it belongs to).

RDB$CHARACTER_SET_NAME  RDB$COLLATION_NAME  RDB$COLLATION_ID  RDB$CHARACTER_SET_ID  RDB$COLLATION_ATTRIBUTES  RDB$SYSTEM_FLAG  RDB$DESCRIPTION  RDB$FUNCTION_NAME
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
ISO8859_1               DU_NL               2                 21                                              1
ISO8859_15              DU_NL9              2                 39                                              1

Finally, as promised, the SQL query to show the default collation for each character set has the same name as that character set:

select
  chars.rdb$character_set_name, cols.*
from
  rdb$collations cols
left join
  rdb$character_sets chars
on
  chars.rdb$character_set_id = cols.rdb$character_set_id
where
  cols.rdb$collation_id = 0 /* */
order by
  cols.rdb$character_set_id, rdb$collation_id

This entry was posted on 2009/12/01 at 06:00 and is filed under Database Development, Delphi, Development, Firebird, InterBase. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

	Jeroen Wiert Pluimer… on Pie Comic by John McNamee: Mov…
	Attila Kovacs on Crowbarring Windows 95 into Wi…
	Jeroen Wiert Pluimer… on Does Odido (the old T-Mobile N…
	Lars Fosdal on Security alarm provider Woonve…
	Thomas Mueller on Question got closed in May 202…

The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

Subscribe

Archives

Recent Comments

Recent Posts

Blog Stats

Meta title

Tag Cloud Title

Top Clicks

Top Posts

My badges

Twitter Updates

My Flickr Stream

Pages

All categories

Email Subscription