InterBase 2007: UPPER and collations (and a trick to specify the character set for string literals)
Posted by jpluimers on 2009/12/01
One of the applications that I’m currently involved with will in the future be used in different European countries.
Since it is developed in Delphi 2007, and uses InterBase 2007 we have chosen to use the ISO8859_15 character set: it is a single byte character set very similar to ISO8859_1, but supports the euro sign (€) and some other characters (Š, š, Ž, ž, Œ, œ and Ÿ).
When testing, we found out that UPPER would not always function as we expected.
Hence below some explanation about how UPPER behaves depending on the character sets and collation sequences you specify.
Note: most of this also holds for other versions of InterBase and for FireBird versions, but I have not checked everything with all versions yet, FireBird might be different as this piece of documentation shows.
If anyone does have the results for other InterBase or FireBird versions, please let me know the results.
In addition to having ISO8859_15 as the character set, we also had ISO8859_15 as collation sequence.
However, when testing with diacritics, we found out that the UPPER function in InterBase only works when you have a specific collation.
For generic collations (like NONE or a collation with the same name as the character set) we did not get the results we expected.
For showing you what we observed, lets start with an example table and a record of test data:
CREATE TABLE UPPER_TEST ( MIXED_ISO8859_15_ISO8859_15 VARCHAR( 20) CHARACTER SET ISO8859_1 COLLATE ISO8859_15, MIXED_ISO8859_15_DU_NL9 VARCHAR( 20) CHARACTER SET ISO8859_1 COLLATE DU_NL, MIXED_ISO8859_15 VARCHAR( 20) CHARACTER SET ISO8859_1 ) ; INSERT INTO UPPER_TEST ( MIXED_ISO8859_15_ISO8859_15, MIXED_ISO8859_15_DU_NL9, MIXED_ISO8859_15 ) VALUES ( 'abc ABC ábç ÁBÇ', 'abc ABC ábç ÁBÇ', 'abc ABC ábç ÁBÇ' );
So, all three columns have the same value.
You see that we used the character set ISO8859_15, and collation sequences ISO8859_15, DU_NL (aka DU_NL9, see below), and none at all (which becomes the default collation for the character set used, see the sample query at the very end).
First a bit more about character sets and collaction sequences.
There are quite a few on-line articles on this topic (including this great archived article by Peter Gulutzan – of MySQL and SQL Performance Tuning fame – that used to be on dbazine.com), so I’ll keep this brief.
InterBase 2007 (like many other DBMS systems) has both character sets and collation sequences.
Character sets define how characters are stored: single byte (and if so: which characters), multi-byte (and how) just like character encoding in generic computing.
Collation sequences define much more: sorting order, comparison, expansion, upper casing, etc.
An InterBase database can have a default character set (when omitted it is NONE), but the default collation sequence is implicit: if you do not define a collation sequence it has the same name as the character set.
(Sidenote: you can define your own collation drivers for InterBase).
For each character set, you can only use the collation sequences that belong to it.
This shows the list of character sets and collation sequences defined in your database:
select chars.rdb$character_set_name, cols.* from rdb$collations cols left join rdb$character_sets chars on chars.rdb$character_set_id = cols.rdb$character_set_id order by cols.rdb$character_set_id, rdb$collation_id
Now lets look at some ways of getting ‘the’ uppercase version of these strings.
select upper(_ISO8859_15 'abc ABC ábç ÁBÇ') collate DU_NL9, upper(u.MIXED_ISO8859_15), upper(u.MIXED_ISO8859_15_DU_NL9), upper(u.MIXED_ISO8859_15_ISO8859_15), upper(u.MIXED_ISO8859_15) collate DU_NL9, upper(u.MIXED_ISO8859_15_DU_NL9) collate DU_NL, upper(u.MIXED_ISO8859_15_ISO8859_15) collate DU_NL9, u.MIXED_ISO8859_15, u.MIXED_ISO8859_15_DU_NL9, u.MIXED_ISO8859_15_ISO8859_15 from UPPER_TEST u
These are the results for each line in the query, with a star for each line that has the expected UPPER result:
* upper(_ISO8859_15 'abc ABC ábç ÁBÇ') collate DU_NL9 ABC ABC ÁBÇ ÁBÇ
upper(_ISO8859_15 'abc ABC ábç ÁBÇ') ABC ABC áBç ÁBÇ
upper('abc ABC ábç ÁBÇ') collate DU_NL9, ABC ABC áBç ÁBÇ
upper('abc ABC ábç ÁBÇ'), ABC ABC áBç ÁBÇ
upper(u.MIXED_ISO8859_15) ABC ABC áBç ÁBÇ
* upper(u.MIXED_ISO8859_15_DU_NL9) ABC ABC ÁBÇ ÁBÇ
upper(u.MIXED_ISO8859_15_ISO8859_15) ABC ABC áBç ÁBÇ
upper(u.MIXED_ISO8859_15) collate DU_NL9 ABC ABC áBç ÁBÇ
* upper(u.MIXED_ISO8859_15_DU_NL9) collate DU_NL ABC ABC ÁBÇ ÁBÇ
upper(u.MIXED_ISO8859_15_ISO8859_15) collate DU_NL9 ABC ABC áBç ÁBÇ
u.MIXED_ISO8859_15 abc ABC ábç ÁBÇ
u.MIXED_ISO8859_15_DU_NL9 abc ABC ábç ÁBÇ
u.MIXED_ISO8859_15_ISO8859_15 abc ABC ábç ÁBÇ
First of all, lets explain the character set trick for the string literal on the first line.
It referenced in this newsgroup thread by Craig Stunz (of well known InterBase fame). He tells it is explained in “Understanding data retrieval with SELECT” section of the InterBase “Embedded SQL Guide”.
A PDF of that ships with each InterBase installation, but there are also a couple of them on-line (for instance the InterBase 7.5 version of this guide and an InterBase 2009 version of this guide).
However, these guides only specify the use of the COLLATE keyword, not the character set trick for string literals.
But: it is a bit described in the Interbase 2009 Language Reference on pages 2-18, 2-59, 2-107, 2-120 and 2-134. It talks about charsetname, but never explains that the charsetname needs to be prepended by an underscore character.
When selecting a column, and that column comes from a table, then the column definition in that table determines the character set which is used.
When selecting a string literal, the character set would always be the default one (of either the database or the connection).
But the trick with the underscore-prefixed character set name allows you to specify the character set used for the literal.
Now lets look at the columns that give the correct UPPER results:
* upper(_ISO8859_15 'abc ABC ábç ÁBÇ') collate DU_NL9 ABC ABC ÁBÇ ÁBÇ * upper(u.MIXED_ISO8859_15_DU_NL9) ABC ABC ÁBÇ ÁBÇ * upper(u.MIXED_ISO8859_15_DU_NL9) collate DU_NL ABC ABC ÁBÇ ÁBÇ
UPPER gives the correct result only when:
- it is a string literal that has both a character set specification and a specific collation sequence
- it is a column that was defined with both both a character set specification and a specific collation sequence
So what we learned is that in order to get UPPER to work correctly, you will need a specific collation sequence (i.e. not the default aka generic one).
This particular database is going to be deployed in The Netherlands at first, so we wanted Dutch collation sequence.
This query shows all character sets having a Dutch collation sequence:
select chars.rdb$character_set_name, cols.* from rdb$collations cols left join rdb$character_sets chars on chars.rdb$character_set_id = cols.rdb$character_set_id where cols.rdb$collation_name like 'DU_NL%' order by cols.rdb$character_set_id, rdb$collation_id
And this is the result: character sets ISO8859_1 and ISO8859_15 can support the Dutch collation sequence.
Note it is named DU_NL or DU_NL9, and that in SQL statements sometimes you need to use the former ot the latter (even if the name does not match the one you’d expect with the character set it belongs to).
RDB$CHARACTER_SET_NAME RDB$COLLATION_NAME RDB$COLLATION_ID RDB$CHARACTER_SET_ID RDB$COLLATION_ATTRIBUTES RDB$SYSTEM_FLAG RDB$DESCRIPTION RDB$FUNCTION_NAME ----------------------------------------------------------------------------------------------------------------------------------------------------------------- ISO8859_1 DU_NL 2 21 1 ISO8859_15 DU_NL9 2 39 1
Finally, as promised, the SQL query to show the default collation for each character set has the same name as that character set:
select chars.rdb$character_set_name, cols.* from rdb$collations cols left join rdb$character_sets chars on chars.rdb$character_set_id = cols.rdb$character_set_id where cols.rdb$collation_id = 0 /* */ order by cols.rdb$character_set_id, rdb$collation_id






Leave a comment