Monday, May 29, 2017

Collation in 12cR2 - when AA equals Å (or not?)

Collation in Oracle 12cR2 gives some more finegrained possibilities for setting up how your data is to sorted and compared linguistically. A lot we could do before with NLS_SORT and NLS_COMP and the likes, but the collation features can both be simpler to use as well as offer more detailed control.

But do some testing for your specific language cases - you might find some small surprises like I did when trying it out with danish.

Here's how I tried danish column level collation...

To explain a bit to the international readers: the danish alphabet has 29 letters, the standard 26 A-Z followed by three special characters Æ, Ø and Å. In the old days, we didn't have the Å and instead the same sound was written AA (double-A). Today words in the language all use Å, but quite a few names (people or places) still use AA where others use Å, and you can't hear the difference when someone pronounces his name whether he spells it Vestergård or Vestergaard.

So for a proper danish sorting of names, AA should be considered identical to the letter Å. Linguistic NLS sorting in Oracle has been capable of this for many versions, collation does not bring any news on this front.

But you can often have a table with some name columns that need to be linguistically sorted and compared, and at the same time some alphabetic key columns (primary or foreign) that need to be binary sorted and compared. You could solve that by putting NLS_SORT in the queries, but with collation you can specify right down to column level how you want the data sorted and compared, so the queries don't need to do anything but just query the data.

For collation to work, I've had to set the database parameter MAX_STRING_SIZE to EXTENDED.

Before the demonstration, I'm first making certain my session is not using Danish NLS sorting (just to be sure not to be accused of cheating ;-) :

alter session set nls_sort = binary
/

For collation, I can use the values that are valid for NLS sorting too. So I can find the possibilities for Danish like this:

select value
  from v$nls_valid_values
 where parameter = 'SORT'
   and value like '%DANISH%'
 order by value
/

VALUE                                                          
----------------------------------------------------------------
DANISH                                                          
DANISH_M                                                        
UCA0610_DANISH                                                  
UCA0620_DANISH                                                  
UCA0700_DANISH                                                  
XDANISH                                                         

A little testing showed I want the DANISH_M value (later I'll show the difference to the DANISH value.)

I create a table specifying the default collation for the table to be BINARY_CI and the specific collation for column CITY to be DANISH_M_CI - this would typify cases where an alphabetic key is used and treated as "generic ascii characters" without any linguistic properties, while the information carrying column is to be considered in a specific language. In both cases the suffix _CI indicates to the database that I want Case Insensitive collation (since data has been entered in a horrible mishmash of upper and lower case.)

create table stores (
   store_id    varchar2(5  char) primary key
 , city        varchar2(20 char) collate danish_m_ci
)
default collation binary_ci
/

insert into stores values ('AA001', 'København')
/
insert into stores values ('AA002', 'Korsør')
/
insert into stores values ('AB001', 'Aarhus')
/
insert into stores values ('AB002', 'Ålestrup')
/
insert into stores values ('BA001', 'karup')
/
insert into stores values ('BA002', 'KYBEHUSE')
/
insert into stores values ('BB001', 'AALBORG')
/
insert into stores values ('BB002', 'andst')
/
insert into stores values ('ÅÅ001', 'AUNING')
/
insert into stores values ('ÅÅ002', 'Kaastrup')
/
commit
/

Column STORE_ID inherits the table default collation, so when I order by STORE_ID, it's sorted with BINARY_CI collation:

select store_id, city
  from stores
 order by store_id
/

STORE CITY               
----- --------------------
AA001 København           
AA002 Korsør              
AB001 Aarhus              
AB002 Ålestrup            
BA001 karup               
BA002 KYBEHUSE            
BB001 AALBORG             
BB002 andst               
ÅÅ001 AUNING              
ÅÅ002 Kaastrup            

But column CITY has a different collation specified, so when I order by CITY, it's sorted with DANISH_M_CI collation. The sorting is case insensitive, and the double-A's are sorted like they were an Å:

select store_id, city
  from stores
 order by city
/

STORE CITY               
----- --------------------
BB002 andst               
ÅÅ001 AUNING              
BA001 karup               
AA002 Korsør              
BA002 KYBEHUSE            
AA001 København           
ÅÅ002 Kaastrup            
BB001 AALBORG             
AB002 Ålestrup            
AB001 Aarhus              

Three cities have been affected by double-A considered identical to Å:
  • Kaastrup sorted as Kåstrup
  • AALBORG sorted as ÅLBORG
  • Aarhus sorted as ÅRHUS
We can inspect the collation used for the columns with COLLATION function:

select store_id, city
     , collation(store_id) store_coll
     , collation(city) city_coll
  from stores
 order by store_id
/

STORE CITY                 STORE_COL CITY_COLL 
----- -------------------- --------- -----------
AA001 København            BINARY_CI DANISH_M_CI
AA002 Korsør               BINARY_CI DANISH_M_CI
AB001 Aarhus               BINARY_CI DANISH_M_CI
AB002 Ålestrup             BINARY_CI DANISH_M_CI
BA001 karup                BINARY_CI DANISH_M_CI
BA002 KYBEHUSE             BINARY_CI DANISH_M_CI
BB001 AALBORG              BINARY_CI DANISH_M_CI
BB002 andst                BINARY_CI DANISH_M_CI
ÅÅ001 AUNING               BINARY_CI DANISH_M_CI
ÅÅ002 Kaastrup             BINARY_CI DANISH_M_CI

The collation on the column can be overruled with the COLLATE operator (as an alternative to sorting by NLSSORT function.) So we can sort by the STORE_ID column in a DANISH_M_CI collation, so the double-A's are sorted like they were a single Å:

select store_id, city
  from stores
 order by store_id collate danish_m_ci
/

STORE CITY               
----- --------------------
AB001 Aarhus              
AB002 Ålestrup            
BA001 karup               
BA002 KYBEHUSE            
BB001 AALBORG             
BB002 andst               
AA001 København           
AA002 Korsør              
ÅÅ001 AUNING              
ÅÅ002 Kaastrup            

Overruling does not change the collation of the column itself, it only is the collation of the expression that is affected:

select store_id, city
     , collation(store_id) store_coll
     , collation(store_id collate danish_m_ci) expr_coll
  from stores
 order by store_id collate danish_m_ci
/

STORE CITY                 STORE_COL EXPR_COLL 
----- -------------------- --------- -----------
AB001 Aarhus               BINARY_CI DANISH_M_CI
AB002 Ålestrup             BINARY_CI DANISH_M_CI
BA001 karup                BINARY_CI DANISH_M_CI
BA002 KYBEHUSE             BINARY_CI DANISH_M_CI
BB001 AALBORG              BINARY_CI DANISH_M_CI
BB002 andst                BINARY_CI DANISH_M_CI
AA001 København            BINARY_CI DANISH_M_CI
AA002 Korsør               BINARY_CI DANISH_M_CI
ÅÅ001 AUNING               BINARY_CI DANISH_M_CI
ÅÅ002 Kaastrup             BINARY_CI DANISH_M_CI

An expression with a COLLATE operator can be given column alias and the alias keeps the collation information:

select store_id, city, store_id_m
     , collation(store_id) store_coll
     , collation(store_id_m) store_m_coll
  from (
   select store_id, city
        , store_id collate danish_m_ci as store_id_m
     from stores
       )
 order by store_id_m 
/

STORE CITY                 STORE STORE_COL STORE_M_COL
----- -------------------- ----- --------- -----------
AB001 Aarhus               AB001 BINARY_CI DANISH_M_CI
AB002 Ålestrup             AB002 BINARY_CI DANISH_M_CI
BA001 karup                BA001 BINARY_CI DANISH_M_CI
BA002 KYBEHUSE             BA002 BINARY_CI DANISH_M_CI
BB001 AALBORG              BB001 BINARY_CI DANISH_M_CI
BB002 andst                BB002 BINARY_CI DANISH_M_CI
AA001 København            AA001 BINARY_CI DANISH_M_CI
AA002 Korsør               AA002 BINARY_CI DANISH_M_CI
ÅÅ001 AUNING               ÅÅ001 BINARY_CI DANISH_M_CI
ÅÅ002 Kaastrup             ÅÅ002 BINARY_CI DANISH_M_CI

Using the COLLATE operator to overrule a column collation allows us to see, that the difference between DANISH_CI and DANISH_M_CI is, that although both are case insensitive due to the _CI prefix, DANISH sorts double-A's as two single A's (unlike DANISH_M that considers a double-A to be the same as an Å):

select store_id, city
  from stores
 order by city collate danish_ci
/

STORE CITY               
----- --------------------
BB001 AALBORG             
AB001 Aarhus              
BB002 andst               
ÅÅ001 AUNING              
ÅÅ002 Kaastrup            
BA001 karup               
AA002 Korsør              
BA002 KYBEHUSE            
AA001 København           
AB002 Ålestrup            

The COLLATE operator also allows certain pseudo-collations to be used. Unlike a named collation like DANISH_M_CI, the collation USING_NLS_SORT is not a specific collation, but tells the database to use the collation of the currently active NLS settings, in this case BINARY:

select store_id, city
  from stores
 order by city collate using_nls_sort
/

STORE CITY               
----- --------------------
BB001 AALBORG             
ÅÅ001 AUNING              
AB001 Aarhus              
BA002 KYBEHUSE            
ÅÅ002 Kaastrup            
AA002 Korsør              
AA001 København           
BB002 andst               
BA001 karup               
AB002 Ålestrup            

The collations are not just used for sorting, but also for comparison (similar to NLS_COMP.) So when we look for cities containing a U, this comparison is case insensitive:

select store_id, city
  from stores
 where city like '%U%'
 order by city
/

STORE CITY               
----- --------------------
ÅÅ001 AUNING              
BA001 karup               
BA002 KYBEHUSE            
ÅÅ002 Kaastrup            
AB002 Ålestrup            
AB001 Aarhus              

Not just LIKE but other functions as well use the collation information, for example INSTR:

select store_id, city
  from stores
 where instr(city, 'h') > 0
 order by city
/

STORE CITY               
----- --------------------
BA002 KYBEHUSE            
AA001 København           
AB001 Aarhus              

And some comparison operators as well (the sequence of Å's following the A is because Å is the last letter of the danish alphabet.) This finds cities beginning with an A, and since the column has DANISH_M_CI comparison, double-A's are not consider as two single-A's, so those two cities are not found here:

select store_id, city
  from stores
 where city >= 'a' and city <= 'aååååååååååååååååååå'
 order by city
/

STORE CITY               
----- --------------------
BB002 andst               
ÅÅ001 AUNING              

But using >= and <= comparison to look for cities beginning with an Å finds the two cities with double-A:

select store_id, city
  from stores
 where city >= 'å' and city <= 'åååååååååååååååååååå'
 order by city
/

STORE CITY               
----- --------------------
BB001 AALBORG             
AB002 Ålestrup            
AB001 Aarhus              

The interesting surprise I found, is a bit different behaviour in LIKE operator. If we use LIKE to search for cities beginning with A, we do not get those with double-A's, so that's OK:

select store_id, city
  from stores
 where city like 'a%'
 order by city
/

STORE CITY               
----- --------------------
BB002 andst               
ÅÅ001 AUNING              

Then I would expect them to be part of the result when I use LIKE to search for cities beginning with Å - but no, they are not there either:

select store_id, city
  from stores
 where city like 'å%'
 order by city
/

STORE CITY               
----- --------------------
AB002 Ålestrup            

In order to find them with LIKE, I need explicitly to look for two A's:

select store_id, city
  from stores
 where city like 'aa%'
 order by city
/

STORE CITY               
----- --------------------
BB001 AALBORG             
AB001 Aarhus              

Using >= and <= involves ordering/sorting to identify what is greater than or less than, so that should work fine, and it does.

Using LIKE could conceptually be thought of as the same (in which case it should have given same result) or it could be thought of something like SUBSTR(city,1,1)='a' (in which case the double-A's should have been found when searching for cities beginning with A.) The weird part is that LIKE seems to do neither, but has a logic all of its own? I don't know if that's expected behaviour or perhaps sort of buggy...

Anyway, my general conclusion is that using the new collation features makes some things easier to do and implement than using NLS_SORT / NLS_COMP. I'd suggest looking into collation if you have an application with those needs.

No comments:

Post a Comment