Tuesday, May 1, 2018

Corrupting characters - How to get invalid byte values stored in strings

Having worked with Database Migration Assistant for Unicode (DMU) to convert some databases from single-byte charactersets to AL32UTF8, I had problems with DMU reporting a lot of characters with invalid byte values (in this case binary values that did not exist in WE8ISO8859P15.)

So how can that happen? Doesn't the database enforce character encoding to match the database characterset?

Wrong - not always.

Well ok, you can get invalid values with single-byte charactersets - but once the database is AL32UTF8 then it can store all the characters in the world, so then it cannot happen, right?

Wrong again - you can still get corrupt character data if you do it the wrong way.




Let me demonstrate getting invalid byte representations in character values in a database with AL32UTF8 characterset:

select parameter, value from nls_database_parameters where parameter = 'NLS_CHARACTERSET';

PARAMETER            VALUE
-------------------- --------------------
NLS_CHARACTERSET     AL32UTF8

I'll be using SQL*Plus in specific environments I'll specify as I go along. Using GUI tools or other environments may give you different results.

A table of movies with a column TITLE is what I'll use to demonstrate the invalid bytes - the other two columns are to distinguish between the different INSERT calls I'll be doing:

create table movies (
   seq         integer
 , title       varchar2(30 char)
 , inserted_by varchar2(30 char)
);

First I'll use a session in Linux that uses a UTF-8 locale so it matches the database characterset:

[oracle@vbgeneric ~]$ locale | grep LANG
LANG=en_US.UTF-8

And I set the NLS_LANG environment variable so it matches the OS locale:

[oracle@vbgeneric ~]$ export NLS_LANG=american_america.al32utf8

I start SQL*Plus and execute INSERT number 1, inserting the title of a Mexican film from 1971:

insert into movies values (1, 'Jesús, nuestro Señor', 'Lin UTF-8 AL32UTF8');
commit;

And I query the content of the table as it looks now:

select inserted_by, title
     , lengthb(title) as bytes, lengthc(title) as chars
     , dump(title) as title_dump
  from movies;

INSERTED_BY          TITLE                   BYTES CHARS TITLE_DUMP
-------------------- ----------------------- ----- ----- --------------------------------
Lin UTF-8 AL32UTF8   Jesús, nuestro Señor       22    20 Typ=1 Len=22: 74,101,115,195,186
                                                         ,115,44,32,110,117,101,115,116,1
                                                         14,111,32,83,101,195,177,111,114

The characters ú and ñ each use two bytes in the AL32UTF8 characterset - (195,186) and (195,177) respectively in the title dump.
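These byte values can be verified outside the database too - here is a quick Python sketch (Python strings are Unicode, and AL32UTF8 is Oracle's name for the UTF-8 encoding):

```python
# Character and byte counts of the title in UTF-8,
# matching the LENGTHC / LENGTHB / DUMP output above.
title = 'Jesús, nuestro Señor'
encoded = title.encode('utf-8')

print(len(title))                   # 20 characters
print(len(encoded))                 # 22 bytes
print(list('ú'.encode('utf-8')))    # [195, 186]
print(list('ñ'.encode('utf-8')))    # [195, 177]
```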

So here everything is peachy - the client uses AL32UTF8 and has told the database this fact via the NLS_LANG environment variable, so the client is trusted to send valid AL32UTF8 and no conversion happens.


Let me try using a session in a Windows CMD command-line that uses a US codepage 437:

C:\>chcp
Active code page: 437

And I set the NLS_LANG environment variable so it matches the OS locale:

C:\>set NLS_LANG=american_america.us8pc437

I start SQL*Plus and execute INSERT number 2, inserting the same title (just with a different INSERTED_BY value):

insert into movies values (2, 'Jesús, nuestro Señor', 'Win 437 US8PC437');
commit;

And I query the content of the table again with the same select statement as above:

INSERTED_BY          TITLE                   BYTES CHARS TITLE_DUMP
-------------------- ----------------------- ----- ----- --------------------------------
Lin UTF-8 AL32UTF8   Jesús, nuestro Señor       22    20 Typ=1 Len=22: 74,101,115,195,186
                                                         ,115,44,32,110,117,101,115,116,1
                                                         14,111,32,83,101,195,177,111,114

Win 437 US8PC437     Jesús, nuestro Señor       22    20 Typ=1 Len=22: 74,101,115,195,186
                                                         ,115,44,32,110,117,101,115,116,1
                                                         14,111,32,83,101,195,177,111,114

The results of the two inserts are identical - even though my Windows CMD SQL*Plus session uses the single-byte characterset US8PC437. And they both display correctly on my single-byte display, even though the dump shows the column content to be AL32UTF8 in both cases.

This is due to setting the NLS_LANG environment variable properly to match the OS characterset. The database then recognizes that the client is sending single-byte character values encoded in US8PC437, so they are converted upon INSERT to the database characterset. Likewise on SELECT it recognizes that the client expects single-byte character values encoded in US8PC437, so the multi-byte database values encoded in AL32UTF8 are converted to the client characterset. The characters ú and ñ both exist in the US8PC437 characterset, so they can be converted correctly.
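The conversion the Oracle client layer performs here can be simulated outside the database - a Python sketch using the cp437 codec (Oracle does the equivalent internally; the byte positions shown apply to this particular title):

```python
# Simulate the US8PC437 -> AL32UTF8 conversion the Oracle client layer
# performs when NLS_LANG correctly declares the client characterset.
title = 'Jesús, nuestro Señor'

# What the CP437 client actually sends over the wire - ú is one byte here:
cp437_bytes = title.encode('cp437')
print(cp437_bytes[3])       # 163 - the single-byte ú of CP437

# The database decodes from the declared client charset and stores UTF-8:
stored = cp437_bytes.decode('cp437').encode('utf-8')
print(list(stored[3:5]))    # [195, 186] - the same ú stored as two bytes
```

Because both character repertoires contain ú and ñ, the round trip is lossless - which is exactly why both rows look identical in the dump.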

So far no problems. Now I'll create some problems...


Again I'll use a session on Linux with UTF-8 OS locale:

[oracle@vbgeneric ~]$ locale | grep LANG
LANG=en_US.UTF-8

But this time I simulate that I have forgotten to set the NLS_LANG environment variable by unsetting it:

[oracle@vbgeneric ~]$ unset NLS_LANG

I start SQL*Plus and execute INSERT number 3, inserting the same title again (just with a third INSERTED_BY value):

insert into movies values (3, 'Jesús, nuestro Señor', 'Lin UTF-8 {unset}');
commit;

And I query the content of the table again with the same select statement as above:

INSERTED_BY          TITLE                   BYTES CHARS TITLE_DUMP
-------------------- ----------------------- ----- ----- --------------------------------
Lin UTF-8 AL32UTF8   Jesus, nuestro Se?or       22    20 Typ=1 Len=22: 74,101,115,195,186
                                                         ,115,44,32,110,117,101,115,116,1
                                                         14,111,32,83,101,195,177,111,114
Win 437 US8PC437     Jesus, nuestro Se?or       22    20 Typ=1 Len=22: 74,101,115,195,186
                                                         ,115,44,32,110,117,101,115,116,1
                                                         14,111,32,83,101,195,177,111,114
Lin UTF-8 {unset}    Jes??s, nuestro Se??or     30    22 Typ=1 Len=30: 74,101,115,239,191
                                                         ,189,239,191,189,115,44,32,110,1
                                                         17,101,115,116,114,111,32,83,101
                                                         ,239,191,189,239,191,189,111,114

Now what happened?

The third insert has stored 22 characters using 30 bytes? The output of that insert shows 22 characters? The outputs of the two previous inserts show 20 characters as expected, but the ú has become a plain u and the ñ has become a question mark?

To figure out why, I'll take a look at what the database thinks the client characterset is:

select sci.client_charset
from v$session_connect_info sci
where sci.sid = sys_context('USERENV', 'SID')
and network_service_banner like 'TCP%';

CLIENT_CHARSET
----------------------------------------
US7ASCII

Since NLS_LANG is not set, it defaults to believing the client is using US 7-bit ASCII...

So my INSERT is performed from a client that actually sends UTF-8 encoded characters, but the database believes them to be 7-bit ASCII and tries to convert accordingly. The ú character is sent by the client as two bytes (195,186), each of which the database tries to convert from US7ASCII to AL32UTF8. Those two byte values do not exist in US7ASCII, so each is instead converted to the UTF-8 character � which is encoded as three bytes (239,191,189). (Depending on your browser font this may display as a black diamond with a white question mark, or as a square, or something else.)

The same happens to the ñ character, so in total those two characters are converted to 4 occurrences of the � character, which can be observed in the title dump and accounts for the 30 bytes.
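This mistaken conversion can be simulated too - a Python sketch that decodes the UTF-8 bytes as if they were 7-bit ASCII, which is roughly what the database does here:

```python
# The client really sends UTF-8, but the database treats the bytes as ASCII.
# Decoding UTF-8 bytes with the ascii codec and replacing every invalid byte
# with U+FFFD (the replacement character) mimics that mistake.
sent = 'Jesús, nuestro Señor'.encode('utf-8')      # 22 bytes on the wire

garbled = sent.decode('ascii', errors='replace')   # what the DB thinks it got
print(garbled)                                     # Jes??s, nuestro Se??or
print(len(garbled))         # 22 characters - each bad byte became one U+FFFD

stored = garbled.encode('utf-8')                   # stored as AL32UTF8
print(len(stored))          # 30 bytes - each U+FFFD takes 3 bytes
print(list('\ufffd'.encode('utf-8')))              # [239, 191, 189]
```

The 22 characters and 30 bytes match the LENGTHC and LENGTHB values of the third row exactly.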

Then when we query the titles from the database, a conversion happens from AL32UTF8 to US7ASCII. For some reason those conversion rules let Oracle convert the ú from the first two inserts to a plain u, but there is no similar conversion rule to convert ñ to a plain n - instead it is converted to a question mark.

The 4 � characters also cannot be converted to a meaningful US7ASCII character - they are converted to question marks.


OK, so when I left NLS_LANG unset, the database believed the client characterset to be a different one than it actually was, and tried to perform conversion from and to the perceived characterset instead of the correct client characterset. This created some bad conversions, but so far I still haven't got any invalid bytes in the database - just some wrong but valid characters.

I've tried INSERT twice with correct NLS_LANG that matches OS locale, and then INSERT with no NLS_LANG set. Now I'll be really naughty and totally mess up my settings...


I'll use a session on Linux that I set the OS locale to use Danish with an ISO-8859-1 characterset (and I set my MobaXTerm terminal setting accordingly too, so my display shows ISO-8859-1 characters correctly):

[oracle@vbgeneric bin]$ export LANG=da_DK.iso88591
[oracle@vbgeneric bin]$ locale | grep LANG
LANG=da_DK.iso88591

I set NLS_LANG too, but I set it wrong - instead of matching the client OS locale, I am setting it to match the database characterset:

[oracle@vbgeneric bin]$ export NLS_LANG=american_america.al32utf8

I start SQL*Plus and test what characterset it perceives my client to use:

select sci.client_charset
from v$session_connect_info sci
where sci.sid = sys_context('USERENV', 'SID')
and network_service_banner like 'TCP%';

CLIENT_CHARSET
----------------------------------------
AL32UTF8

It believes what I set in NLS_LANG - which was the wrong characterset I set there.

So I execute INSERT number 4, inserting the same title once again:

insert into movies values (4, 'Jesús, nuestro Señor', 'Lin ISO8859 AL32UTF8');

ERROR:
ORA-01756: quoted string not properly terminated

Oops?

Now I cannot even insert that title? Why?

In this case the database believes the client to use the same characterset as the database, so it (just like the first insert) does not perform any conversion at all. It believes the byte values that arrive from the client are correct AL32UTF8 byte values.

But the client is actually sending ISO-8859-1 encoded values, and the ñ character in ISO-8859-1 is decimal 241 or binary 11110001. The database believes this to be UTF-8, and in UTF-8 a character whose first byte matches 11110xxx must (by the definition of the UTF-8 encoding) be a 4-byte character. So the database thinks the 4 bytes containing ñor' are actually a single 4-byte UTF-8 character. It therefore appears to the database that the closing single quote ' is missing, and it raises the ORA-01756 error.
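The bit-pattern claim is easy to check - a Python sketch (note that a strict UTF-8 decoder such as Python's rejects these bytes outright, whereas Oracle's SQL parser here simply consumed four bytes):

```python
# ñ in ISO-8859-1 is the single byte 0xF1 = binary 11110001.
b = 'ñ'.encode('iso-8859-1')[0]
print(b, format(b, '08b'))              # 241 11110001

# In UTF-8, a leading byte matching 11110xxx announces a 4-byte character:
print(b & 0b11111000 == 0b11110000)     # True

# A strict UTF-8 decoder rejects the whole title, because the bytes that
# follow the supposed leading bytes are not valid continuation bytes:
try:
    'Jesús, nuestro Señor'.encode('iso-8859-1').decode('utf-8')
except UnicodeDecodeError as e:
    print('invalid UTF-8:', e.reason)
```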


To continue the demonstration I'll add a couple of characters to the movie title:

insert into movies values (5, 'Jesús, nuestro Señores', 'Lin ISO8859 AL32UTF8');
commit;

This time the insert completes successfully, since now it is the 4-byte sequence ñore that is believed to be a 4-byte UTF-8 character, so the closing single quote ' is found and the byte values are inserted into the table.

Now I can query the content of the table once more with the same select statement as above:

INSERTED_BY          TITLE                   BYTES CHARS TITLE_DUMP
-------------------- ----------------------- ----- ----- --------------------------------
Lin UTF-8 AL32UTF8   Jesús, nuestro Señor       22    20 Typ=1 Len=22: 74,101,115,195,186
                                                         ,115,44,32,110,117,101,115,116,1
                                                         14,111,32,83,101,195,177,111,114
Win 437 US8PC437     Jesús, nuestro Señor       22    20 Typ=1 Len=22: 74,101,115,195,186
                                                         ,115,44,32,110,117,101,115,116,1
                                                         14,111,32,83,101,195,177,111,114
Lin UTF-8 {unset}    Jes��s, nuestro Se��or     30    22 Typ=1 Len=30: 74,101,115,239,191
                                                         ,189,239,191,189,115,44,32,110,1
                                                         17,101,115,116,114,111,32,83,101
                                                         ,239,191,189,239,191,189,111,114
Lin ISO8859 AL32UTF8 Jesús, nuestro Señores       22    19 Typ=1 Len=22: 74,101,115,250,115
                                                         ,44,32,110,117,101,115,116,114,1
                                                         11,32,83,101,241,111,114,101,115

Viewing the data it appears like the last insert is the only correct one, but that is misleading - it is in fact the one that now has corrupt and invalid data.

The titles from the first two inserts contain two-byte characters in the database, which are sent unconverted to the client (since the perceived client characterset is identical to the database characterset). But the client believes those bytes to be encoded in ISO-8859-1 and displays them accordingly as ú and ñ, which takes up twice the space SQL*Plus expects, so the columns are not aligned.

The same happens to the third insert, where the 4 � characters are sent unconverted as 4 sequences of 3 bytes, each of which is displayed in ISO-8859-1 as � - taking up in total 8 character positions more than SQL*Plus expects, so the columns are really badly aligned here.

For the last insert, the ISO-8859-1 byte values were inserted unconverted into the database, and they are sent back unconverted to the client, so of course the client displays them correctly. Since we added a trailing es to be allowed to insert the title, we actually should have 22 characters here - but the database reports with LENGTHC (column CHARS) that there are only 19 characters. This is again because it believes the 4-byte sequence ñore to be a single 4-byte UTF-8 character.

But what about the ú character? Why is that also not believed to be the beginning of a multi-byte UTF-8 character? This is because in the definition of UTF-8 encoding a single-byte character is always binary 0xxxxxxx, the first byte of two-byte characters is always 110xxxxx, the first byte of three-byte characters is always 1110xxxx and the first byte of a four-byte character is always 11110xxx. The ú character encoded in ISO-8859-1 is decimal 250 or binary 11111010, which does not fit any of the rules of UTF-8 encoding. Apparently the LENGTHC function then considers it a single character (even if it actually is an invalid, corrupt character.)
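These leading-byte rules can be written down as a small classifier - a Python sketch of the UTF-8 definition just described:

```python
def utf8_expected_length(lead):
    """Expected sequence length for a UTF-8 leading byte, or None if the
    byte cannot start any valid UTF-8 sequence at all."""
    if lead & 0b10000000 == 0b00000000:
        return 1                    # 0xxxxxxx - single-byte character
    if lead & 0b11100000 == 0b11000000:
        return 2                    # 110xxxxx - first of two bytes
    if lead & 0b11110000 == 0b11100000:
        return 3                    # 1110xxxx - first of three bytes
    if lead & 0b11111000 == 0b11110000:
        return 4                    # 11110xxx - first of four bytes
    return None                     # continuation byte or invalid

print(utf8_expected_length(0xF1))   # 4    - ñ in ISO-8859-1 (241)
print(utf8_expected_length(0xFA))   # None - ú in ISO-8859-1 (250)
```

So 241 looks like the start of a 4-byte character, while 250 matches no rule at all - which is why LENGTHC swallows 4 bytes for the one and counts a single (invalid) character for the other.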


I can go back to a correctly setup UTF-8 Linux session again (remembering to set the terminal accordingly):

[oracle@vbgeneric bin]$ export LANG=en_US.UTF-8
[oracle@vbgeneric bin]$ locale | grep LANG
LANG=en_US.UTF-8

[oracle@vbgeneric bin]$ export NLS_LANG=american_america.al32utf8

And do a final query of the content of the table just to sum up:

INSERTED_BY          TITLE                   BYTES CHARS TITLE_DUMP
-------------------- ----------------------- ----- ----- --------------------------------
Lin UTF-8 AL32UTF8   Jesús, nuestro Señor       22    20 Typ=1 Len=22: 74,101,115,195,186
                                                         ,115,44,32,110,117,101,115,116,1
                                                         14,111,32,83,101,195,177,111,114
Win 437 US8PC437     Jesús, nuestro Señor       22    20 Typ=1 Len=22: 74,101,115,195,186
                                                         ,115,44,32,110,117,101,115,116,1
                                                         14,111,32,83,101,195,177,111,114
Lin UTF-8 {unset}    Jes��s, nuestro Se��or     30    22 Typ=1 Len=30: 74,101,115,239,191
                                                         ,189,239,191,189,115,44,32,110,1
                                                         17,101,115,116,114,111,32,83,101
                                                         ,239,191,189,239,191,189,111,114
Lin ISO8859 AL32UTF8 Jes▒s, nuestro Se▒ores       22    19 Typ=1 Len=22: 74,101,115,250,115
                                                         ,44,32,110,117,101,115,116,114,1
                                                         11,32,83,101,241,111,114,101,115

The first two inserts were fine - NLS_LANG matched the OS locale in both cases.

The third made wrong conversions of the data both going in and going out due to missing NLS_LANG - but the characters as such in the database are valid binary values for AL32UTF8.

The fourth session lied to the database and claimed to be sending UTF-8 binary values when it actually was sending ISO-8859-1 - the database believed the lie and stored binary values that are invalid in AL32UTF8.

In the last case I now have corrupt characters in the database. These cannot be displayed correctly in a client with a correct NLS_LANG setting - they will only display correctly on a client with the same wrong NLS_LANG setting as used at INSERT time.


The lesson is to be aware of your NLS_LANG setting - that it should match the client OS characterset / locale. I have seen several cases where it was believed that NLS_LANG should match the database characterset - which then potentially can lead to corrupt data as demonstrated here.

Check that your client installation did set NLS_LANG correctly - either as an environment variable or in the Windows registry for your client Oracle home. Be aware of whether your client uses NLS_LANG at all - for example some GUI tools may allow you to override the client characterset as a setting or option. On Windows also be aware that GUI programs typically use Windows-1252 / ANSI (a variant of ISO-8859-1), while CMD programs use an ASCII codepage like CP-437.

Have fun with the vagaries of charactersets :-)

5 comments:

  1. Hello Kim,

    Thank you so much for another excellent post :):)

    You are the best Oracle teacher that I can imagine :)

    By the way, since I know that you love to cook,
    if you are an equally amazing cooking teacher, then I should also learn from you,
    seriously :)

    Cheers & Best Regards,
    Iudith

    1. Thanks, Iudith

      Teaching cooking? You wouldn't want that - my cooking style is "you just use a pinch of this and a dash of that and whatever you have in the fridge" without exact measures and methodical step-by-step instructions ;-)

      Cheerio
      /Kim

  2. Very good post, thank you! Encoding-hell is one of the most annoying problems in software development.

    So, if I understood correctly, the NLS_LANG setting *must match* the client's encoding, right?

    Imagine I'm using UTF-8 in my app/client but the database uses ISO8859-1, so do I have to set the NLS_LANG up to UTF-8?

    1. Yes, exactly.

      The encoding part of NLS_LANG (the bit after the dot) must match the client encoding.
      It tells Oracle what encoding to use to interpret the bytes of text arriving from the client, as well as what encoding it should use to deliver text content back to the client.

      So if your client is using UTF-8 encoding, NLS_LANG should *always* be set to for example AMERICAN_AMERICA.AL32UTF8 (or whichever language and territory you use) - *no matter what the database characterset is* !

      If the database then uses for example WE8ISO8859P1, then content arriving from your client will be converted to that characterset on the way into the database.
      Those Unicode characters that exist in ISO8859-1 (like for example many of the characters with diacritics like ñ and ú) will be converted to the correct corresponding binary values.
Those Unicode characters that do *not* exist in ISO8859-1 (like smileys or Chinese characters) will become question marks (or some other character, like a square, used to indicate an impossible character conversion.)

      Similar conversion will happen on data that your app queries - for example a ñ that is stored in the database in ISO8859-1 encoding in a single byte will be delivered to your client as a two-byte UTF-8 encoded Unicode ñ character.

You're right, encoding can be very troublesome for developers. The sad part is that quite a few Oracle SQL developers spend far too much effort trying to include conversion calls in their code - when in reality the Oracle client software can do the needed conversion on data going in and out of the database quite automatically, if only you tell it which encoding to convert from/to in the NLS_LANG setting.

      Happy encoding :-D

    2. Thanks for this pretty good explanation! Keep going with those great articles, Kim!
