An interesting problem came up at work the other day. A DBA had been trying to import data from one database (created in a standard 8-bit characterset) to another (created in the UTF-8 characterset). The import failed because, on one table, the length of the data being imported exceeded the length of the column in the new database.
How could this be? Well, the column was a VARCHAR2(10) in both databases, but the new database was using a variable length characterset: UTF-8 allows between 1 and 4 bytes per character. Something that had been stored as a single byte in the original database was now being stored as a 2-, 3- or 4-byte character in the new database, thus blowing the budget on the original 10-bytes allowed for this particular column. (For the record, Microsoft’s smart quotes will do this!)
As a workaround, and a way to allow the import to finish without incident, the DBA involved made the decision to pre-create this table in the new database, before doing the import, with a declaration of VARCHAR2(10 CHAR) -the use of character semantics in this way ensures that the column is allowed to store 10 characters no matter how many bytes might be needed to do so. And that certainly solved the import problem.
But it presented a new worry, the answer to which I didn’t know for sure, but could only guess: if there was PL/SQL code in the original database which referenced that column and loaded it into a variable declared as a VARCHAR2(10), would that code now break when it tried to run in the new database?
Put it another way: does PL/SQL use the overall database byte/character semantics? Or does it match the table semantics it’s run against?
My guess was that the code would break, because PL/SQL would use the default database semantics for column/variable length, not the actual semantics used in any specific case. But I didn’t know for sure, and so a test was in order.
Well, the first thing is to make sure our client session is working in the right characterset, otherwise all bets are off:
[[email protected] ~]$ export NLS_LANG=.AL32UTF8
And here’s the initial state of play:
SQL> select property_name, property_value 2 from database_properties 3 where property_name like 'NLS_CHAR%' 4 or property_name like '%SEMANT%'; PROPERTY_NAME PROPERTY_VALUE ---------------------------- --------------------- NLS_CHARACTERSET AL32UTF8 NLS_LENGTH_SEMANTICS BYTE
So both my client and database use the AL32UTF characterset (variable in length up to 4 bytes per character). My database also uses “BYTE semantics” (meaning that if something is said to be VARCHAR2(10), the ’10′ means 10 bytes, not 10 characters. This is actually the default value for NLS_LENGTH_SEMANTICS, and is how most Oracle databases (probably) would be configured.
Now let me create a table and do what my original DBA did: over-ride the default length semantics and make a column capable of storing a fixed number of characters no matter how many bytes are needed to store them:
SQL> create table hjr_t1 (col1 varchar2(10 CHAR)); Table created.
By specifying ‘CHAR’, I’m insisting that the ’10′ in the varchar(10) bit of syntax means ’10 characters’. No matter how many bytes are needed to store them all. This hopefully explains the following output:
SQL> select owner, data_length from dba_tab_columns 2 where table_name='HJR_T1'; OWNER DATA_LENGTH ------------------------------ ----------- SYS 40
Since UTF-8 is a variable width character set that, sometimes, needs four bytes to store certain characters, Oracle has internally made the column 40 bytes long to deal with that possibility.
Now let me store some data in this table:
SQL> insert into hjr_t1 values ('ÁÇÖÜßîĂĆŐş'); 1 row created. SQL> commit; Commit complete.
So that’s 10 ‘foreign’ characters, and they will each definitely take up more than a single byte of storage. You can see that with this query:
SQL> select col1, length(col1), lengthb(col1) from hjr_t1; COL1 LENGTH(COL1) LENGTHB(COL1) -------------------- ------------ ------------- ÁÇÖÜßîĂĆŐş 10 20
The LENGTH function always shows you the number of characters in a string (and has done so for all time). The LENGTHB function, though, shows you the number of bytes a string represents. Again, in a single-byte characterset environment (or one that only used plain English), there wouldn’t be a difference between the two… but that doesn’t apply here!
So now the original question: if I write code that assigns table data to a VARCHAR2(10) variable, what will happen? Er, this:
SQL> declare 2 v1 varchar2(10); 3 begin 4 select col1 into v1 from hjr_t1; 5 dbms_output.put_line(v1); 6 end; 7 / declare * ERROR at line 1: ORA-06502: PL/SQL: numeric or value error: character string buffer too small ORA-06512: at line 4
So I was right in my initial guess: code which mentions varchar2s without specifying byte or character semantics, when the database itself uses BYTE semantics by default, is assumed itself to have meant byte semantics… and when transposed to a multi-byte characterset database, that code can fail.
Of course, you could change your database semantics:
SQL> alter system set nls_LENGTH_SEMANTICS=CHAR; System altered. SQL> set serveroutput on SQL> declare 2 v1 varchar2(10); 3 begin 4 select col1 into v1 from hjr_t1; 5 dbms_output.put_line(v1); 6 end; 7 / ÁÇÖÜßîĂĆŐş PL/SQL procedure successfully completed.
And you could even re-write your code: if you specifically declare v1 to be varchar2(10 CHAR), the problem goes away, too.
But there’s a further problem:
alter system set nls_LENGTH_SEMANTICS=BYTE;
That just makes the database use byte semantics once more, as our original database would have done. Now let’s write some stored procedural code:
SQL> create or replace procedure hproc 2 is 3 v1 varchar2(10); 4 begin 5 select col1 into v1 from hjr_t1; 6 dbms_output.put_line(v1); 7 end; 8 / Procedure created.
Now that’s pretty much exactly the same code as before, only now it’s part of a stored procedure. Let’s try executing it:
SQL> exec hproc BEGIN hproc; END; * ERROR at line 1: ORA-06502: PL/SQL: numeric or value error: character string buffer too small ORA-06512: at "SYS.HPROC", line 5 ORA-06512: at line 1
And we get the expected problem about the data not fitting the 10-bytes allocated to the v1 variable. No problems: let’s switch to CHAR semantics:
SQL> alter system set nls_LENGTH_SEMANTICS=CHAR; System altered. SQL> SQL> exec hproc BEGIN hproc; END; * ERROR at line 1: ORA-06502: PL/SQL: numeric or value error: character string buffer too small ORA-06512: at "SYS.HPROC", line 5 ORA-06512: at line 1
Ooops. It seems that stored procedures created when BYTE semantics are in force always use BYTE semantics, even if you alter the system to use CHARACTER semantics.
Simply put, if you’ve got a bazillion triggers, functions, packages or procedures that thought they were dealing with a column of 10 bytes, they will always expect 10 bytes, even if you change the default database semantics to be characters. You could always switch the semantics to CHAR and re-compile all your code, of course. At the point of re-compilation, all mentions of varchar2(x) would be assumed to be referring to ‘x characters’. That would get my code working once more… but it’s an awful lot of re-compilation and therefore isn’t exactly a trivial fix on a big database! The other problem with doing that is the fact that you’d end up, effectively, assigning 4 times the amount of memory to every varchar2 variable ever used, anywhere… even though the DBA only altered the semantics on one column in one table. Goodbye code breakage -but hello memory problems!
The original DBA has, therefore, created a bit of a problem! Certainly, creating the column as VARCHAR2(10 CHAR) allowed the import to proceed. But now we have, potentially, lots of code in the database that will suddenly have to deal with 2, 3 or 4 times more data than it was expecting to, and it will break as a result. No ‘alter system’ or ‘alter session’ command can fix that up, either!
As it happens, the only data that was preventing the initial import were things like Microsoft smart qotes, em-dashes and other wonders produced by Word ‘auto-correcting’ what you type. None of that data was really indispensable, in other words: we could have done a bulk bit of data conversion using the TRANSLATE function, swapping smart quotes for dumb ones, for example. Had we done that, there would have been no need to mess about with character or byte semantics in the first place: UTF-8 remains a single-byte character set when it’s dealing with basic English, so no column data would have threatened to ‘expand’ during the import. And thus none of our code would be liable to break (at least, not for these reasons!)
As it is, the imported database is now in production …and we have a bit of an issue on our hands!
The moral of the story: don’t mess with your semantics. Pick one when you first create your database and leave it at that. Don’t start mixing-and-matching for different tables, still less for different columns within the same table… for that way, chaos lies!