Monday, August 8, 2011

How SQL Server handles multi byte characters?

To store multi byte characters you need to use Unicode data type in SQL Server. SQL Server has four data types for Unicode characters. (nchar, nvarchar, nvarchar(max),ntext)

See below for the MSDN definitions for the above data types; (except ntext)

Character data types that are either fixed-length, nchar, or variable-length, nvarchar, Unicode data and use the UNICODE UCS-2 character set.

nchar [ ( n ) ]
Fixed-length Unicode character data of n characters. n must be a value from 1 through 4,000. The storage size is two times n bytes. The ISO synonyms for nchar are national char and national character.
nvarchar [ ( n | max ) ]
Variable-length Unicode character data. ncan be a value from 1 through 4,000. max indicates that the maximum storage size is 2^31-1 bytes. The storage size, in bytes, is two times the number of characters entered + 2 bytes. The data entered can be 0 characters in length. The ISO synonyms for nvarchar are national char varying and national character varying.


nchar, nvarchar

Consider below example;

CREATE TABLE dbo.UnicodeTest

(
     Col1 nchar(1)
)

What is your understanding of the Col1? Do you think, “Col1” will always accept one character? Ok. Let’s do a small test and see.


When I execute the above INSERT statement, I received the following error;

What that means? It means the allocated storage for “Col1” is not enough to accommodate the character that I’m inserting.  Then our next question is how much SQL Server allocates for Col1. Look at the MSDN definition stated in top part of this article. As per the dentition, SQL Server uses, UNICODE UCS-2 character set to store Unicode data. UCS-2 means UTF-16 encoding. Which means it allocates n of two bytes to store Unicode data. Which means in our table accept only two byte for Col1. (n=1)

Let’s increase the Col1 length to two and then try to do the insert again.

ALTER TABLE dbo.UnicodeTest ALTER COLUMN Col1 nchar(2)


Storage perspective Col1 now accepts four bytes. Now insert statement executed successfully.

See the table values. 

SELECT * FROM dbo.UnicodeTest


Copy and paste the Col1 value to SSMS and then you will see the actual character.

Let’s see the space utilization of the table.


It allocated one data page (8K) as expected.

Conclusion

In Unicode data presentation is in multi bytes. There are several encoding systems are available in Unicode. (UTF-8, TTF-16, UTF-16) UTF-8 allocates 8bits sets (up to 4) to store Unicode characters. UTF-16 uses 16bits sets (up to 4) and UTF-32 uses 32bits. Most of the web applications use UTF-8 encoding. (You can check this in your brower settings. Look at encoding option) SQL Server uses UTF-16 to store data.


UTF-16/UCS-2

SQL Server, Unicode data types we normally specify how many 16 bits sets you need. Nchar(1) means it allocates only 16 bit to store a single character. In the above example I was inserting multi bytes character which you cant store in 16bit set. So you need 2, 16bit sets for that. When I increased the length of the “Col1” to 2, it inserted successfully. 

No comments:

Post a Comment

How to fix cardinality estimation anomalies [Video]

Use the link mentioned below to watch the presentation that I delivered for PASS DBA Virtual Chapter about Filtered Statistics.  http:...