Unicode in SQL Server unique constraints - sql

Consider the following script - the second INSERT statement throws a unique key violation.
BEGIN TRAN
CREATE TABLE UnicodeQuestion
(
UnicodeCol NVARCHAR(100)
COLLATE Latin1_General_CI_AI
)
CREATE UNIQUE INDEX UX_UnicodeCol
ON UnicodeQuestion ( UnicodeCol )
INSERT INTO UnicodeQuestion (UnicodeCol) VALUES (N'ae')
INSERT INTO UnicodeQuestion (UnicodeCol) VALUES (N'æ')
ROLLBACK
As I understand it, if I want to have my index treat these values separately, I need to use a binary collation. But there are many binary collations, and they have individual cultures in their names! I don't want culture-sensitive treatment...
Which collation should I use when storing arbitrary Unicode data in nvarchar columns?

For Unicode data it is irrelevant what binary collation you choose.
For Unicode data types, data comparisons are based on the Unicode code points. For binary collations on Unicode data types, the locale is not considered in data sorts. For example, Latin_1_General_BIN and Japanese_BIN yield identical sorting results when used on Unicode data.
The reason for having locale-specific BIN collations is that this determines the code page used when dealing with non-Unicode data.
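For example, a minimal sketch reusing the script from the question, with Latin1_General_BIN2 picked purely as one arbitrary binary collation - any other *_BIN2 collation would behave the same for the nvarchar column:
BEGIN TRAN
CREATE TABLE UnicodeQuestion
(
UnicodeCol NVARCHAR(100)
COLLATE Latin1_General_BIN2 -- compares nvarchar data strictly by Unicode code point
)
CREATE UNIQUE INDEX UX_UnicodeCol
ON UnicodeQuestion ( UnicodeCol )
INSERT INTO UnicodeQuestion (UnicodeCol) VALUES (N'ae')
INSERT INTO UnicodeQuestion (UnicodeCol) VALUES (N'æ') -- no duplicate key error: 'ae' and 'æ' are different code points
ROLLBACK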

Related

Get the encoded value of a character with current code page

I created a database in SQL Server with COLLATION KR949_BIN2, which means that the codepage of this database is 949.
Is it possible to get the encoded value of a character based on the codepage in this database?
For example, the encoded value of character '좩' in codepage 949 is 0xA144, is there a SQL statement that I can get 0xA144 from char '좩' in this database?
Also, is there a way to insert '좩' into a column by its encoded value 0xA144?
Based on "Character data is represented incorrectly when the code page of the client computer differs from the code page of the database in SQL Server 2005", I suspect that you're actually using the Korean_Wansung_CI_AS collation or something similar:
Method 2: Use an appropriate collation for the database
If you must use a non-Unicode data type, always make sure that the code page of the database and the code page of any non-Unicode columns can store the non-Unicode data correctly. For example, if you want to store code page 949 (Korean) character data, use a Korean collation for the database. For example, use the Korean_Wansung_CI_AS collation for the database.
That being the case, yes, you can see and insert 0xA144 as per the following example:
create table #Wansung (
[Description] varchar(50),
Codepoint varchar(50) collate Korean_Wansung_CI_AS
);
insert #Wansung ([Description], Codepoint)
select 'U+C8A9 Hangul Syllable Jwaeg', N'좩';
insert #Wansung ([Description], Codepoint)
select 'From Windows-949 encoding', 0xA144;
select [Description], Codepoint, cast(Codepoint as varbinary(max)) as Bytes, cast(unicode(Codepoint) as varbinary(max)) as UTF32
from #Wansung;
Which returns the results:
Description                   Codepoint  Bytes   UTF32
U+C8A9 Hangul Syllable Jwaeg  좩         0xA144  0x0000C8A9
From Windows-949 encoding     좩         0xA144  0x0000C8A9

How to manage ORDER BY in SQL Server compared to Sybase?

I'm migrating from Sybase IQ to SQL Server 2008; one major difference observed is in the ORDER BY clause.
I created a table as: create table test(name varchar(20))
Inserted some records:
insert into test values('Hi')
insert into test values('Toi')
insert into test values('>Toi')
insert into test values('iHh')
insert into test values('hi')
insert into test values('IhH')
insert into test values('1Hi')
insert into test values('2Hi')
Performed select operation on both SQL Server and Sybase as:
select * from test order by name desc
Result for Sybase is:
name
-------
iHh
hi
Toi
IhH
Hi
>Toi
2Hi
1Hi
And result for SQL server is:
name
-------
Toi
IhH
iHh
Hi
hi
2Hi
1Hi
>Toi
Why does this order differ between SQL Server and Sybase? How can I manage ORDER BY in SQL Server to get the same result as in Sybase?
You can use the collation Latin1_General_BIN2 as the SQL Server default collation, or specify the collation in the ORDER BY clause.
Binary collations
Binary collations sort data based on the sequence of coded values that are defined by the locale and data type. They are case sensitive. A binary collation in SQL Server defines the locale and the ANSI code page that will be used. This enforces a binary sort order. Because they are relatively simple, binary collations help improve application performance. For non-Unicode data types, data comparisons are based on the code points that are defined in the ANSI code page. For Unicode data types, data comparisons are based on the Unicode code points. For binary collations on Unicode data types, the locale is not considered in data sorts. For example, Latin_1_General_BIN and Japanese_BIN yield identical sorting results when they are used on Unicode data.
There are two types of binary collations in SQL Server; the older BIN collations and the newer BIN2 collations. In a BIN2 collation all characters are sorted according to their code points. In a BIN collation only the first character is sorted according to the code point, and remaining characters are sorted according to their byte values. (Because the Intel platform is a little endian architecture, Unicode code characters are always stored byte-swapped.)
declare @Test table (name varchar(20) collate Latin1_General_BIN2)
insert @Test values ('Hi'), ('Toi'), ('>Toi'), ('iHh'), ('hi'), ('IhH'), ('1Hi'), ('2Hi')
select * from @Test order by name desc
Or just
select * from @Test order by name collate Latin1_General_BIN2 desc
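To see the BIN vs BIN2 difference described in the quote, here is a small sketch; the @BinTest table and the two-character test strings are made up just to expose the byte order:
declare @BinTest table (s nvarchar(10))
insert @BinTest values (N'a' + NCHAR(0x0100)), (N'a' + NCHAR(0x0001))
-- BIN2 compares every character by Unicode code point, so U+0001 sorts before U+0100
select s, cast(s as varbinary(10)) as bytes from @BinTest order by s collate Latin1_General_BIN2
-- BIN compares the first character by code point and the rest by their stored (little endian) bytes,
-- so U+0100 (bytes 00 01) sorts before U+0001 (bytes 01 00)
select s, cast(s as varbinary(10)) as bytes from @BinTest order by s collate Latin1_General_BIN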
Use the ASCII value of the column and order by it in descending order. ASCII() gives the code of the first letter of the word, and you can then add the column itself as a secondary sort:
SELECT Name FROM #tblTest ORDER BY ASCII(Name) DESC, Name

Determining Nvarchar length

I've read all about varchar versus nvarchar. But I didn't see an answer to what I think is a simple question. How do you determine the length of your nvarchar column? For varchar it's very simple: my Description, for example, can have 100 characters, so I define varchar(100). Now I'm told we need to internationalize and support any language. Does this mean I need to change my Description column to nvarchar(200), i.e. simply double the length? (And I'm ignoring all the other issues that are involved with internationalization for the moment.)
Is it that simple?
Generally it is the same as for varchar really. The number is still the maximum number of characters not the data length.
nvarchar(100) allows 100 characters (which would potentially consume 200 bytes in SQL Server).
You might want to allow for the fact that different cultures may take more characters to express the same thing though.
An exception to this, however, is if you are using an SC collation (one that supports supplementary characters). In that case a single character can potentially take up to 4 bytes.
So the worst case would be to double the declared character count.
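As a quick illustration (the @t table variable and the repeated Cyrillic character are only an example), LEN counts characters while DATALENGTH counts bytes:
declare @t table (Description nvarchar(100))
insert @t values (REPLICATE(N'ж', 100)) -- 100 two-byte characters fit
select LEN(Description) as Characters, DATALENGTH(Description) as Bytes from @t -- returns Characters = 100, Bytes = 200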
From microsoft web site:
A common misconception is to think that with NCHAR(n) and NVARCHAR(n), the n defines the number of characters. But in NCHAR(n) and NVARCHAR(n) the n defines the string length in byte-pairs (0-4,000). n never defines numbers of characters that can be stored. This is similar to the definition of CHAR(n) and VARCHAR(n).
The misconception happens because when using characters defined in the Unicode range 0-65,535, one character can be stored per each byte-pair. However, in higher Unicode ranges (65,536-1,114,111) one character may use two byte-pairs. For example, in a column defined as NCHAR(10), the Database Engine can store 10 characters that use one byte-pair (Unicode range 0-65,535), but less than 10 characters when using two byte-pairs (Unicode range 65,536-1,114,111). For more information about Unicode storage and character ranges, see
https://learn.microsoft.com/en-us/sql/t-sql/data-types/nchar-and-nvarchar-transact-sql?view=sql-server-ver15
@Musa Calgar - exactly right. That link has the information for the answer to this question.
But to make sure the question itself is clear, we are talking about the 'length' attribute we see when we look at the column definition for a given table, right? That is the storage allocated per column. On the other hand, if we want to know the number of characters for a given string in the table at a given moment you can:
"SELECT myColumn, LEN(myColumn) FROM myTable"
But if the storage length is desired, you can drag the table name into the query window using SSMS, highlight it, and use 'Alt-F1' to see the defined lengths of each column.
So as an example, I created a table like this, specifying collations. (Latin1_General_100_CI_AS_SC allows for supplemental characters - that is, characters that take more than just 2 bytes):
CREATE TABLE [dbo].[TestTable1](
[col1] [varchar](10) COLLATE Latin1_General_100_CI_AS,
[col2] [nvarchar](10) COLLATE Latin1_General_100_CI_AS_SC,
[col3] [nvarchar](10) COLLATE Latin1_General_100_CI_AS
) ON [PRIMARY]
The lengths show up like this (Highlight in query window and Alt-F1):
Column_Name  Type      Length  [...]  Collation
col1         varchar   10             Latin1_General_100_CI_AS
col2         nvarchar  20             Latin1_General_100_CI_AS_SC
col3         nvarchar  20             Latin1_General_100_CI_AS
If you insert ASCII characters into the varchar and nvarchar fields, it will allow you to put 10 characters into all of them. There will be an error if you try to put more than 10 characters into those fields:
"String or binary data would be truncated.
The statement has been terminated."
If you insert non-ASCII characters like 'ā' you can still put 10 of them into each one, but SQL Server will convert the values going into col1 to the closest known character that fits into 1 byte. In this case, 'ā' will be converted to 'a'.
However, if you insert characters that require 4 bytes to store, like for example, '𠜎', you will only be allowed to put FIVE of them into the varchar and nvarchar fields. Any more than that will result in the truncation error shown above. The varchar field will show question marks because it has no single-byte character that it can convert that input to.
So when you insert five of these '𠜎', do a select of that row using len(<colname>) and you will see this:
col1 len(col1) col2 len(col2) col3 len(col3)
?????????? 10 𠜎𠜎𠜎𠜎𠜎 5 𠜎𠜎𠜎𠜎𠜎 10
So the length of col2 shows 5 characters since supplemental characters were defined when the table was created (see above CREATE TABLE DDL statement). However, col3 did not have _SC for its collation, so it is showing length 10 for the five characters we inserted.
Note that col1 has ten question marks. If we had defined the col1 varchar using the _SC collation instead of the non-supplemental one, it would behave the same way.

Unicode characters in Sql table

I am using SQL Server 2008 R2 Enterprise. I am coding an application capable of inserting, updating, deleting and selecting records from SQL Server tables. The application makes errors when it comes to records that contain special characters such as ć, č, š, đ and ž.
Here's what happens:
The command:
INSERT INTO Account (Name, Person)
VALUES ('Boris Borenović', 'True')
inserts a new record but the Name field is Boris Borenovic, so character ć is changed to c.
The command:
SELECT * FROM Account
WHERE Name = 'Boris Borenović'
returns the correct record, so again the character ć is replaced by c and the record is returned.
Questions:
Is it possible to make Sql Server save the ć and other special characters mentioned earlier?
Is it still possible, if the previous question is resolved, to make Sql be able to return the Boris Borenović record even if the query asks for Boris Borenovic?
So, when saving records I want SQL Server to save exactly what is given, but when retrieving the records, I want it to be able to ignore the special characters. Thanks for all the help.
1) Make sure the column is of type nvarchar rather than varchar (or nchar for char)
2) Use N' at the start of string literals containing such strings, e.g. N'Boris Borenović'
3) If you're using a client library (e.g. ADO.Net), it should handle Unicode text, so long as, again, the parameters are marked as being nvarchar/nchar instead of varchar/char
4) If you want to query and ignore accents, then you can add a COLLATE clause to your select. E.g.:
SELECT * FROM Account
WHERE Name = 'Boris Borenovic' COLLATE Latin1_General_CI_AI
Here _CI_AI means Case Insensitive, Accent Insensitive; this should return all rows with all variants of the "c" at the end.
5) If the column in the table is part of a UNIQUE/PK constraint, and you need it to contain both "Boris Borenović" and "Boris Borenovic", then add a COLLATE clause to the column definition, but this time use a collation with "_AS" at the end, which says that it's accent sensitive.
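Putting points 1) to 5) together, a minimal sketch (the Account2 table is made up for illustration): an accent-sensitive collation on the unique column lets both spellings coexist, while an accent-insensitive COLLATE in the query still finds both:
CREATE TABLE Account2
(
Id int IDENTITY PRIMARY KEY,
Name nvarchar(100) COLLATE Latin1_General_CI_AS UNIQUE -- accent-sensitive, so the two spellings are distinct values
)
INSERT INTO Account2 (Name) VALUES (N'Boris Borenović'), (N'Boris Borenovic') -- both rows accepted
SELECT * FROM Account2
WHERE Name = N'Boris Borenovic' COLLATE Latin1_General_CI_AI -- accent-insensitive search returns both rows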
To allow SQL Server to store special characters, use nvarchar instead of varchar for the column type.
When retrieving, you can force an accent-insensitive collation so that it ignores the different C's:
WHERE Name = 'Boris Borenović' COLLATE Cyrillic_General_CI_AI
Here, CI stands for Case Insensitive, and AI for Accent Insensitive.
I faced the same problem, and after some research:
https://dba.stackexchange.com/questions/139551/how-do-i-set-a-sql-server-unicode-nvarchar-string-to-an-emoji-or-supplementary
What is the difference between varchar and nvarchar?
I altered type of needed fields:
ALTER TABLE [table_name] ALTER COLUMN column_name [nvarchar](length) -- specify an explicit length: a bare nvarchar here defaults to nvarchar(1)
GO
And it works!

Achieving properties of binary and collation at the same time

I have a varchar field in my database which I use for two significantly different things. In one scenario I use it for evaluating with case sensitivity to ensure no duplicates are inserted. To achieve this I've set the comparison to binary. However, I want to be able to search case-insensitively on the same column values. Is there any way I can do this without simply creating a redundant column with a different collation instead of binary?
CREATE TABLE t_search (value VARCHAR(50) NOT NULL COLLATE UTF8_BIN PRIMARY KEY);
INSERT
INTO t_search
VALUES ('test');
INSERT
INTO t_search
VALUES ('TEST');
SELECT *
FROM t_search
WHERE value = 'test' COLLATE UTF8_GENERAL_CI;
The second query will return both rows.
Note, however, that anything with COLLATE applied to it has the lowest coercibility.
This means that it is the column value that will be converted to UTF8_GENERAL_CI for comparison purposes, not the other way round, which means that the index on value will not be used for searching and the condition in the query will not be sargable.
If you need good performance on case-insensitive searching, you should create an additional column with case-insensitive collation, index it and use in the searches.
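A rough sketch of that extra-column approach (the value_ci column and the index name are made up):
ALTER TABLE t_search ADD value_ci VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_general_ci;
UPDATE t_search SET value_ci = value; -- keep the copy in sync (e.g. from triggers or application code)
CREATE INDEX ix_t_search_value_ci ON t_search (value_ci);
SELECT * FROM t_search WHERE value_ci = 'test'; -- sargable, can use ix_t_search_value_ci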
You can use the COLLATE clause to change the collation of a column in a query. See this manual page for extensive examples.