SQL Server : how do I get SQL to recognize ö and ß correctly - sql

I currently have a table that has 4 columns. The ID of the object, ID of another object in another table, nvarchar data, and a bool.
PK is made up of the first 3 columns.
The values größe conflicts with grösse, and große conflicts with grosse
meaning I can have one of the first two and one of the second two, but not all of them
The column has collation set to SQL_Latin1_General_CS_AS
I believe this is where the problem lies but this does handle many other unicode characters correctly. Has anyone encountered this and know what my problem is?
For reference both of these play okay with all of the above.
gråsse
grøsse
Example for clarity, for me this is printing equal:
IF (N'grösse' COLLATE Latin1_General_CS_AS = N'größe' COLLATE Latin1_General_CS_AS)
BEGIN
PRINT 'EQUAL'
END
When I expect these to be different.

handle many other unicode characters correctly
What does correctly mean to you? The different collations in SQL Server have different behavior. Maybe you are looking for a binary collation like LATIN1_GENERAL_BIN2. This one compares code-points only. Duplicates will only occur when the strings are binary-identical. Your example code would behave like you want it to.
The non-binary collations try to apply lexicographic rules. They sort and compare strings like a phone book would.

Related

SQL NVARCHAR(MAX) returning ASCII and Weird Characters instead of Text

I have an SQL Table and I'm trying to return the values as a string.
The values should be city names like Sydney, Melbourne, Port Maquarie etc.
But When I run a select I either get black results or as detailed in the first picture some strange backwards L character. The column is an NVARCHAR(MAX)
SELECT ctGlobalName FROM Crm.Cities
Then I tried using MSSQL's Edit top 200 rows feature and I could see the names of the cities, but also all these weird ascii characters.
Now I didn't create the database, I'm just running queries on it. Some things I've read have suggested it is a problem with the Collation. But the table is SQL_Latin1_General_CP1_CI_AS which matches the server collation.
I'm sure there must be something I can add to my select query to return the values as an ordinary string. Is there something I can do to my select query to return the expected format without the weird characters?
An NVARCHAR datatype can store Unicode characters, which are used for languages that are not supported by the ASCII character set i.e. non-English (or related) languages such as Chinese or Indonesian. If your SQL Server or Windows doesn't have that language installed then you might see strange-looking representations of the data.
On the other hand, it could also be that the application that updates this table has just stored bad data in that column.
Either way you might need to do some string manipulation to strip out the characters you don't want.

What does collation mean?

What does collation mean in SQL, and what does it do?
Collation can be simply thought of as sort order.
In English (and it's strange cousin, American), collation may be a pretty simple matter consisting of ordering by the ASCII code.
Once you get into those strange European languages with all their accents and other features, collation changes. For example, though the different accented forms of a may exist at disparate code points, they may all need to be sorted as if they were the same letter.
Besides the "accented letters are sorted differently than unaccented ones" in some Western European languages, you must take into account the groups of letters, which sometimes are sorted differently, also.
Traditionally, in Spanish, "ch" was considered a letter in its own right, same with "ll" (both of which represent a single phoneme), so a list would get sorted like this:
caballo
cinco
coche
charco
chocolate
chueco
dado
(...)
lámpara
luego
llanta
lluvia
madera
Notice all the words starting with single c go together, except words starting with ch which go after them, same with ll-starting words which go after all the words starting with a single l. This is the ordering you'll see in old dictionaries and encyclopedias, sometimes even today by very conservative organizations.
The Royal Academy of the Language changed this to make it easier for Spanish to be accomodated in the computing world. Nevertheless, ñ is still considered a different letter than n and goes after it, and before o. So this is a correctly ordered list:
Namibia
número
ñandú
ñú
obra
ojo
By selecting the correct collation, you get all this done for you, automatically :-)
Rules that tell how to compare and sort strings: letters order; whether case matters, whether diacritics matter etc.
For instance, if you want all letters to be different (say, if you store filenames in UNIX), you use UTF8_BIN collation:
SELECT 'A' COLLATE UTF8_BIN = 'a' COLLATE UTF8_BIN
---
0
If you want to ignore case and diacritics differences (say, for a search engine), you use UTF8_GENERAL_CI collation:
SELECT 'A' COLLATE UTF8_GENERAL_CI = 'ä' COLLATE UTF8_GENERAL_CI
---
1
As you can see, this collation (comparison rule) considers capital A and lowecase ä the same letter, ignoring case and diacritic differences.
Collation defines how you sort and compare string values
For example, it defines how to deal with
accents (äàa etc)
case (Aa)
the language context:
In a French collation, cote < côte < coté < côté.
In the SQL Server Latin1 default , cote < coté < côte < côté
ASCII sorts (a binary collation)
Collation means assigning some order to the characters in an Alphabet, say, ASCII or Unicode etc.
Suppose you have 3 characters in your alphabet - {A,B,C}. You can define some example collations for it by assigning integral values to the characters
Example 1 = {A=1,B=2,C=3}
Example 2 = {C=1,B=2,A=3}
Example 3 = {B=1,C=2,A=3}
As a matter of fact, you can define n! collations on an Alphabet of size n. Given such an order, different sorting routines likes LSD/MSD string sorts make use of it for sorting strings.
Collation determines how your data is sorted and compared. It's very often important with regards to internazionalization, e.g. how do you sort japanese kanji?
If you google collation and sql server you'll find plenty of articles discussing it!
Reference is taken from this Article:
A collation is a set of rules for comparing characters in a character set. It has also ruled for sorting of characters and proper order of two characters varies from language to language.
A Collation compared two strings like, if a word is greater than another one, and sort accordingly.
If you are using “latin1” Character set, you can use “latin1_swedish_ci” Collation.
You have to choose right collation because wrong collation may affect your database performance.
http://en.wikipedia.org/wiki/Collation
Collation is the assembly of written information into a standard order. (...) A collation algorithm such as the Unicode collation algorithm defines an order through the process of comparing two given character strings and deciding which should come before the other.
The collation is how SQL server decides on how to sort and compare text.
See MSDN.

Various ways to query strings?

I'm working on an app that involves a lot of carefully designed strings. I'm in the process of designing the string format and for that I need to know what's possible and what's not when I'm querying the same data.
Which ones of these are possible with MySQL? .. and how do I accomplish them?
Results which contain this exact string -- not case sensitive
Results which contain this exact string -- case sensitive
Results which contain a similar string -- not case sensitive
Results which contain a similar string -- but individual characters must be of the same case
1. Results which contain this exact string -- not case sensitive
2. Results which contain this exact string -- case sensitive
These can both be accomplished. See the page documenting string functions in MySQL, in particular INSTR.
Case sensitivity is determined by the collation of a column. If you want values in a column to be compared in a case-sensitive fashion, then you give it a case-sensitive collation, as in the following example:
ALTER TABLE MyTable ADD MyColumn VARCHAR(10) CHARACTER SET ascii COLLATE ascii_general_cs NOT NULL
Conversely, if you want values in a column to be compared in a case-insensitive fashion, then give it a case-insensitive collation.
If you might want values in a column to be compared either way, then there are ways to do that too, though it's slightly more complicated.
3. Results which contain a similar string -- not case sensitive
4. Results which contain a similar string -- but individual characters must be of the same case
Depends what exactly you mean by "similar", but for some values of "similar" yes this is available. You will probably find it useful to consult the page I linked above.
SELECT this FROM that WHERE LOWER(string) = LOWER("blablabla");
SELECT this FROM that WHERE string = "blablabla";
SELECT this FROM that WHERE LOWER(string) LIKE LOWER("blablabla");
SELECT this FROM that WHERE string LIKE "blablabla";
Hope that's right.

When should I use the SQL Server Unicode 'N' Constant?

I've been looking into the use of the Unicode 'N' constant within my code, for example:
select object_id(N'VW_TABLE_UPDATE_DATA', N'V');
insert into SOME_TABLE (Field1, Field2) values (N'A', N'B');
After doing some reading around when to use it, and I'm still not entirely clear as to the circumstances under which it should and should not be used.
Is it as simple as using it when data types or parameters expect a unicode data type (as per the above examples), or is it more sophiticated than that?
The following Microsoft site gives an explanation, but I'm also a little unclear as to some of the terms it is using
http://msdn.microsoft.com/en-us/library/ms179899.aspx
Or to precis:
Unicode constants are interpreted as
Unicode data, and are not evaluated by
using a code page. Unicode constants
do have a collation. This collation
primarily controls comparisons and
case sensitivity. Unicode constants
are assigned the default collation of
the current database, unless the
COLLATE clause is used to specify a
collation.
What does it mean by:
'evaluated by using a code page'?
Collation?
I realise this is quite a broad question, but any links or help would be appreciated.
Thanks
Is it as simple as using it when data types or parameters expect a unicode data type?
Pretty much.
To answer your other points:
A code page is another name for encoding of a character set. For example, windows code page 1255 encodes Hebrew. This is normally used for 8bit encodings for characters. In terms of your question, strings may be evaluated using different code pages (so the same bit pattern may be interpreted as a Japanese character or an Arabic one, depending on what code page was used to evaluate it).
Collation is about how SQL Server is to order strings - this depends on code page, as you would order strings in different languages differently. See this article for an example.
National character nchar() and nvarchar() use two bytes per character and support international character set -- think internet.
The N prefix converts a string constant to two bytes per character. So if you have people from different countries and would like their names properly stored -- something like:
CREATE TABLE SomeTable (
id int
,FirstName nvarchar(50)
);
Then use:
INSERT INTO SomeTable
( Id, FirstName )
VALUES ( 1, N'Guðjón' );
and
SELECT *
FROM SomeTable
WHERE FirstName = N'Guðjón';

SQL server ignore case in a where expression

How do I construct a SQL query (MS SQL Server) where the "where" clause is case-insensitive?
SELECT * FROM myTable WHERE myField = 'sOmeVal'
I want the results to come back ignoring the case
In the default configuration of a SQL Server database, string comparisons are case-insensitive. If your database overrides this setting (through the use of an alternate collation), then you'll need to specify what sort of collation to use in your query.
SELECT * FROM myTable WHERE myField = 'sOmeVal' COLLATE SQL_Latin1_General_CP1_CI_AS
Note that the collation I provided is just an example (though it will more than likely function just fine for you). A more thorough outline of SQL Server collations can be found here.
Usually, string comparisons are case-insensitive. If your database is configured to case sensitive collation, you need to force to use a case insensitive one:
SELECT balance FROM people WHERE email = 'billg#microsoft.com'
COLLATE SQL_Latin1_General_CP1_CI_AS
I found another solution elsewhere; that is, to use
upper(#yourString)
but everyone here is saying that, in SQL Server, it doesn't matter because it's ignoring case anyway? I'm pretty sure our database is case-sensitive.
The top 2 answers (from Adam Robinson and Andrejs Cainikovs) are kinda, sorta correct, in that they do technically work, but their explanations are wrong and so could be misleading in many cases. For example, while the SQL_Latin1_General_CP1_CI_AS collation will work in many cases, it should not be assumed to be the appropriate case-insensitive collation. In fact, given that the O.P. is working in a database with a case-sensitive (or possibly binary) collation, we know that the O.P. isn't using the collation that is the default for so many installations (especially any installed on an OS using US English as the language): SQL_Latin1_General_CP1_CI_AS. Sure, the O.P. could be using SQL_Latin1_General_CP1_CS_AS, but when working with VARCHAR data, it is important to not change the code page as it could lead to data loss, and that is controlled by the locale / culture of the collation (i.e. Latin1_General vs French vs Hebrew etc). Please see point # 9 below.
The other four answers are wrong to varying degrees.
I will clarify all of the misunderstandings here so that readers can hopefully make the most appropriate / efficient choices.
Do not use UPPER(). That is completely unnecessary extra work. Use a COLLATE clause. A string comparison needs to be done in either case, but using UPPER() also has to check, character by character, to see if there is an upper-case mapping, and then change it. And you need to do this on both sides. Adding COLLATE simply directs the processing to generate the sort keys using a different set of rules than it was going to by default. Using COLLATE is definitely more efficient (or "performant", if you like that word :) than using UPPER(), as proven in this test script (on PasteBin).
There is also the issue noted by #Ceisc on #Danny's answer:
In some languages case conversions do not round-trip. i.e. LOWER(x) != LOWER(UPPER(x)).
The Turkish upper-case "İ" is the common example.
No, collation is not a database-wide setting, at least not in this context. There is a database-level default collation, and it is used as the default for altered and newly created columns that do not specify the COLLATE clause (which is likely where this common misconception comes from), but it does not impact queries directly unless you are comparing string literals and variables to other string literals and variables, or you are referencing database-level meta-data.
No, collation is not per query.
Collations are per predicate (i.e. something operand something) or expression, not per query. And this is true for the entire query, not just the WHERE clause. This covers JOINs, GROUP BY, ORDER BY, PARTITION BY, etc.
No, do not convert to VARBINARY (e.g.convert(varbinary, myField) = convert(varbinary, 'sOmeVal')) for the following reasons:
that is a binary comparison, which is not case-insensitive (which is what this question is asking for)
if you do want a binary comparison, use a binary collation. Use one that ends with _BIN2 if you are using SQL Server 2008 or newer, else you have no choice but to use one that ends with _BIN. If the data is NVARCHAR then it doesn't matter which locale you use as they are all the same in that case, hence Latin1_General_100_BIN2 always works. If the data is VARCHAR, you must use the same locale that the data is currently in (e.g. Latin1_General, French, Japanese_XJIS, etc) because the locale determines the code page that is used, and changing code pages can alter the data (i.e. data loss).
using a variable-length datatype without specifying the size will rely on the default size, and there are two different defaults depending on the context where the datatype is being used. It is either 1 or 30 for string types. When used with CONVERT() it will use the 30 default value. The danger is, if the string can be over 30 bytes, it will get silently truncated and you will likely get incorrect results from this predicate.
Even if you want a case-sensitive comparison, binary collations are not case-sensitive (another very common misconception).
No, LIKE is not always case-sensitive. It uses the collation of the column being referenced, or the collation of the database if a variable is compared to a string literal, or the collation specified via the optional COLLATE clause.
LCASE is not a SQL Server function. It appears to be either Oracle or MySQL. Or possibly Visual Basic?
Since the context of the question is comparing a column to a string literal, neither the collation of the instance (often referred to as "server") nor the collation of the database have any direct impact here. Collations are stored per each column, and each column can have a different collation, and those collations don't need to be the same as the database's default collation or the instance's collation. Sure, the instance collation is the default for what a newly created database will use as its default collation if the COLLATE clause wasn't specified when creating the database. And likewise, the database's default collation is what an altered or newly created column will use if the COLLATE clause wasn't specified.
You should use the case-insensitive collation that is otherwise the same as the collation of the column. Use the following query to find the column's collation (change the table's name and schema name):
SELECT col.*
FROM sys.columns col
WHERE col.[object_id] = OBJECT_ID(N'dbo.TableName')
AND col.[collation_name] IS NOT NULL;
Then just change the _CS to be _CI. So, Latin1_General_100_CS_AS would become Latin1_General_100_CI_AS.
If the column is using a binary collation (ending in _BIN or _BIN2), then find a similar collation using the following query:
SELECT *
FROM sys.fn_helpcollations() col
WHERE col.[name] LIKE N'{CurrentCollationMinus"_BIN"}[_]CI[_]%';
For example, assuming the column is using Japanese_XJIS_100_BIN2, do this:
SELECT *
FROM sys.fn_helpcollations() col
WHERE col.[name] LIKE N'Japanese_XJIS_100[_]CI[_]%';
For more info on collations, encodings, etc, please visit: Collations Info
No, only using LIKE will not work. LIKE searches values matching exactly your given pattern. In this case LIKE would find only the text 'sOmeVal' and not 'someval'.
A pracitcable solution is using the LCASE() function. LCASE('sOmeVal') gets the lowercase string of your text: 'someval'. If you use this function for both sides of your comparison, it works:
SELECT * FROM myTable WHERE LCASE(myField) LIKE LCASE('sOmeVal')
The statement compares two lowercase strings, so that your 'sOmeVal' will match every other notation of 'someval' (e.g. 'Someval', 'sOMEVAl' etc.).
You can force the case sensitive, casting to a varbinary like that:
SELECT * FROM myTable
WHERE convert(varbinary, myField) = convert(varbinary, 'sOmeVal')
What database are you on? With MS SQL Server, it's a database-wide setting, or you can over-ride it per-query with the COLLATE keyword.