What does collation mean? - sql

What does collation mean in SQL, and what does it do?

Collation can be simply thought of as sort order.
In English (and its strange cousin, American), collation may be a pretty simple matter consisting of ordering by the ASCII code.
Once you get into those strange European languages with all their accents and other features, collation changes. For example, though the different accented forms of a may exist at disparate code points, they may all need to be sorted as if they were the same letter.

Besides the "accented letters are sorted differently than unaccented ones" rule in some Western European languages, you must also take into account groups of letters, which are sometimes sorted as a unit.
Traditionally, in Spanish, "ch" was considered a letter in its own right, same with "ll" (both of which represent a single phoneme), so a list would get sorted like this:
caballo
cinco
coche
charco
chocolate
chueco
dado
(...)
lámpara
luego
llanta
lluvia
madera
Notice all the words starting with single c go together, except words starting with ch which go after them, same with ll-starting words which go after all the words starting with a single l. This is the ordering you'll see in old dictionaries and encyclopedias, sometimes even today by very conservative organizations.
The Royal Academy of the Language changed this to make it easier for Spanish to be accommodated in the computing world. Nevertheless, ñ is still considered a different letter from n and goes after it, and before o. So this is a correctly ordered list:
Namibia
número
ñandú
ñú
obra
ojo
By selecting the correct collation, you get all this done for you, automatically :-)
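As a rough illustration of the two Spanish orderings described above, here is a minimal Python sketch. It is not how any database implements its Spanish collations; the names ALPHABET, traditional_key and modern_key are made up for this example, and accented vowels are simply treated as their base letters.

```python
import unicodedata

# Traditional Spanish alphabet: "ch" and "ll" count as letters, "ñ" follows "n".
ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l",
            "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w",
            "x", "y", "z"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
UNKNOWN = len(ALPHABET)  # anything outside the alphabet sorts last


def normalize(word):
    """Lowercase and strip accents from vowels, but keep ñ as its own letter."""
    out = []
    for ch in unicodedata.normalize("NFC", word.lower()):
        if ch == "ñ":
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)


def traditional_key(word):
    """Sort key where the digraphs ch and ll are single letters."""
    text = normalize(word)
    key, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("ch", "ll"):
            key.append(RANK[pair])
            i += 2
        else:
            key.append(RANK.get(text[i], UNKNOWN))
            i += 1
    return key


def modern_key(word):
    """Sort key where only ñ is special; ch and ll are plain letter pairs."""
    return [RANK.get(ch, UNKNOWN) for ch in normalize(word)]


words = ["chocolate", "coche", "caballo", "charco", "cinco", "chueco", "dado"]
traditional = sorted(words, key=traditional_key)
modern = sorted(words, key=modern_key)
enye = sorted(["ojo", "ñú", "Namibia", "obra", "ñandú", "número"], key=modern_key)
```

Here traditional reproduces the old dictionary order from the first list, modern puts the ch words between caballo and cinco, and enye reproduces the second list.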

A collation is a set of rules that tell how to compare and sort strings: the order of the letters, whether case matters, whether diacritics matter, etc.
For instance, if you want all letters to be different (say, if you store filenames in UNIX), you use UTF8_BIN collation:
SELECT 'A' COLLATE UTF8_BIN = 'a' COLLATE UTF8_BIN
---
0
If you want to ignore case and diacritics differences (say, for a search engine), you use UTF8_GENERAL_CI collation:
SELECT 'A' COLLATE UTF8_GENERAL_CI = 'ä' COLLATE UTF8_GENERAL_CI
---
1
As you can see, this collation (comparison rule) considers capital A and lowercase ä the same letter, ignoring case and diacritic differences.
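The same two behaviours can be imitated outside the database. The Python sketch below is only an approximation of what a binary collation and a case- and accent-insensitive collation do; the helper names binary_equal, fold and ci_ai_equal are invented for this example and are not MySQL functions.

```python
import unicodedata


def binary_equal(a, b):
    """Binary comparison: strings are equal only if their bytes are identical."""
    return a.encode("utf-8") == b.encode("utf-8")


def fold(text):
    """Drop diacritics and case so that 'A', 'a' and 'ä' all become 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def ci_ai_equal(a, b):
    """Case-insensitive, accent-insensitive comparison."""
    return fold(a) == fold(b)


strict = binary_equal("A", "a")   # False, like the UTF8_BIN query
relaxed = ci_ai_equal("A", "ä")   # True, like the UTF8_GENERAL_CI query
```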

Collation defines how you sort and compare string values
For example, it defines how to deal with
accents (äàa etc)
case (Aa)
the language context:
In a French collation, cote < côte < coté < côté.
In the SQL Server Latin1 default, cote < coté < côte < côté.
ASCII sorts (a binary collation)
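One way to see why the two accent orderings above differ: if the base letters are equal, the French-style rule compares accents starting from the end of the word, while the other rule compares them from the start. The Python sketch below is a simplified model under that assumption, not the actual algorithm of SQL Server or any French collation; split_accents, forward_key and french_key are hypothetical names.

```python
import unicodedata


def split_accents(word):
    """Return the base letters and a 0/1 flag per letter marking an accent."""
    base, marks = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if marks:
                marks[-1] = 1  # the previous base letter carries an accent
        else:
            base.append(ch.lower())
            marks.append(0)
    return "".join(base), marks


def forward_key(word):
    """Base letters first, then accents compared left to right."""
    base, marks = split_accents(word)
    return (base, marks)


def french_key(word):
    """Base letters first, then accents compared right to left."""
    base, marks = split_accents(word)
    return (base, marks[::-1])


words = ["côté", "coté", "côte", "cote"]
forward_sorted = sorted(words, key=forward_key)  # cote, coté, côte, côté
french_sorted = sorted(words, key=french_key)    # cote, côte, coté, côté
```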

Collation means assigning some order to the characters in an Alphabet, say, ASCII or Unicode etc.
Suppose you have 3 characters in your alphabet - {A,B,C}. You can define some example collations for it by assigning integral values to the characters
Example 1 = {A=1,B=2,C=3}
Example 2 = {C=1,B=2,A=3}
Example 3 = {B=1,C=2,A=3}
As a matter of fact, you can define n! collations on an alphabet of size n. Given such an order, different sorting routines like LSD/MSD string sorts make use of it for sorting strings.
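The three example collations can be tried out directly. This is a small Python sketch in which make_key is a made-up helper that turns an ordering of the alphabet into a sort key; the word list is arbitrary.

```python
import math
from itertools import permutations


def make_key(order):
    """Build a sort key from an ordering of the alphabet, e.g. 'CBA'."""
    rank = {ch: i for i, ch in enumerate(order, start=1)}
    return lambda s: [rank[ch] for ch in s]


words = ["CAB", "ABC", "BCA", "BA"]

example1 = sorted(words, key=make_key("ABC"))  # A=1, B=2, C=3
example2 = sorted(words, key=make_key("CBA"))  # C=1, B=2, A=3
example3 = sorted(words, key=make_key("BCA"))  # B=1, C=2, A=3

# Every permutation of the alphabet is a possible collation: 3! = 6 of them.
collation_count = len(list(permutations("ABC")))
```

The same words come out in a different order under each collation, and collation_count equals math.factorial(3).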

Collation determines how your data is sorted and compared. It's very often important with regard to internationalization, e.g. how do you sort Japanese kanji?
If you google collation and sql server you'll find plenty of articles discussing it!

Reference is taken from this Article:
A collation is a set of rules for comparing characters in a character set. It also has rules for sorting characters, and the proper order of two characters varies from language to language.
A collation compares two strings, deciding for example whether one word is greater than another, and sorts accordingly.
If you are using the “latin1” character set, you can use the “latin1_swedish_ci” collation.
You have to choose the right collation because a wrong collation may affect your database performance.

http://en.wikipedia.org/wiki/Collation
Collation is the assembly of written information into a standard order. (...) A collation algorithm such as the Unicode collation algorithm defines an order through the process of comparing two given character strings and deciding which should come before the other.

The collation is how SQL Server decides how to sort and compare text.
See MSDN.

Related

How to use the less-than operator on strings? [duplicate]

Today I viewed some query examples, and I found some string comparisons in the WHERE condition.
The comparison was made using the greater than (>) and less than (<) symbols. Is this a valid way to compare strings in SQL, and how does it behave? Does a string that is less than another one come before it in dictionary order? For example, is ball less than water? And is this comparison case sensitive? For example, in BALL < water, does the uppercase character affect the comparison?
I've googled for hours but I was not able to find anything that resolves these doubts.
The comparison operators (including < and >) "work" with string values as well as numbers.
For MySQL
By default, string comparisons are not case sensitive and use the current character set. The default in MySQL 5.5 is latin1 (cp1252 West European), which also works well for English; MySQL 8.0 defaults to utf8mb4.
String comparisons will be case sensitive when the collation of the strings being compared is case sensitive, i.e. the name of the collation ends in _cs (or _bin) rather than _ci. There's really no point in repeating all of the information that's available in the MySQL Reference Manual here.
MySQL Comparison Operators Reference: http://dev.mysql.com/doc/refman/5.5/en/comparison-operators.html
More information about MySQL charactersets/collations: http://dev.mysql.com/doc/refman/5.5/en/charset.html
To answer the specific questions you asked:
Q: is this a possible way to compare strings in SQL?
A: Yes, in both MySQL and SQL Server
Q: and how does it act?
A: A comparison operator returns a boolean, either TRUE, FALSE or NULL.
Q: does a string less than another one come before it in dictionary order? For example, is ball less than water?
A: Yes, because 'b' comes before 'w' in the character set collation, the expression
'ball' < 'water'
will return TRUE. (This depends on the character set and on the collation.)
Q: and this comparison is case sensitive?
A: Whether a particular comparison is case sensitive or not depends on the database server; by default, both SQL Server and MySQL are case insensitive.
In MySQL it is possible to make string comparisons case sensitive by specifying a collation that is case sensitive (the collation name will end in _cs rather than _ci).
Q: For example, in BALL < water, does the uppercase character affect the comparison?
A: By default, in both SQL Server and MySQL, the expression
'BALL' < 'water'
would return TRUE.
In Microsoft SQL Server, collation determines the dictionary rules for comparing and sorting character data with regard to:
case sensitivity
accent sensitivity
width sensitivity
kana sensitivity
SQL Server also includes binary collations where comparison and sorting is done by binary code point rather than dictionary rules. One can choose from many collations according to the desired sensitivity behavior. The default collation selected for Latin-based language locales during SQL installation is case insensitive and accent sensitive.
Collation is specified at the instance (during installation), database, and column level. Instance collation determines the collation of Instance-level objects like logins and database names as well as identifiers for variables, GOTO labels and temporary tables. Database collation (same as instance collation by default), determines the collation of database identifiers like table and column names as well as literal expressions. Column collation (same as database collation by default) determines the collation of that column.
It is certainly possible to compare strings using '<', '>', '<>', LIKE, BETWEEN, etc.
If you are using MyBatis or another XML-based technique to execute the SQL query, a bare < is not allowed inside the XML, so you have to wrap it as <![CDATA[your_symbol-here]]>:
'ball' <![CDATA[<]]> 'water'
Look at this interesting output from SQL Server. The code was meant to compare dates, but the values are strings, so they are compared character by character. It appears to work most of the time, but fails when the year changes:
SELECT TOP 1 'The ResultSet should be empty' FROM SYS.columns
WHERE '01/04/2023' < '07/11/2022'
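The reason the query above returns a row is that the two values are compared as text, one character at a time, not as dates. A minimal Python sketch of the same effect follows; the month/day/year reading in FORMAT is an assumption, since the original does not say which date format was intended.

```python
from datetime import datetime

a, b = "01/04/2023", "07/11/2022"

# Text comparison: '0' == '0', then '1' < '7', so a sorts before b.
as_strings = a < b

# Date comparison: parse first. FORMAT is an assumed month/day/year layout.
FORMAT = "%m/%d/%Y"
as_dates = datetime.strptime(a, FORMAT) < datetime.strptime(b, FORMAT)

# Year-first ISO strings happen to sort the same way as the dates they encode.
iso_strings = "2023-01-04" < "2022-07-11"
```

as_strings is True while as_dates and iso_strings are False, which is why comparing real date values (or year-first strings) avoids the surprise.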

Ordering data why does lowercase appear last

When ordering data in SQL Developer, why does the data with lowercase letters appear last?
for example
Adam, Ben, Charlotte, Matthew, emily
Why isn't it: Adam, Ben, Charlotte, emily, Matthew?
I don't necessarily want the answer to just changing it but why does it happen? Is there a setting that is ticked to make it happen or does it do it by default unless you write a statement for it not to do it?
Ordering in a database uses a collation. Typically, the collation is specified at the database level, but can be at the table field level and the query level.
A collation is an ordering for the characters used by a culture in a writing system script. If the human writing system itself doesn't define an ordering between two characters, the collation will likely fall back to a lexicographic ordering based on the character set of the collation. (Humans expect consistency even in the absence of rules that they are aware of.)
Many systems of collations include both case sensitive and case insensitive collations as well as accent sensitive and accent insensitive collations. (So, as many as 2 x 2 collations for the same culture and character set.)
So, somewhere your system has specified case sensitivity. You could order for your user (yourself, in this case?) by the preferred culture, case sensitivity, and accent sensitivity. But choose from collations for the same character set as the data because character set conversions can be lossy unless the source is a subset of the target.
See PL/SQL's documentation on collations.
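The ordering in the question is what a binary, case-sensitive comparison produces: every uppercase ASCII letter has a lower code than every lowercase one, so "Matthew" sorts before "emily". A short Python sketch of both orderings, where str.casefold stands in for a case-insensitive collation:

```python
names = ["emily", "Matthew", "Charlotte", "Ben", "Adam"]

# Code point order: 'M' is 77 and 'e' is 101, so all capitals come first.
binary_order = sorted(names)

# Case-insensitive order: compare the case-folded forms instead.
case_insensitive_order = sorted(names, key=str.casefold)
```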

How to choose collation of SQL Server database

What if I want to use a database to store different groups of special characters; how do I choose which collation to use? For example, if I set the collation to Croatian and want to use Russian Cyrillic and Japanese characters besides the Croatian special characters, which collation should I use?
Thanks,
Ilija
You'd use nvarchar to store the data.
Collation defines sorting and comparing.
That means you can store Croatian, Russian and Japanese in the same column.
But when you want to compare (WHERE MyColumn = #foo) or sort (ORDER BY MyColumn) you'll not get what you expect because of the collation.
However, you can use the COLLATE clause to change it if needed.
eg ORDER BY MyColumn COLLATE Japanese_something
I'd go for your most common option that covers most of your data. MSDN has this maybe useful article

I don't understand Collation? (Mysql, RDBMS, Character sets)

I understand character sets but I don't understand collation. I know you get a default collation with every character set in MySQL or any RDBMS but I still don't get it! Can someone please explain in layman's terms?
Thank you in advance ;-)
The main point of a database collation is determining how data is sorted and compared.
Case sensitivity of string comparisons
SELECT "New York" = "NEW YORK";
will return true for a case insensitive collation; false for a case sensitive one.
Which collation does which can be told by the _ci and _cs suffix in the collation's name. _bin collations do binary comparisons (strings must be 100% identical).
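To try the case-sensitivity difference without a MySQL server, the sketch below runs the same comparison in SQLite through Python's sqlite3 module. This is an assumption-laden stand-in: SQLite's built-in collations are called NOCASE and BINARY rather than _ci and _bin names, and NOCASE only folds ASCII letters. The helper compare is made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def compare(collation):
    """Evaluate 'New York' = 'NEW YORK' under the given SQLite collation."""
    sql = f"SELECT 'New York' = 'NEW YORK' COLLATE {collation}"
    return conn.execute(sql).fetchone()[0]


case_insensitive = compare("NOCASE")  # 1: the strings compare equal
binary = compare("BINARY")            # 0: the bytes differ
```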
Comparison of umlauts/accented characters
The collation also determines whether accented characters are treated as their Latin base counterparts in string comparisons.
SELECT "Düsseldorf" = "Dusseldorf";
SELECT "Èclair" = "Eclair";
will return true if the collation treats accented characters as their base letters, and false if it does not. You will need to read each collation's description to find out which is which.
String sorting
The collation influences the way strings are sorted.
For example,
Umlauts Ä Ö Ü are at the end of the alphabet in the Finnish/Swedish alphabet (latin1_swedish_ci)
they are treated as A O U in German DIN-1 sorting (latin1_german1_ci)
and as AE OE UE in German DIN-2 sorting (latin1_german2_ci, "phone book" sorting)
In latin1_spanish_ci, "ñ" (n-tilde) is a separate letter between "n" and "o".
These rules will result in different sort orders when accented or other non-ASCII characters are used.
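The three umlaut rules listed above can be compared side by side with a simplified Python model. This is only a sketch and not MySQL's implementation: the Swedish key handles just å, ä and ö, ties between equal keys are ignored, and all names (SWEDISH_ORDER, swedish_key, din1_key, din2_key) are invented here.

```python
import unicodedata

SWEDISH_ORDER = "abcdefghijklmnopqrstuvwxyzåäö"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ORDER)}

DIN1 = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
DIN2 = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def prepare(word):
    """Lowercase and use precomposed characters so 'Ö' becomes a single 'ö'."""
    return unicodedata.normalize("NFC", word).lower()


def swedish_key(word):
    """Å, Ä, Ö are separate letters after Z."""
    return [SWEDISH_RANK.get(ch, len(SWEDISH_ORDER)) for ch in prepare(word)]


def din1_key(word):
    """German dictionary sorting: umlauts count as their base vowels."""
    return prepare(word).translate(DIN1)


def din2_key(word):
    """German phone book sorting: umlauts expand to AE, OE, UE."""
    return prepare(word).translate(DIN2)


words = ["Ofen", "Zug", "Öl", "Ohr"]
swedish = sorted(words, key=swedish_key)  # Ofen, Ohr, Zug, Öl
din1 = sorted(words, key=din1_key)        # Ofen, Ohr, Öl, Zug
din2 = sorted(words, key=din2_key)        # Öl, Ofen, Ohr, Zug
```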
Using collations at runtime
You have to choose a collation for your table and columns, but if you don't mind the performance hit, you can force database operations into a certain collation at runtime using the COLLATE keyword.
This will sort a table by the name column using German DIN-2 sorting rules (table is a reserved word in MySQL, so the placeholder table name is quoted with backticks):
SELECT name
FROM `table`
ORDER BY name COLLATE latin1_german2_ci;
Using COLLATE at runtime will have performance implications, as each column has to be converted during the query. So think twice before applying this to large data sets.
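The runtime COLLATE idea can be tried without a MySQL server. The sketch below uses SQLite through Python's sqlite3 module, where Connection.create_collation registers a custom comparison function. The collation name PHONEBOOK, the table people and the comparison function are all assumptions made for this example; the function only approximates DIN-2 rules and is not latin1_german2_ci.

```python
import sqlite3

# Approximate DIN-2 "phone book" rules: umlauts expand to two letters.
DIN2 = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def phonebook_compare(a, b):
    """Return -1, 0 or 1, as SQLite expects from a collation callback."""
    ka, kb = a.lower().translate(DIN2), b.lower().translate(DIN2)
    return (ka > kb) - (ka < kb)


conn = sqlite3.connect(":memory:")
conn.create_collation("PHONEBOOK", phonebook_compare)
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany("INSERT INTO people VALUES (?)",
                 [("Muster",), ("Mütze",), ("Mulde",)])

# Column default (binary) order versus the collation forced at query time.
default_order = [row[0] for row in
                 conn.execute("SELECT name FROM people ORDER BY name")]
phonebook_order = [row[0] for row in
                   conn.execute("SELECT name FROM people "
                                "ORDER BY name COLLATE PHONEBOOK")]
```

With the default ordering Mütze comes last because ü has a higher byte value than any ASCII letter; with COLLATE PHONEBOOK it is compared as "muetze" and comes first.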
MySQL Reference:
Character Sets and Collations That MySQL Supports
Examples of the Effect of Collation
Collation issues
Collation is information about how strings should be sorted and compared.
It covers, for example, case sensitivity (e.g. whether a = A), special character considerations (e.g. whether a = á), and character order (e.g. whether O < Ö).