Which is the best collation for European + English languages? - sql-server-2005

Hi there,
I am developing for European languages and also for English; the strings are stored as NVARCHAR in SQL Server 2005.
So, which is the best collation to use? Does "Latin1_General_CI_AS" cover all of them?
There are variations as well, like
Latin1_General_CP1_CI_AS, Latin1_General_BIN, Latin1_General_BIN2, etc.
Comments/suggestions appreciated.
Regards,
DEE

For general-purpose sorting, "Latin1_General" is probably the best choice for Western European and English languages.
I believe that if the code page (e.g., CP1) is not specified, then it defaults to code page 1252 (which is also what CP1 signifies). So my understanding is that Latin1_General_CI_AS and Latin1_General_CP1_CI_AS are equivalent. Given that, my opinion is that Latin1_General_CP1_CI_AS would be the better choice for clarity reasons. Whether you use CI_AS, CS_AS, or CI_AI is purely a usability issue based on whether you want case sensitivity and/or accent sensitivity. With CI, "a" == "A" and with AI, "á" == "â".
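To make that concrete, here is a small T-SQL sketch of my own (not from the original answer) showing how the sensitivity flags play out; CASE is used because T-SQL comparisons are not standalone expressions:

-- Case-insensitive, accent-sensitive (CI_AS): 'a' equals 'A', but 'á' differs from 'â'.
SELECT CASE WHEN N'a' = N'A' COLLATE Latin1_General_CI_AS THEN 1 ELSE 0 END;  -- 1
SELECT CASE WHEN N'á' = N'â' COLLATE Latin1_General_CI_AS THEN 1 ELSE 0 END;  -- 0
-- Case-insensitive, accent-insensitive (CI_AI): now the accented forms match too.
SELECT CASE WHEN N'á' = N'â' COLLATE Latin1_General_CI_AI THEN 1 ELSE 0 END;  -- 1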
The _BIN and _BIN2 options signify that the collation will be binary based on the code point values. For sorting purposes, you probably do not want that because the order would not necessarily match any kind of dictionary order. However, if you are only using the index for searching for data, then one of those might be appropriate because it could be faster. Relatively little computation is necessary to convert a character value to the associated key value.
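As a quick illustration of why binary order rarely matches dictionary order (again a sketch of mine, using UNION ALL so it also runs on SQL Server 2005):

-- Code-point order puts every uppercase letter before every lowercase one.
SELECT v FROM (SELECT N'apple' AS v
               UNION ALL SELECT N'Banana'
               UNION ALL SELECT N'cherry') t
ORDER BY v COLLATE Latin1_General_BIN2;
-- Banana, apple, cherry
SELECT v FROM (SELECT N'apple' AS v
               UNION ALL SELECT N'Banana'
               UNION ALL SELECT N'cherry') t
ORDER BY v COLLATE Latin1_General_CI_AS;
-- apple, Banana, cherry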
Edit: As Martin points out in the comments, the code page will not matter unless you are using char, varchar, or text. If you stick completely with Unicode (nchar, nvarchar, ntext), then the code page will not come into play. If you convert a Unicode character to a single-byte character type, though, it will be used.


Inserting / Creating SQL Database Values

First Name
Phone Number
Email
Address
For the above values, what would be the
Type
Length
Collation
Index
And why? Also, is there a guide somewhere that I can use to determine these answers for myself? Thanks!
For names, should I use varchar / text / tinytext / blob? What is the typical name length?
If you're only going to support "normal" Western European / English names, then a (non-Unicode) varchar type should do.
If you need to support Arabic, Hebrew, Japanese, Chinese, Korean or other Asian languages, then pick a Unicode string type to store those characters. Those typically use 2 bytes per character, but they're the only viable options if you need to support non-European languages and character sets.
As for length: pick a reasonable value, but don't use varchar(67), varchar(91), varchar(55) and so forth - try to settle on a few "default" lengths, like varchar(20) (for things like a phone number or a zip code), varchar(50) for a first name, and maybe varchar(100) for a last name / city name, etc. Try to pick a few lengths, and use those throughout.
E-mail addresses have a max length of 254 characters, as defined in the SMTP RFCs (the figure is often quoted as 255)
Windows file system paths (file names including path) have a Windows limitation of 260 characters
Use such knowledge to "tune" your string lengths. I would advise against just using blob type / TEXT / VARCHAR(MAX) for everything - those types are intended for really long text - use them sparingly, they're often accompanied by less than ideal access mechanisms and thus performance drawbacks.
Indexes: in general, don't over-index your tables - devs most often have too many indexes on their tables, without fully understanding if and how those will be used (or not used). Every single index causes maintenance overhead when inserting, updating and deleting data - indexes aren't free; use them only if you really know what you're doing and see an overall performance benefit (for the whole system) when adding one.
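Pulling the advice above together, a hypothetical table might look like this in T-SQL (every name, length and index here is illustrative, not prescriptive):

CREATE TABLE Contact (
    ContactId   int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    FirstName   nvarchar(50)  NOT NULL,   -- Unicode, in case of international names
    PhoneNumber varchar(20)   NULL,       -- digits and punctuation only
    Email       varchar(254)  NULL,       -- max address length per the SMTP RFCs
    Address     nvarchar(100) NULL
);
-- One deliberate index for the expected lookup pattern; nothing more.
CREATE INDEX IX_Contact_Email ON Contact (Email);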
This depends somewhat on the database platform and the underlying requirements. These are typically all nvarchar or varchar columns. You know your requirements best, so you are the best judge of the appropriate lengths. For collation you can usually use the default, but again it depends on your situation. I suggest reading up on indexes to work out which indexes you need.
I agree with the others, you should do some reading on beginning SQL. I also suggest just implementing something and you can always change it after the fact. You'll learn a lot from just trying it and running into issues that you will then need to solve.
Good luck!

Is there a database that accepts special characters by default (without converting them)?

I am currently starting from scratch choosing a database to store data collected from a suite of web forms. Humans will be filling out these forms, and as they're susceptible to using international characters, especially those humans named José and François and أسامة and 布鲁斯, I wanted to start with a modern database platform that accepts all types (so to speak), without conversion.
Q: Does a database exist that, from the start, accepts a wide diversity of the characters found in modern typefaces? If so, what are the drawbacks of a database that doesn't need to convert as much data in order to store it?
// Anticipating two answers that I'm not looking for:
I found many answers explaining how someone could CONVERT (or encode) a special character, like é or a copyright symbol ©, into a database-legal character sequence like &copy; (for ©) so that a database can then accept it. This requires a conversion/translation layer to shuttle data into and out of the database. I know conversion has to happen at some level, just as the letter z is reducible to 1's and 0's, but I'm really talking about finding a human-readable database, one that doesn't need to translate.
I also see suggestions that people change the character encoding of their current database to one that accepts a wider range of characters. This is a good solution for someone who is carrying over a legacy system and wants to make it relevant to the wider range of characters that early computers, and the early web, didn't anticipate. I'm not starting with a legacy system. I'm looking for some modern database options.
Yes, there are databases that support large character sets. How to accomplish this is different from one database to another. For example:
In MS SQL Server you can use the nchar, nvarchar and ntext data types to store Unicode (UCS-2) text.
In MySQL you can choose UTF-8 as the encoding for a table (in modern versions, utf8mb4 for full Unicode coverage), so that it will be able to store Unicode text.
For any database that you consider using, you should look for Unicode support to see if it can handle large character sets.
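As a minimal SQL Server sketch (table and column names invented): the N prefix on the literals keeps them Unicode end to end; without it, SQL Server would squeeze the text through the database's code page first.

CREATE TABLE FormEntry (
    EntryId int IDENTITY(1,1) PRIMARY KEY,
    Name    nvarchar(100) NOT NULL  -- Unicode column; no conversion layer needed
);
INSERT INTO FormEntry (Name)
VALUES (N'José'), (N'François'), (N'أسامة'), (N'布鲁斯');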

Which characters count as the same character under UTF-8 Unicode collation? And what VB.net function can be used to merge them?

Also, what's the VB.net function that will map all those different characters into their most standard form?
For example, tolower would map A and a to the same character, right?
I need the same function for these characters
german
ß === s
Ü === u
Χιοσ == Χίος
Otherwise, sometimes I insert Χιοσ and later, when I insert Χίος, MySQL complains that the ID already exists.
So I want to create a unique ID that maps all those strange characters into a more stable one.
For the encoding aspect of the thing, look at String.Normalize. Notice also its overload that specifies a particular normal form to which you want to convert the string, but the default normal form (C) will work just fine for nearly everyone who wants to "map all those different characters into their most standard form".
However, things get more complicated once you move into the database and deal with collations.
Unicode normalization does not ever change character case. It covers only cases where the characters are basically equivalent: they look the same¹ and mean the same thing. For example,
Χιοσ != Χίος.
The two sigma characters are considered non-equivalent, and the accented iota (\u1F30) is equivalent to a sequence of two characters, the plain iota (\u03B9) and the accent (\u0313).
Your real problem seems to be that you are using Unicode strings as primary keys, which is not the most popular database design practice. Such primary keys take up more space than needed and are bound to change over time (even if the initial version of the application does not plan to support that). Oh, and I forgot their sensitivity to collations. Instead of identifying records by Unicode strings, have the database schema generate meaningless sequential integers for you as you insert the records, and demote the Unicode strings to mere attributes of the records. This way they can be the same or different as you please.
It may still be useful to normalize them before storing for the purpose of searching and safer subsequent processing; but the particular case insensitive collation that you use will no longer restrict you in any way.
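A minimal sketch of that design in MySQL (the asker's database; all names here are invented):

-- Meaningless integer key; the Unicode string is demoted to an attribute.
CREATE TABLE place (
    place_id INT AUTO_INCREMENT PRIMARY KEY,
    name     VARCHAR(100) NOT NULL
) CHARACTER SET utf8mb4;
-- Χιοσ and Χίος can now coexist, or be deduplicated in application code
-- (e.g. after String.Normalize), as you please.
INSERT INTO place (name) VALUES ('Χιοσ'), ('Χίος');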
¹ Almost the same, in the case of compatibility normalization as opposed to canonical normalization.

How to convert foreign characters to English characters in SQL Query?

I have to create an SQL function that converts special characters and international characters (French, Chinese, ...) to English.
Is there a special function in SQL I can use?
Thanks for your help.
If you are after English names for the characters, that is an achievable goal, as they all have published names as part of the Unicode standard.
See for example:
http://www.unicode.org/ucd/
http://www.unicode.org/Public/UNIDATA/
Your task then is simply to turn the list of Unicode characters into a table with 100,000 or so rows. Unfortunately, the names you get will be things like ARABIC LIGATURE LAM WITH MEEM MEDIAL FORM.
On the other hand, if you want to actually translate the meaning, you need to be looking at machine translation software. Both Microsoft and Google have well-known cloud translation offerings and there are several other well-thought of products too.
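A sketch of that lookup table (the table itself is my invention; the data would be loaded from UnicodeData.txt):

-- One row per Unicode code point.
CREATE TABLE UnicodeName (
    CodePoint int          PRIMARY KEY,
    Name      varchar(100) NOT NULL
);
INSERT INTO UnicodeName VALUES (233, 'LATIN SMALL LETTER E WITH ACUTE');  -- é
-- T-SQL's UNICODE() returns the code point of a string's first character:
SELECT Name FROM UnicodeName WHERE CodePoint = UNICODE(N'é');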
I think the short answer is you can't unless you narrow your requirements a lot. It seems you want to take a text sample, A, and convert it into romanized text B.
There are a few problems to tackle:
Languages are typically not romanized on a single character basis. The correct pronunciation of a character is often dependent on the characters and words around it, and can even have special rules for just one word (learning English can be tough because it is filled with these, having borrowed words from many languages without normalizing the spelling).
Even if you code rules for every language you want to support you still have homographs, words that are spelled using exactly the same characters, but that have different pronunciations (and thus romanization) depending on what was meant - for example "sow" meaning a pig, or "sow" (where the w is silent) meaning to plant seeds.
And then you get into the problem of what language you are romanizing: Characters and even words are not unique to one language, but the actual meaning and romanization can vary. The fact that many languages include loan words from those language they share characters with complicates any attempt to automatically determine which language you are trying to romanize.
Given all these difficulties, what it is you actually want to achieve (what problem are you solving)?
You mention French among the languages you want to "convert" into English - yet French (with its accented characters) is already written in the Roman alphabet. Even everyday words used in English occasionally make use of accented characters, though these are rare enough that the meaning and pronunciation are understood even if they are omitted (e.g., résumé).
Is your problem really that you can't store Unicode/extended ASCII? There are numerous ways to correct or work around that.
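If it is not, a commonly cited T-SQL trick may help with the Latin part (an assumption on my part that it fits your setup, and not behavior Microsoft guarantees): force the Unicode text through a Greek code page (1253), which has the plain Latin letters but none of the accented ones, so the "best fit" conversion substitutes the base letters.

SELECT CAST(N'Crème brûlée à São Paulo'
            COLLATE SQL_Latin1_General_CP1253_CI_AI AS varchar(50));
-- typically: Creme brulee a Sao Paulo
-- Chinese, Arabic, etc. just become '?'; this only folds Latin accents.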

What does collation mean?

What does collation mean in SQL, and what does it do?
Collation can be simply thought of as sort order.
In English (and its strange cousin, American), collation may be a pretty simple matter consisting of ordering by ASCII code.
Once you get into those strange European languages with all their accents and other features, collation changes. For example, though the different accented forms of a may exist at disparate code points, they may all need to be sorted as if they were the same letter.
Besides the "accented letters are sorted differently than unaccented ones" in some Western European languages, you must take into account the groups of letters, which sometimes are sorted differently, also.
Traditionally, in Spanish, "ch" was considered a letter in its own right, same with "ll" (both of which represent a single phoneme), so a list would get sorted like this:
caballo
cinco
coche
charco
chocolate
chueco
dado
(...)
lámpara
luego
llanta
lluvia
madera
Notice that all the words starting with a single c go together, except words starting with ch, which go after them; the same happens with ll-starting words, which go after all the words starting with a single l. This is the ordering you'll see in old dictionaries and encyclopedias, sometimes even today from very conservative organizations.
The Royal Academy of the Language changed this to make it easier for Spanish to be accommodated in the computing world. Nevertheless, ñ is still considered a different letter than n, and goes after it and before o. So this is a correctly ordered list:
Namibia
número
ñandú
ñú
obra
ojo
By selecting the correct collation, you get all this done for you, automatically :-)
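SQL Server, for instance, ships both behaviors; here is a sketch of mine (not the answerer's) showing them:

-- Traditional_Spanish sorts "ch" after "c" and "ll" after "l";
-- Latin1_General would sort plain letter by letter instead.
SELECT w FROM (VALUES (N'charco'), (N'cinco'), (N'coche'),
                      (N'llanta'), (N'luego')) AS t(w)
ORDER BY w COLLATE Traditional_Spanish_CI_AS;
-- cinco, coche, charco, luego, llanta
-- Modern_Spanish keeps ñ between n and o:
SELECT w FROM (VALUES (N'Namibia'), (N'número'), (N'ñandú'), (N'obra')) AS t(w)
ORDER BY w COLLATE Modern_Spanish_CI_AS;
-- Namibia, número, ñandú, obra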
Rules that tell how to compare and sort strings: letter order; whether case matters; whether diacritics matter; etc.
For instance, if you want all letters to be different (say, if you store filenames in UNIX), you use UTF8_BIN collation:
SELECT 'A' COLLATE UTF8_BIN = 'a' COLLATE UTF8_BIN
---
0
If you want to ignore case and diacritics differences (say, for a search engine), you use UTF8_GENERAL_CI collation:
SELECT 'A' COLLATE UTF8_GENERAL_CI = 'ä' COLLATE UTF8_GENERAL_CI
---
1
As you can see, this collation (comparison rule) considers capital A and lowercase ä the same letter, ignoring case and diacritic differences.
Collation defines how you sort and compare string values
For example, it defines how to deal with:
accents (ä, à, a, etc.)
case (Aa)
the language context:
In a French collation, cote < côte < coté < côté.
In the SQL Server Latin1 default, cote < coté < côte < côté.
ASCII sorts (a binary collation)
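Both French orderings above can be reproduced directly; a T-SQL sketch of mine:

SELECT w FROM (VALUES (N'cote'), (N'coté'), (N'côte'), (N'côté')) AS t(w)
ORDER BY w COLLATE French_CI_AS;
-- cote, côte, coté, côté  (French weighs accents from the right-hand end)
SELECT w FROM (VALUES (N'cote'), (N'coté'), (N'côte'), (N'côté')) AS t(w)
ORDER BY w COLLATE Latin1_General_CI_AS;
-- cote, coté, côte, côté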
Collation means assigning some order to the characters of an alphabet, say, ASCII or Unicode.
Suppose you have 3 characters in your alphabet - {A,B,C}. You can define some example collations for it by assigning integral values to the characters:
Example 1 = {A=1,B=2,C=3}
Example 2 = {C=1,B=2,A=3}
Example 3 = {B=1,C=2,A=3}
As a matter of fact, you can define n! collations on an alphabet of size n. Given such an order, different sorting routines like LSD/MSD string sorts make use of it for sorting strings.
Collation determines how your data is sorted and compared. It's very often important with regard to internationalization, e.g., how do you sort Japanese kanji?
If you Google "collation" and "SQL Server", you'll find plenty of articles discussing it!
Reference is taken from this article:
A collation is a set of rules for comparing characters in a character set. It also has rules for sorting characters, and the proper order of two characters varies from language to language.
A collation compares two strings: if one word is greater than another, it sorts accordingly.
If you are using the "latin1" character set, you can use the "latin1_swedish_ci" collation.
You have to choose the right collation, because the wrong collation may affect your database performance.
http://en.wikipedia.org/wiki/Collation
Collation is the assembly of written information into a standard order. (...) A collation algorithm such as the Unicode collation algorithm defines an order through the process of comparing two given character strings and deciding which should come before the other.
Collation is how SQL Server decides how to sort and compare text.
See MSDN.