The book I am reading says that
SQL Server supports two kinds of character data types—regular and Unicode. Regular data types include CHAR and VARCHAR, and Unicode data types include NCHAR and NVARCHAR. The difference is that regular characters use one byte of storage for each character, while Unicode characters require two bytes per character. With one byte of storage per character, a choice of a regular character type for a column restricts you to only one language in addition to English because only 256 (2^8) different characters can be represented by a single byte.
What I understand from this is that if I use varchar, then I can use only one language (for example Hindi, an Indian language) along with English.
But when I run this:
CREATE TABLE NameTable
(
    NameColumn varchar(MAX) COLLATE Indic_General_90_CI_AS_KS
);
It gives me the error: "Collation 'Indic_General_90_CI_AS_KS' is supported on Unicode data types only and cannot be applied to char, varchar or text data types."
So where have I misunderstood the author?
Thanks
You can find a list of collations here, along with the encoding each one uses.
Certain collations apply only to single-byte encodings: 128 of the 256 possible values are used for standard ASCII, and the remaining 128 are available for other characters. Hindi does not fit in 128 characters, so a single-byte collation cannot be applied to it.
You will have to use an nvarchar (or another 'N'-prefixed character type).
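For example (a minimal sketch reusing the table and column names from the question), the same collation is accepted once the column is declared as nvarchar:

CREATE TABLE NameTable
(
    NameColumn nvarchar(MAX) COLLATE Indic_General_90_CI_AS_KS
);

-- The N prefix keeps the literal in Unicode all the way to the column
INSERT INTO NameTable (NameColumn) VALUES (N'नमस्ते');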
-- edit --
French_CI_AS as a non-English example.
One of the things collations enable is language- and locale-specific ordering of characters. Therefore French != Latin.
Another example is Arabic_CI_AS.
This is a single-byte encoding with the Arabic alphabet.
Use this in your SQL statement, where "content" is an application variable containing the Arabic string you want to insert:
update Table set contents = convert(text, N'" + content + "' collate Arabic_CI_AS)
It works fine.
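For reference, once the application has concatenated the value in, the statement that reaches the server looks roughly like this (the table and column names are placeholders, as in the snippet above):

UPDATE SomeTable
SET contents = CONVERT(text, N'مرحبا كيف حالك' COLLATE Arabic_CI_AS);

The COLLATE clause makes the conversion from the Unicode literal to the single-byte text column go through the Arabic code page, so the characters survive.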
You can use this:
name = N'مرحبا كيف حالك'
Related
What is the difference between the character set and the national character set in Oracle?
This is answered in Oracle's documentation: Choosing a Character Set
Character Set Encoding
When computer systems process characters, they use numeric codes instead of the graphical representation of the character. For example, when the database stores the letter A, it actually stores a numeric code that the computer system interprets as the letter. These numeric codes are especially important in a global environment because of the potential need to convert data between different character sets.
What is an Encoded Character Set?
You specify an encoded character set when you create a database. Choosing a character set determines what languages can be represented in the database. It also affects:
How you create the database schema
How you develop applications that process character data
How the database works with the operating system
Database performance
Storage required for storing character data
A group of characters (for example, alphabetic characters, ideographs, symbols, punctuation marks, and control characters) can be encoded as a character set. An encoded character set assigns a unique numeric code to each character in the character set. The numeric codes are called code points or encoded values. For example, in the ASCII character set the letter A is assigned the hexadecimal code value 0x41.
Choosing a National Character Set
The term national character set refers to an alternative character set that enables you to store Unicode character data in a database that does not have a Unicode database character set. Another reason for choosing a national character set is that the properties of a different character encoding scheme may be more desirable for extensive character processing operations.
SQL NCHAR, NVARCHAR2, and NCLOB data types support Unicode data only. You can use either the UTF8 or the AL16UTF16 character set. The default is AL16UTF16.
Oracle recommends using SQL CHAR, VARCHAR2, and CLOB data types in AL32UTF8 database to store Unicode character data. Use of SQL NCHAR, NVARCHAR2, and NCLOB should be considered only if you must use a database whose database character set is not AL32UTF8.
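The documentation's point about stored numeric codes is easy to see with Oracle's built-in DUMP function (a quick illustration; the exact output depends on your character sets):

SELECT DUMP('A')  AS char_dump,
       DUMP(N'A') AS nchar_dump
FROM dual;
-- char_dump  : something like "Typ=96 Len=1: 65"  -- 65 is the code point of A
-- nchar_dump : two bytes per character when the national character set is AL16UTF16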
In Oracle you have these two character sets mainly for historical reasons. In earlier times a typical setup was
Character Set: US7ASCII
National Character Set: WE8ISO8859P1
The character set was used for the generic part of your application. Then, for customers in other countries around the world, you set the National Character Set according to their local requirements.
Nowadays, with Unicode (i.e. AL32UTF8), there is actually no reason to use the National Character Set any more. More and more of the newer native Oracle functions do not even support the National Character Set at all.
The only reason could be a heavy use of Asian characters where AL16UTF16 is more efficient in terms of space.
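If you want to check which of the two character sets a database actually uses, you can query the NLS database parameters:

SELECT parameter, value
FROM nls_database_parameters
WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');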
I know there are other similar questions but they are specific to certain special characters.
I am looking for a solution by which I can filter out all the records having Unicode characters in a column in DB2.
As mustaccio said, all characters are Unicode.
Probably you are having problems with text characters from other languages. It is very common for native English speakers not to notice the problem, because the 26 basic Latin letters are standard in all encodings.
If you are having that problem, you have to formulate your problem in another way. Something like:
Taking all Unicode characters as a set, I would like to know which characters fall outside the subset of valid letters in my language (Spanish in my case).
With this in mind, you can query your data, and perform an analysis of which values are not part of your language. You can do that with a regular expression.
-- Returns 'true' if TEXT contains only characters considered valid for Spanish,
-- 'false' otherwise.
CREATE OR REPLACE FUNCTION ENCODING_VALIDATION_SPANISH (
    IN TEXT VARCHAR(256)
) RETURNS VARCHAR(32)
    LANGUAGE SQL
    PARAMETER CCSID UNICODE
    SPECIFIC F_ENC_VAL_SP
    DETERMINISTIC
    NO EXTERNAL ACTION
    READS SQL DATA
F_ENC_VAL_SP: BEGIN
    DECLARE RET VARCHAR(32);
    SET RET = XMLCAST (
        -- The next line defines the valid characters for the Spanish language:
        -- basic Latin letters, some punctuation, the accented vowels, ü and ñ.
        XMLQUERY ('fn:matches($TEXT,"^[A-Za-z\- \.,\?\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DC\u00FC\u00D1\u00F1]*$")'
            PASSING TEXT AS "TEXT"
        ) AS VARCHAR(32));
    RETURN RET;
END F_ENC_VAL_SP #
I wrote an article where I explain this problem: http://angocadb2.blogspot.fr/2013/09/validate-encoding-problems-in-db2.html
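A usage sketch (the table and column names here are hypothetical): once the function is created, you can flag the rows that contain characters outside the Spanish subset like this:

SELECT ID, NAME
FROM CUSTOMERS
WHERE ENCODING_VALIDATION_SPANISH(NAME) = 'false';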
As well as CHAR (CHARACTER) and VARCHAR (CHARACTER VARYING), SQL offers an NCHAR (NATIONAL CHARACTER) and NVARCHAR (NATIONAL CHARACTER VARYING) type. In some databases, this is the better datatype to use for character (non-binary) strings:
In SQL Server, NCHAR is stored as UTF-16LE and is the only way to reliably store non-ASCII characters, CHAR being a single-byte codepage only;
In Oracle, NVARCHAR may be stored as UTF-16 or UTF-8 rather than a single-byte collation;
But in MySQL, NVARCHAR is VARCHAR, so it makes no difference, either type can be stored with UTF-8 or any other collation.
So, what does NATIONAL actually conceptually mean, if anything? The vendors' docs only tell you about what character sets their own DBMSs use, rather than the actual rationale. Meanwhile the SQL92 standard explains the feature even less helpfully, stating only that NATIONAL CHARACTER is stored in an implementation-defined character set. As opposed to a mere CHARACTER, which is stored in an implementation-defined character set. Which might be a different implementation-defined character set. Or not.
Thanks, ANSI. Thansi.
Should one use NVARCHAR for all character (non-binary) storage purposes? Are there currently-popular DBMSs in which it will do something undesirable, or which just don't recognise the keyword (or N'' literals)?
"NATIONAL" in this case means characters specific to different nationalities. Far east languages especially have so many characters that one byte is not enough space to distinguish them all. So if you have an english(ascii)-only app or an english-only field, you can get away using the older CHAR and VARCHAR types, which only allow one byte per character.
That said, most of the time you should use NCHAR/NVARCHAR. Even if you don't think you need to support (or potentially support) multiple languages in your data, even english-only apps need to be able to sensibly handle security attacks using foreign-language characters.
In my opinion, about the only place where the older CHAR/VARCHAR types are still preferred is for frequently referenced, ASCII-only internal codes and data on platforms like SQL Server that support the distinction: data that would be the equivalent of an enum in a client language like C++ or C#.
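A sketch of the kind of column that last point has in mind (the table and the codes are hypothetical):

CREATE TABLE OrderStatus (
    StatusCode  char(3)       NOT NULL PRIMARY KEY,  -- ASCII-only internal code, e.g. 'NEW', 'SHP', 'CAN'
    Description nvarchar(100) NOT NULL               -- user-facing text stays Unicode
);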
Meanwhile the SQL92 standard explains the feature even less helpfully, stating only that NATIONAL CHARACTER is stored in an implementation-defined character set. As opposed to a mere CHARACTER, which is stored in an implementation-defined character set. Which might be a different implementation-defined character set. Or not.
Coincidentally, this is the same "distinction" the C++ standard makes between char and wchar_t. A relic of the Dark Ages of Character Encoding, when every language/OS combination had its own character set.
Should one use NVARCHAR for all character (non-binary) storage purposes?
It is not important whether the declared type of your column is VARCHAR or NVARCHAR. But it is important to use Unicode (whether UTF-8, UTF-16, or UTF-32) for all character storage purposes.
Are there currently-popular DBMSs in which it will do something undesirable?
Yes: In MS SQL Server, using NCHAR makes your (English) data take up twice as much space. Unfortunately, UTF-8 isn't supported yet.
EDIT: SQL Server 2019 finally introduced UTF-8 support.
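On SQL Server 2019 or later, a sketch of what that looks like (Latin1_General_100_CI_AS_SC_UTF8 is one of the UTF-8 enabled collations; the table is just an example):

CREATE TABLE Utf8Demo (
    Name varchar(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8  -- varchar, but stored as UTF-8
);
INSERT INTO Utf8Demo (Name) VALUES (N'Guðjón');  -- non-ASCII characters survive, plain ASCII stays one byte each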
In Oracle, the database character set can be a multi-byte character set, so you can store all manner of characters in there, but you need to understand and define the length of the columns appropriately (in either BYTES or CHARACTERS).
NVARCHAR gives you the option to have a database character set that is a single-byte (which reduces the potential for confusion between BYTE or CHARACTER sized columns) and use NVARCHAR as the multi-byte. See here.
Since I predominantly work with English data, I'd go with a multi-byte character set (UTF-8 mostly) as the database character set and ignore NVARCHAR. If I inherited an old database which was in a single-byte character set and was too big to convert, I might use NVARCHAR. But I'd prefer not to.
I am storing first names and last names with up to 30 characters each. Which is better: varchar or nvarchar?
I have read that nvarchar takes up twice as much space compared to varchar and that nvarchar is used for internationalization.
So what do you suggest I should use: nvarchar or varchar?
Also, please let me know about the performance of both. Is the performance the same, or do they differ? Space is not a big issue; the issue is performance.
Basically, nvarchar means you can handle lots of alphabets, not just regular English. Technically, it means unicode support, not just ANSI. This means double-width characters or approximately twice the space. These days disk space is so cheap you might as well use nvarchar from the beginning rather than go through the pain of having to change during the life of a product.
If you're certain you'll only ever need to support one language you could stick with varchar, otherwise I'd go with nvarchar.
This has been discussed on SO before here.
EDITED: changed ascii to ANSI as noted in comment.
First of all, to clarify, nvarchar stores Unicode data while varchar stores ANSI (8-bit) data. They function identically, but nvarchar takes up twice as much space.
Generally, I prefer storing user names using varchar datatypes unless those names have characters which fall out of the boundary of characters which varchar can store.
It also depends on the database collation. For example, you will not be able to store Russian characters in a varchar field if your database collation is a Latin one such as Latin1_General_CS_AS. But if you are working on a local application that will be used only in Russia, you would set the database collation to Russian. This allows you to enter Russian characters in a varchar field, saving some space.
But nowadays most applications being developed are international, so you have to decide for yourself which users will be signing up, and choose the data type based on that.
I have read that nvarchar takes up twice as much space compared to varchar.
Yes.
nvarchar is used for internationalization.
Yes.
So what do you suggest should I use: nvarchar or varchar?
It depends on the application.
By default go with nvarchar. There is very little reason to go with varchar these days, and every reason to go with nvarchar (allows international characters; as discussed).
varchar is 1 byte per character, nvarchar is 2 bytes per character.
You will use more space with nvarchar but there are many more allowable characters. The extra space is negligible, but you may miss those extra characters in the future. Even if you don't expect to require internationalization, people will often have non-English characters (e.g. é, ñ or ö) in their names.
I would suggest you use nvarchar.
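A quick way to see what you may miss with varchar (a sketch that assumes the database default collation uses a Latin code page, e.g. Latin1_General_CI_AS):

DECLARE @v varchar(30)  = N'Γιώργος',   -- Greek letters are not in the Latin1 code page
        @n nvarchar(30) = N'Γιώργος';
SELECT @v AS varchar_value,    -- comes back as '???????'
       @n AS nvarchar_value;   -- comes back intact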
I have read that nvarchar takes up twice as much space compared to varchar.
Yes. According to Microsoft: "Storage size, in bytes, is two times the number of characters entered + 2 bytes" (http://msdn.microsoft.com/en-us/library/ms186939(SQL.90).aspx).
But storage is cheap; I never worry about a few extra bytes.
Also, save yourself trouble in the future and set the maximum widths to something more generous, like 100 characters. There is absolutely no storage overhead to this when you're using varchar or nvarchar (as opposed to char/nchar). You never know when you're going to encounter a triple-barrelled surname or some long foreign name which exceeds 30 characters.
nvarchar is used for internationalization.
nvarchar can store any unicode character, such as characters from non-Latin scripts (Arabic, Chinese, etc). I'm not sure how your application will be taking data (via the web, via a GUI toolkit, etc) but it's likely that whatever technology you're using supports unicode out of the box. That means that for any user-entered data (such as name) there is always the possibility of receiving non-Latin characters, if not now then in the future.
If I was building a new application, I would use nvarchar. Call it "future-proofing" if you like.
The nvarchar type is Unicode, so it can handle just about any character that exists in any language on the planet. The characters are stored as UTF-16 or UCS-2 (not sure which, and the differences are subtle), so each character uses two bytes.
The varchar type uses an 8-bit character set, so it's limited to the 256 characters of the character set that you choose for the field. There are different character sets that handle different character groups, so it's usually sufficient for text local to a country or a region.
If varchar works for what you want to do, you should use that. It's a bit less data, so it's overall slightly faster. If you need to handle a wide variety of characters, use nvarchar.
On performance:
A reason to use varchar over nvarchar is that you can have twice as many characters in your indexes: index keys are limited to 900 bytes.
On usability:
If the application is only ever intended for an English audience and contains English names, use varchar.
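A sketch of that index-key trade-off (the table and index names are made up; the 900-byte limit applies to index keys, and newer SQL Server versions allow 1700 bytes for nonclustered keys):

CREATE TABLE KeyDemo (
    VKey varchar(900)  NOT NULL,  -- 900 characters = 900 bytes, fits the limit
    NKey nvarchar(450) NOT NULL   -- 450 characters = 900 bytes, also at the limit
);
CREATE INDEX IX_KeyDemo_V ON KeyDemo (VKey);
CREATE INDEX IX_KeyDemo_N ON KeyDemo (NKey);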
Data to store: "Sunil"
varchar(5) takes 7B
nvarchar(5) takes 12B
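You can verify the data portion of those numbers with DATALENGTH (the extra 2 bytes in the figures above are the per-value length overhead that variable-length columns carry in the row):

DECLARE @v varchar(5)  = 'Sunil',
        @n nvarchar(5) = N'Sunil';
SELECT DATALENGTH(@v) AS varchar_bytes,   -- 5
       DATALENGTH(@n) AS nvarchar_bytes;  -- 10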
I've been looking into the use of the Unicode 'N' constant within my code, for example:
select object_id(N'VW_TABLE_UPDATE_DATA', N'V');
insert into SOME_TABLE (Field1, Field2) values (N'A', N'B');
After doing some reading on when to use it, I'm still not entirely clear as to the circumstances under which it should and should not be used.
Is it as simple as using it when data types or parameters expect a Unicode data type (as per the above examples), or is it more sophisticated than that?
The following Microsoft site gives an explanation, but I'm also a little unclear as to some of the terms it is using
http://msdn.microsoft.com/en-us/library/ms179899.aspx
Or to precis:
Unicode constants are interpreted as Unicode data, and are not evaluated by using a code page. Unicode constants do have a collation. This collation primarily controls comparisons and case sensitivity. Unicode constants are assigned the default collation of the current database, unless the COLLATE clause is used to specify a collation.
What does it mean by:
'evaluated by using a code page'?
Collation?
I realise this is quite a broad question, but any links or help would be appreciated.
Thanks
Is it as simple as using it when data types or parameters expect a unicode data type?
Pretty much.
To answer your other points:
A code page is another name for an encoding of a character set. For example, Windows code page 1255 encodes Hebrew. Code pages are normally used for 8-bit encodings of characters. In terms of your question, strings may be evaluated using different code pages (so the same bit pattern may be interpreted as a Japanese character or an Arabic one, depending on which code page was used to evaluate it).
Collation is about how SQL Server is to order strings - this depends on code page, as you would order strings in different languages differently. See this article for an example.
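A small illustration of a collation controlling comparisons (case sensitivity here; the COLLATE clause overrides the database default for a single comparison):

SELECT CASE WHEN 'abc' = 'ABC' COLLATE Latin1_General_CI_AS THEN 'equal' ELSE 'different' END AS case_insensitive,
       CASE WHEN 'abc' = 'ABC' COLLATE Latin1_General_CS_AS THEN 'equal' ELSE 'different' END AS case_sensitive;
-- case_insensitive: 'equal', case_sensitive: 'different'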
National character nchar() and nvarchar() use two bytes per character and support international character sets; think internet.
The N prefix makes a string constant Unicode (two bytes per character). So if you have people from different countries and would like their names properly stored, use something like:
CREATE TABLE SomeTable (
id int
,FirstName nvarchar(50)
);
Then use:
INSERT INTO SomeTable
( Id, FirstName )
VALUES ( 1, N'Guðjón' );
and
SELECT *
FROM SomeTable
WHERE FirstName = N'Guðjón';
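For contrast, if the N prefix is left off, the literal is first interpreted in the database's code page, so characters outside that code page are lost before they ever reach the column (a sketch assuming a Latin1 default collation; Icelandic letters happen to fit in Latin1, so a Greek name is used here):

INSERT INTO SomeTable ( Id, FirstName )
VALUES ( 2, 'Γιώργος' );     -- no N prefix: stored as '???????'

SELECT * FROM SomeTable;      -- row 1 keeps 'Guðjón', row 2 shows '???????'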