What is the difference between the character set and the national character set in Oracle?
This is answered in Oracle's documentation: Choosing a Character Set
Character Set Encoding
When computer systems process characters, they use numeric codes instead of the graphical representation of the character. For example, when the database stores the letter A, it actually stores a numeric code that the computer system interprets as the letter. These numeric codes are especially important in a global environment because of the potential need to convert data between different character sets.
What is an Encoded Character Set?
You specify an encoded character set when you create a database.
Choosing a character set determines what languages can be represented in the database. It also affects:
How you create the database schema
How you develop applications that process character data
How the database works with the operating system
Database performance
Storage required for storing character data
A group of characters (for example, alphabetic characters, ideographs, symbols, punctuation marks, and control characters) can be encoded as a character set. An encoded character set assigns a unique numeric code to each character in the character set. The numeric codes are called code points or encoded values. For example, in the ASCII character set, the letter A is assigned the hexadecimal code value 0x41.
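As a minimal sketch of code points in practice, Oracle's built-in ASCII and CHR functions map between a character and its numeric code:

-- ASCII() returns the code point of a character; CHR() does the reverse
SELECT ASCII('A') AS code_point,  -- 65, i.e. 0x41
       CHR(65)    AS chr_value    -- 'A'
FROM dual;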
Choosing a National Character Set
The term national character set refers to an alternative character set that enables you to store Unicode character data in a database that does not have a Unicode database character set. Another reason for choosing a national character set is that the properties of a different character encoding scheme may be more desirable for extensive character processing operations.
SQL NCHAR, NVARCHAR2, and NCLOB data types support Unicode data only. You can use either the UTF8 or the AL16UTF16 character set. The default is AL16UTF16.
Oracle recommends using SQL CHAR, VARCHAR2, and CLOB data types in AL32UTF8 database to store Unicode character data. Use of SQL NCHAR, NVARCHAR2, and NCLOB should be considered only if you must use a database whose database character set is not AL32UTF8.
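As an illustration, a hypothetical table (the name translations is made up here) can mix columns from both character sets; the N-types hold Unicode regardless of the database character set:

CREATE TABLE translations (
    source_text VARCHAR2(200),   -- encoded in the database character set
    target_text NVARCHAR2(200)   -- encoded in the national character set (UTF8 or AL16UTF16)
);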
In Oracle you have these two character sets mainly for historical reasons. In earlier times a typical setup was
Character Set: US7ASCII
National Character Set: WE8ISO8859P1
The character set was used for the generic part of your application. For your customers in the various countries of the world, you then set the national character set according to each customer's local requirements.
Nowadays, with Unicode (i.e. AL32UTF8), there is actually no reason to use the national character set any more. More and more new native Oracle features do not even support the national character set at all.
The only remaining reason could be heavy use of Asian characters, where AL16UTF16 is more efficient in terms of space.
Related
What are NLS strings in Oracle SQL? They are described as the difference between the char and nchar data types, as well as between varchar2 and nvarchar2. Thank you
Every Oracle database instance has 2 available character set configurations:
The default character set (used by the char, varchar2, clob, etc. types)
The national character set (used by nchar, nvarchar2, nclob, etc. types)
Because the default character set may be configured as one that does not support the full range of Unicode characters (such as Windows-1252), Oracle also provides this alternate character set configuration, which is guaranteed to support Unicode.
So let's say your database uses Windows-1252 for its default character set (not that I'm recommending it), and UTF-8 for the national (or alternate) character set...
Then, if you have a table column where you don't need to support all kinds of exotic Unicode characters, you can use a type such as varchar2 if you want to. By doing so, you may be saving some space.
But if you do have a specific need to store and support Unicode characters, then for that specific case your column should be defined as nvarchar2, or some other type that uses the national character set.
That said, if your database's default character set is already a character set that supports Unicode, then using the nchar, nvarchar2, etc. types is not really necessary.
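To see which two character sets a given instance is actually configured with, you can query the standard data dictionary view:

-- NLS_CHARACTERSET is the default set; NLS_NCHAR_CHARACTERSET is the national one
SELECT parameter, value
FROM nls_database_parameters
WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');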
You can find more complete information on the topic here.
AFAIK, NLS stands for National Language Support, which supports local languages (in other words, localization). From the Oracle documentation:
National Language Support (NLS) is a technology enabling Oracle applications to interact with users in their native language, using their conventions for displaying data.
When you talk about "NLS" settings, it is not limited to the character set configuration of your database.
You also have parameters like NLS_DATE_FORMAT, NLS_CURRENCY, NLS_CALENDAR, NLS_LANGUAGE, etc.
Most of them you can set at the session level, i.e. individually for each user.
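For example, a couple of them can be changed for the current session only:

-- Session-level NLS settings override the database defaults for this user only
ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD';
ALTER SESSION SET NLS_LANGUAGE = 'FRENCH';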
As well as CHAR (CHARACTER) and VARCHAR (CHARACTER VARYING), SQL offers an NCHAR (NATIONAL CHARACTER) and NVARCHAR (NATIONAL CHARACTER VARYING) type. In some databases, this is the better datatype to use for character (non-binary) strings:
In SQL Server, NCHAR is stored as UTF-16LE and is the only way to reliably store non-ASCII characters, CHAR being a single-byte codepage only;
In Oracle, NVARCHAR may be stored as UTF-16 or UTF-8 rather than a single-byte character set;
But in MySQL, NVARCHAR is VARCHAR, so it makes no difference; either type can be stored with UTF-8 or any other character set.
So, what does NATIONAL actually conceptually mean, if anything? The vendors' docs only tell you about what character sets their own DBMSs use, rather than the actual rationale. Meanwhile the SQL92 standard explains the feature even less helpfully, stating only that NATIONAL CHARACTER is stored in an implementation-defined character set. As opposed to a mere CHARACTER, which is stored in an implementation-defined character set. Which might be a different implementation-defined character set. Or not.
Thanks, ANSI. Thansi.
Should one use NVARCHAR for all character (non-binary) storage purposes? Are there currently-popular DBMSs in which it will do something undesirable, or which just don't recognise the keyword (or N'' literals)?
"NATIONAL" in this case means characters specific to different nationalities. Far east languages especially have so many characters that one byte is not enough space to distinguish them all. So if you have an english(ascii)-only app or an english-only field, you can get away using the older CHAR and VARCHAR types, which only allow one byte per character.
That said, most of the time you should use NCHAR/NVARCHAR. Even if you don't think you need to support (or potentially support) multiple languages in your data, even english-only apps need to be able to sensibly handle security attacks using foreign-language characters.
In my opinion, about the only place where the older CHAR/VARCHAR types are still preferred is for frequently-referenced ascii-only internal codes and data on platforms like Sql Server that support the distinction — data that would be the equivalent of an enum in a client language like C++ or C#.
Meanwhile the SQL92 standard explains the feature even less helpfully, stating only that NATIONAL CHARACTER is stored in an implementation-defined character set. As opposed to a mere CHARACTER, which is stored in an implementation-defined character set. Which might be a different implementation-defined character set. Or not.
Coincidentally, this is the same "distinction" the C++ standard makes between char and wchar_t. A relic of the Dark Ages of Character Encoding, when every language/OS combination had its own character set.
Should one use NVARCHAR for all character (non-binary) storage purposes?
It is not important whether the declared type of your column is VARCHAR or NVARCHAR. But it is important to use Unicode (whether UTF-8, UTF-16, or UTF-32) for all character storage purposes.
Are there currently-popular DBMSs in which it will do something undesirable?
Yes: In MS SQL Server, using NCHAR makes your (English) data take up twice as much space. Unfortunately, UTF-8 isn't supported yet.
EDIT: SQL Server 2019 finally introduced UTF-8 support.
In Oracle, the database character set can be a multi-byte character set, so you can store all manner of characters there, but you need to understand and define the length of the columns appropriately (in either BYTES or CHARACTERS).
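A minimal sketch of those two length semantics (the table name is made up):

CREATE TABLE names_demo (
    a VARCHAR2(30 BYTE),  -- up to 30 bytes; a multi-byte character consumes several
    b VARCHAR2(30 CHAR)   -- up to 30 characters, whatever their byte length
);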
NVARCHAR gives you the option of a single-byte database character set (which reduces the potential for confusion between BYTE- and CHARACTER-sized columns) while using NVARCHAR for the multi-byte data. See here.
Since I predominantly work with English data, I'd go with a multi-byte character set (UTF-8, mostly) as the database character set and ignore NVARCHAR. If I inherited an old database which was in a single-byte character set and was too big to convert, I might use NVARCHAR. But I'd prefer not to.
One curious question: if I have a table with a column holding web links, what should the data type be, nvarchar or varchar? And what size should that data type be?
In general, use nvarchar.
What are the main performance differences between varchar and nvarchar SQL Server data types?
RFC2616 says there's no maximum length of a URL, but 2000 is probably safe.
What is the maximum length of a URL in different browsers?
You should use nvarchar, since Chinese national characters are allowed in URL names and varchar can't handle those. The maximum URL size is 2083 characters (at least in IE), but you don't see those very often. If you want to be completely sure that you can handle all URLs, you should use nvarchar(2083).
I'd say varchar(1000) would be enough (unless you're going to store some Amazon URLs, of course) :). You don't need nvarchar, because internationalized URLs are still experimental and are ultimately encoded back to ASCII (Punycode) anyway.
Typically Web servers set fairly generous limits on length for genuine URLs e.g. up to 2048 or 4096 characters.
So, if you want to be safe and still don't want to use varchar(max), you can use varchar(2048) and varchar(4096), respectively.
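As a sketch, assuming a hypothetical links table sized to the common 2048-character limit:

CREATE TABLE links (
    url varchar(2048) NOT NULL   -- covers what most web servers will accept
);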
For data with embedded URLs, you can use either varchar or nvarchar. The only difference is that nvarchar is a varchar that natively supports Unicode data. Storage is also larger: varchar uses 8 bits per character while nvarchar uses 16, so double the space.
A future-proof solution would be nvarchar, since the recent movement toward fully Unicode domain names is noticeable, e.g. Russia Begins Registering Domains in Cyrillic.
URLs are subject to RFC1738:
URLs are written only with the graphic printable characters of the US-ASCII coded character set. The octets 80-FF hexadecimal are not used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent control characters; these must be encoded.
This places all 'weblinks' safely in the VARCHAR camp. With SQL Server 2008 R2, though, you need not worry anymore, since Unicode Compression is available (on Enterprise and Datacenter editions).
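A minimal sketch, assuming a hypothetical table named links with nvarchar columns; Unicode Compression kicks in automatically once row compression is enabled:

-- Row compression implies Unicode compression for nchar/nvarchar columns
ALTER TABLE links REBUILD WITH (DATA_COMPRESSION = ROW);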
The book I am reading says that
SQL Server supports two kinds of character data types—regular and Unicode. Regular data types include CHAR and VARCHAR, and Unicode data types include NCHAR and NVARCHAR. The difference is that regular characters use one byte of storage for each character, while Unicode characters require two bytes per character. With one byte of storage per character, a choice of a regular character type for a column restricts you to only one language in addition to English because only 256 (2^8) different characters can be represented by a single byte.
What I understand from this is that if I use varchar, then I can use only one language (for example Hindi, an Indian language) along with English.
But when I run this
Create Table NameTable
(
NameColumn varchar(MAX) COLLATE Indic_General_90_CI_AS_KS
)
It shows me error "Collation 'Indic_General_90_CI_AS_KS' is supported on Unicode data types only and cannot be applied to char, varchar or text data types."
So where have I misunderstood the author?
Thanks
You can find a list of collations here, along with the encoding type
Certain collations apply only to 1-byte encodings: code points 0-127 are used for standard ASCII, which leaves only 128 values available for other characters. Hindi does not fit in 128 characters, so no 1-byte collation applies to it.
You will have to use an nvarchar (or another 'n'-prefixed character type).
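For example, the statement from the question works once the column uses a Unicode type:

Create Table NameTable
(
NameColumn nvarchar(MAX) COLLATE Indic_General_90_CI_AS_KS
)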
-- edit --
French_CI_AS, as a non-English example.
One of the things collations enable is language- and locale-specific ordering of characters. Therefore French != Latin.
Another example is Arabic_CI_AS.
This is a 1-byte encoding covering the Arabic alphabet.
Use this in your SQL statement, where "content" is a host-language variable containing the Arabic string you want to insert:
update Table set contents = convert(text, N'" + content + "' collate Arabic_CI_AS)
It works fine.
You can use this:
name = N'مرحبا كيف حالك'
I am storing first names and last names with up to 30 characters each. Which is better, varchar or nvarchar?
I have read that nvarchar takes up twice as much space compared to varchar and that nvarchar is used for internationalization.
So what do you suggest I use: nvarchar or varchar?
Also, please let me know about the performance of both. Is the performance of the two the same, or do they differ? Space is not too big an issue; the issue is performance.
Basically, nvarchar means you can handle lots of alphabets, not just regular English. Technically, it means Unicode support, not just ANSI. This means two bytes per character, or approximately twice the space. These days disk space is so cheap that you might as well use nvarchar from the beginning rather than go through the pain of having to change during the life of a product.
If you're certain you'll only ever need to support one language you could stick with varchar, otherwise I'd go with nvarchar.
This has been discussed on SO before here.
EDITED: changed ascii to ANSI as noted in comment.
First of all, to clarify, nvarchar stores Unicode data while varchar stores ANSI (8-bit) data. They function identically, but nvarchar takes up twice as much space.
Generally, I prefer storing user names using varchar datatypes unless those names contain characters that fall outside the range varchar can store.
It also depends on the database collation. For example, you will not be able to store Russian characters in a varchar field if your database collation is LATIN_CS_AS. But if you are working on a local application that will be used only in Russia, you could set the database collation to Russian. This would allow you to enter Russian characters in a varchar field, saving some space.
But nowadays most applications being developed are international, so you'd have to decide for yourself which users will be signing up, and choose the data type based on that.
I have read that nvarchar takes up twice as much space as varchar.
Yes.
nvarchar is used for internationalization.
Yes.
What do you suggest I use: nvarchar or varchar?
It depends upon the application.
By default go with nvarchar. There is very little reason to go with varchar these days, and every reason to go with nvarchar (allows international characters; as discussed).
varchar is 1 byte per character, nvarchar is 2 bytes per character.
You will use more space with nvarchar but there are many more allowable characters. The extra space is negligible, but you may miss those extra characters in the future. Even if you don't expect to require internationalization, people will often have non-English characters (e.g. é, ñ or ö) in their names.
I would suggest you use nvarchar.
I have read that nvarchar takes up twice as much space as varchar
Yes. According to Microsoft: "Storage size, in bytes, is two times the number of characters entered + 2 bytes" (http://msdn.microsoft.com/en-us/library/ms186939(SQL.90).aspx).
But storage is cheap; I never worry about a few extra bytes.
Also, save yourself trouble in the future and set the maximum widths to something more generous, like 100 characters. There is absolutely no storage overhead to this when you're using varchar or nvarchar (as opposed to char/nchar). You never know when you're going to encounter a triple-barrelled surname or some long foreign name which exceeds 30 characters.
nvarchar is used for internationalization.
nvarchar can store any unicode character, such as characters from non-Latin scripts (Arabic, Chinese, etc). I'm not sure how your application will be taking data (via the web, via a GUI toolkit, etc) but it's likely that whatever technology you're using supports unicode out of the box. That means that for any user-entered data (such as name) there is always the possibility of receiving non-Latin characters, if not now then in the future.
If I was building a new application, I would use nvarchar. Call it "future-proofing" if you like.
The nvarchar type is Unicode, so it can handle just about any character that exist in every language on the planet. The characters are stored as UTF-16 or UCS-2 (not sure which, and the differences are subtle), so each character uses two bytes.
The varchar type uses an 8-bit character set, so it's limited to the 256 characters of the character set that you choose for the field. There are different character sets that handle different character groups, so it's usually sufficient for text local to a country or a region.
If varchar works for what you want to do, you should use that. It's a bit less data, so it's overall slightly faster. If you need to handle a wide variety of characters, use nvarchar.
On performance:
A reason to use varchar over nvarchar is that you can have twice as many characters in your indexes! Index keys are limited to 900 bytes (see the sketch after this list).
On usability:
If the application is only ever intended for an English audience and contains English names, use varchar.
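A minimal sketch of that index limit (table and index names are made up, assuming SQL Server's traditional 900-byte key cap):

CREATE TABLE codes (
    a varchar(900),   -- 900 bytes: fits the key limit exactly
    b nvarchar(900)   -- up to 1800 bytes: can exceed it
);
CREATE INDEX ix_a ON codes (a);  -- fine
CREATE INDEX ix_b ON codes (b);  -- created with a warning; rows whose key exceeds 900 bytes will fail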
Data to store: "Sunil"
varchar(5) takes 7 bytes (5 × 1 byte, plus 2 bytes of variable-length overhead)
nvarchar(5) takes 12 bytes (5 × 2 bytes, plus 2 bytes of variable-length overhead)
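You can verify the per-character sizes yourself with DATALENGTH (which reports the data bytes only, excluding the 2 bytes of row overhead):

-- DATALENGTH returns the number of bytes used by an expression
SELECT DATALENGTH('Sunil')  AS varchar_bytes,   -- 5
       DATALENGTH(N'Sunil') AS nvarchar_bytes;  -- 10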