Is there a database that accepts special characters by default (without converting them)? - sql

I am currently starting from scratch choosing a database to store data collected from a suite of web forms. Humans will be filling out these forms, and as they're susceptible to using international characters, especially those humans named José and François and أسامة and 布鲁斯, I wanted to start with a modern database platform that accepts all types (so to speak), without conversion.
Q: Does a database exist that, from the start, accepts a wide diversity of the characters found in modern typefaces? If so, what are the drawbacks to a database that doesn't need to convert as much data in order to store it?
// Anticipating two answers that I'm not looking for:
I found many answers about how someone could CONVERT (or encode) a special character, like é or a copyright symbol ©, into a database-legal sequence like &copy; (for ©) so that a database can then accept it. This requires a conversion/translation layer to shuttle data into and out of the database. I know that has to happen at some level, the way the letter z is ultimately reducible to 1's and 0's, but I'm really talking about finding a human-readable database, one that doesn't need to translate.
I also see suggestions that people change the character encoding of their current database to one that accepts a wider range of characters. This is a good solution for someone who is carrying over a legacy system and wants to make it relevant to the wider range of characters that early computers, and the early web, didn't anticipate. I'm not starting with a legacy system. I'm looking for some modern database options.

Yes, there are databases that support large character sets. How to accomplish this is different from one database to another. For example:
In MS SQL Server you can use the nchar, nvarchar and ntext data types to store Unicode (UCS-2) text.
In MySQL you can choose UTF-8 (ideally the utf8mb4 character set, which covers the full Unicode range) as the encoding for a table, so that it will be able to store Unicode text.
For any database that you consider using, you should look at its Unicode support to see if it can handle large character sets.
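As a hedged sketch of both approaches (table and column names here are made up for illustration):

-- SQL Server: the N-prefixed types store Unicode text.
CREATE TABLE Person (
    PersonID INT PRIMARY KEY,
    FullName NVARCHAR(100)    -- 'José', 'François', 'أسامة' and '布鲁斯' all fit
);

-- MySQL: pick a Unicode encoding for the table (utf8mb4 covers the full Unicode range).
CREATE TABLE person (
    person_id INT PRIMARY KEY,
    full_name VARCHAR(100)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;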

Related

Creating a table in NexusDB with German umlauts?

I'm trying to import a CREATE TABLE statement in NexusDB.
The table name contains some German umlauts, and so do some field names, but I receive an error that there were invalid characters in my statement (obviously the umlauts...).
My question is now: can somebody give a solution or any ideas to solve my problem?
It's not so easy to just change the umlauts into equivalent spellings like ä -> ae or ö -> oe, since our application has fixed table names that every customer currently uses.
It is not a good idea to use characters outside what is normally permitted in the SQL standard. This will bite you not only in NexusDB, but in many other databases as well. Take special note that there is a good chance you will also run into problems when you want to access data via ODBC etc, as other environments may also have similar standard restrictions. My strong recommendation would be to avoid use of characters outside the SQL naming standard for tables, no matter which database is used.
However... having said all that, given that NexusDB is one of the most flexible database systems for the programmer (it comes with full source), there is already a solution. If you add an "extendedliterals" define to your database server project, a larger set of characters is considered valid. For the exact change this enables, see the nxcValidIdentChars constant in the nxllConst.pas unit. The constant may also be changed if required.
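As a hedged sketch of the safer route described first (the identifier and column names below are invented for illustration, not taken from the original schema), keep identifiers within the standard SQL character set and carry the German label as data:

-- Instead of an identifier such as "Kundenaufträge" (which trips the parser),
-- use an ASCII name and store the display label in a column:
CREATE TABLE Kundenauftraege (
    id          INTEGER PRIMARY KEY,
    anzeigename VARCHAR(100)    -- e.g. 'Kundenaufträge', shown in the UI
);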

Inserting / Creating SQL Database Values [duplicate]

This question already has answers here:
A beginner's guide to SQL database design [closed]
First Name
Phone Number
Email
Address
For the above values, what would be the
Type
Length
Collation
Index
And why? Also, is there a guide somewhere that I can use to determine these answers for myself? Thanks!
For names, should I use varchar / text / tinytext / blob? What is the typical name length?
If you're only going to support "normal" Western European / English names, then a (non-Unicode) varchar type should do.
If you need to support Arabic, Hebrew, Japanese, Chinese, Korean or other Asian languages, then pick a Unicode string type to store those characters. Those typically use 2 bytes per character, but they're the only viable options if you need to support non-European languages and character sets.
As for length: pick a reasonable value, but don't use varchar(67), varchar(91), varchar(55) and so forth - try to settle on a few "default" lengths, like varchar(20) (for things like a phone number or a zip code), varchar(50) for a first name, and maybe varchar(100) for a last name / city name etc. Try to pick a few lengths and use those throughout.
E-mail addresses have a maximum length of 255 characters, as defined in an RFC (strictly, RFC 5321 works out to 254)
Windows file system paths (file names including path) have a Windows limitation of 260 characters
Use such knowledge to "tune" your string lengths. I would advise against just using blob type / TEXT / VARCHAR(MAX) for everything - those types are intended for really long text - use them sparingly, they're often accompanied by less than ideal access mechanisms and thus performance drawbacks.
Indexes: in general, don't over-index your tables - most often devs tend to have too many indexes on their tables, not fully understanding if and how those will be used (or not used). Every single index causes maintenance overhead when inserting, updating and deleting data - indexes aren't free, use them only if you really know what you're doing and see an overall performance benefit (of the whole system) when adding one.
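Putting the advice above together, here is a minimal sketch (SQL Server-style syntax; the lengths and the single index are illustrative assumptions, not requirements):

CREATE TABLE Contact (
    ContactID   INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate key
    FirstName   NVARCHAR(50)  NOT NULL,          -- Unicode, handles non-Latin names
    PhoneNumber VARCHAR(20)   NULL,              -- digits, '+', '-' and spaces
    Email       VARCHAR(255)  NULL,              -- see the RFC note above
    Address     NVARCHAR(100) NULL               -- single address line
);

-- Index only what you actually search on, e.g. lookups by e-mail address:
CREATE INDEX IX_Contact_Email ON Contact (Email);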
This is somewhat dependent on the database platform and the underlying requirements. These are typically all nvarchar or varchar types. You know your requirements best, so you are in the best position to pick appropriate lengths. For collation you can usually use the default, but again it depends on your situation. I suggest reading up on indexes to decide which indexes you need.
I agree with the others: you should do some reading on beginning SQL. I also suggest just implementing something; you can always change it after the fact. You'll learn a lot from just trying it and running into issues that you then need to solve.
Good luck!

Western European character set to Turkish in SQL

I am having a serious issue with character encoding. To give some background:
I have Turkish business users who enter some data on Unix screens in the Turkish language.
My database NLS parameter is set to AMERICAN, WE8ISO8859P1 and Unix NLS_LANG to AMERICAN_AMERICA.WE8ISO8859P1.
The Turkish business users are able to see all the Turkish characters on the UNIX screens and in TOAD, while I'm not. I can only see them in the Western European character set.
At business end: ÖZER İNŞAAT TAAHHÜT VE
At our end : ÖZER ÝNÞAAT TAAHHÜT VE
If you notice, the Turkish characters İ and Ş are getting converted to the ISO 8859-1 character set. However, all the settings (NLS parameters in the DB and on Unix) are the same at both ends: ISO 8859-1 (Western European).
With some study, I can understand that Turkish machines can display Turkish data by doing conversion in real time (the DB NLS settings are overridden by local NLS settings).
Now, I have an interface running in my DB: some PL/SQL scripts (run through a shell script) extract some data from the database and spool it to a .csv file on a Unix path. Then that .csv file is transferred to an external system via MFT (Managed File Transfer).
The problem is that the extract never contains any Turkish characters. Every Turkish character is getting converted into the Western European character set and goes like that to the external system, which is treated as a case of data conversion/loss, and my business is really unhappy.
Could anyone tell me - How could I retain all the turkish characters?
P.S.: The external system's character set could be set to the ISO 8859-9 character set.
Many thanks in advance.
If you are saying that your database character set is ISO-8859-1, i.e.
SELECT parameter, value
FROM v$nls_parameters
WHERE parameter = 'NLS_CHARACTERSET'
returns a value of WE8ISO8859P1 and you are storing the data in CHAR, VARCHAR, or VARCHAR2 columns, the problem is that the database character set does not support the full set of Turkish characters. If a character is not in the ISO-8859-1 codepage layout, it cannot be stored properly in database columns governed by the database character set. If you want to store Turkish data in an ISO-8859-1 database, you could potentially use the workaround characters instead (i.e. substituting S for Ş). If you want to support the full range of Turkish characters, however, you would need to move to a character set that supported all those characters-- either ISO-8859-9 or UTF-8 would be relatively common.
Changing the character set of your existing database is a non-trivial undertaking, however. There is a chapter in the Globalization Support Guide for whatever version of Oracle you are using that covers character set migration. If you want to move to a Unicode character set (which is generally the preferred approach rather than sticking with one of the single-byte ISO character sets), you can potentially leverage the Oracle Database Migration Assistant for Unicode.
At this point, you'll commonly see the objection that at least some applications are seeing the data "correctly", so the database must support the Turkish characters. The problem is that if you set up your NLS_LANG incorrectly, it is possible to bypass character set conversion entirely, meaning that whatever binary representation a character has on the client gets persisted without modification to the database. As long as every process that reads the data configures its NLS_LANG identically and incorrectly, things may appear to work.
However, you will very quickly find some other application that won't be able to configure its NLS_LANG identically incorrectly. A Java application, for example, will always want to convert the data from the database into a Unicode string internally. So if you're storing the data incorrectly in the database, as it sounds like you are, there is no way to get those applications to read it correctly.
If you are simply using SQL*Plus in a shell script to generate the file, it is almost certainly possible to get your client configured incorrectly so that the data file appears to be correct. But it would be a very bad idea to let the existing misconfiguration persist. You open yourself up to much bigger problems in the future (if you're not already there): different clients inserting data in different character sets into the database, making it much harder to disentangle; tools like the Oracle export utility corrupting the data that is exported; or wanting to use a tool that can't be configured incorrectly to view the data. You're much better served getting the problem corrected early.
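As a hedged diagnostic sketch (Oracle; the table and column names are made up), DUMP() shows the bytes actually stored, independent of any client-side NLS_LANG conversion, so it tells you whether the data itself is corrupted or merely displayed wrongly:

-- 1016 = hexadecimal output plus the character set name of the stored value.
SELECT company_name,
       DUMP(company_name, 1016) AS stored_bytes
  FROM companies
 WHERE company_name LIKE '%TAAHH%';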
Just setting your NLS_LANG parameter to AMERICAN_AMERICA.WE8ISO8859P9 is enough for the Turkish language.

Which characters count as the same character under a UTF-8 Unicode collation? And what VB.net function can be used to merge them?

Also, what's the VB.net function that will map all those different characters onto their most standard form?
For example, ToLower would map A and a to the same character, right?
I need the same function for these characters
German
ß === s
Ü === u
Χιοσ == Χίος
Otherwise, sometimes I insert Χιοσ and later, when I insert Χίος, MySQL complains that the ID already exists.
So I want to create a unique ID that maps all those strange characters into a more stable one.
For the encoding aspect of the thing, look at String.Normalize. Notice also its overload that specifies a particular normal form to which you want to convert the string, but the default normal form (C) will work just fine for nearly everyone who wants to "map all those different characters into their most standard form".
However, things get more complicated once you move into the database and deal with collations.
Unicode normalization does not ever change the character case. It covers only cases where the characters are basically equivalent - look the same [1], mean the same thing. For example,
Χιοσ != Χίος:
The two sigma characters are considered non-equivalent, and the accented iota (\u1F30) is equivalent to a sequence of two characters, the plain iota (\u03B9) and the accent (\u0313).
Your real problem seems to be that you are using Unicode strings as primary keys, which is not the most popular database design practice. Such primary keys take up more space than needed and are bound to change over time (even if the initial version of the application does not plan to support that). Oh, and I forgot their sensitivity to collations. Instead of identifying records by Unicode strings, have the database schema generate meaningless sequential integers for you as you insert the records, and demote the Unicode strings to mere attributes of the records. This way they can be the same or different as you please.
It may still be useful to normalize them before storing for the purpose of searching and safer subsequent processing; but the particular case insensitive collation that you use will no longer restrict you in any way.
[1] Almost the same in case of compatibility normalization as opposed to canonical normalization.
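As a hedged sketch of the surrogate-key design suggested above (MySQL syntax; the table, column and collation choices are illustrative assumptions):

CREATE TABLE places (
    id   INT AUTO_INCREMENT PRIMARY KEY,   -- meaningless surrogate key
    name VARCHAR(100)
         CHARACTER SET utf8mb4
         COLLATE utf8mb4_unicode_ci        -- equality rules live in the collation
);

-- Both spellings can be stored; whether they compare as equal in lookups
-- is decided by the column's collation, not by the primary key.
INSERT INTO places (name) VALUES ('Χιοσ'), ('Χίος');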

How do you know when to use varchar and when to use text in sql?

It seems like a very arbitrary decision.
Both can accomplish the same thing in most cases.
Limiting the varchar length seems to me like shooting yourself in the foot, because you never know how long a field you will need.
Is there any specific guideline for choosing VARCHAR or TEXT for your string fields?
I will be using PostgreSQL with the SQLAlchemy ORM framework for Python.
In PostgreSQL there is no technical difference between varchar and text.
You can see a varchar(nnn) as a text column with a check constraint that prohibits storing larger values.
So each time you want to have a length constraint, use varchar(nnn).
If you don't want to restrict the length of the data, use text.
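A hedged illustration of that equivalence (PostgreSQL; the table name is made up):

-- These two columns behave the same, except that the first enforces its
-- length limit as part of the type and the second via an explicit check.
CREATE TABLE demo (
    a varchar(20),
    b text CHECK (char_length(b) <= 20)
);

INSERT INTO demo (a, b) VALUES ('short', 'short');   -- accepted
-- INSERT INTO demo (a) VALUES (repeat('x', 21));    -- rejected: value too long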
This sentence is wrong:
Limiting the varchar length seems to me like shooting yourself in the foot, because you never know how long a field you will need.
If you are saving, for example, MD5 hashes, you do know how large the field you're storing is, and your storage becomes more efficient (see the sketch after this list). Other examples are:
Usernames (64 max)
Passwords (128 max)
Zip codes
Addresses
Tags
Many more!
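For instance, a minimal sketch of the MD5 case mentioned above (PostgreSQL syntax; the table and column names are assumptions): an MD5 digest rendered as hex is always exactly 32 characters, so the length is known up front.

CREATE TABLE file_checksums (
    id       SERIAL PRIMARY KEY,
    md5_hash CHAR(32) NOT NULL   -- e.g. md5('') = 'd41d8cd98f00b204e9800998ecf8427e'
);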
In brief:
Variable length fields save space, but because each field can have a different length, they make table operations slower
Fixed length fields make table operations fast, although they must be large enough for the maximum expected input, so they can use more space
Think of an analogy to arrays and linked lists, where arrays are fixed length fields, and linked lists are like varchars. Which is better, arrays or linked lists? Lucky we have both, because they are both useful in different situations, so too here.
In most cases you do know what the max length of a string in a field is. In the case of a first or last name, you don't need more than 255 characters, for example. So by design you choose which type to use; if you always use text, you're wasting resources.
Check this article on PostgresOnline; it also links to two other useful articles.
Most problems with TEXT in PostgreSQL occur when you're using tools, applications and drivers that treat TEXT very differently from VARCHAR, because other databases behave very differently with these two data types.
Database designers almost always know how many characters a column needs to hold. US delivery addresses need to hold up to 64 characters. (The US Postal Service publishes addressing guidelines that say so.) US ZIP codes are 5 characters long.
A database designer will look at representative sample data from her clients when she's specifying columns. She'll ask herself questions like "What's the longest product name?" And when the answer is "70 characters", she won't make the column 3000 characters wide.
VARCHAR has a limit of 8,000 characters in SQL Server (unless you use VARCHAR(MAX)). Most applications don't require nearly that much storage for a single column.