I am going to be building an application that will be used by people all over Europe. I need to know which collation and character set would be best suited for user-entered data. Or should I make a separate table for each language? A link to an article explaining this would be great.
Thanks :)
Character set: without doubt, UTF-8. Collation: I'm not sure there is a good answer to that, but you might want to read this report.
Unicode is a very large character set including nearly all characters from nearly all languages.
There are a number of ways to store Unicode text as a sequence of bytes - these ways are called encodings. All Unicode encodings (well, all complete Unicode encodings) can store all Unicode text as a sequence of bytes, in some format - but the number of bytes that any given piece of text takes will depend on the encoding used.
UTF-8 is a Unicode encoding that is optimized for English and other languages which use very few characters outside the Latin alphabet. UTF-16 is a Unicode encoding which is possibly more appropriate for text in a variety of European languages. Java and .NET store all text in-memory (the String class) as UTF-16 encoded Unicode.
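To make the size trade-off concrete, here is a quick Python sketch (the sample words are arbitrary):

# Byte counts for the same text in different Unicode encodings.
for text in ("hello", "grüß", "здравствуйте"):
    print(
        f"{text!r}: "
        f"utf-8={len(text.encode('utf-8'))} bytes, "
        f"utf-16={len(text.encode('utf-16-le'))} bytes"
    )
# 'hello':        utf-8=5 bytes,  utf-16=10 bytes
# 'grüß':         utf-8=6 bytes,  utf-16=8 bytes
# 'здравствуйте': utf-8=24 bytes, utf-16=24 bytes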
I have encountered an interesting problem with character encodings in PostgreSQL. The problem occurs when searching a table with the ILIKE operator and non-ASCII characters ('ä', 'ö', ...) in the search term. The search results seem to differ depending on how the search term is encoded (these non-ASCII characters can be represented in more than one way).
I know for a fact that some tables contain (within 'character varying' columns) non-ASCII characters in more than one representation. For example, the 'ä' character is sometimes present as the precomposed UTF-8 form 'C3 A4', but sometimes like this: '61 cc 88', which is the decomposed form, an 'a' followed by the combining diaeresis U+0308 (so still valid UTF-8, just a different normalization form, rather than Latin-1 as I first suspected). This was probably caused by importing some legacy data that was not normalized the same way. It is not a problem most of the time, as both forms are presented correctly in the web application UI anyway; only the search is the issue. I.e., I cannot figure out a way to make the search find all the relevant entries, as the results vary depending on the normalization of the search term (basically, the search misses the legacy data because its non-ASCII characters are stored in the other form).
Here are the facts on the application:
PostgreSQL 11.5 (via Amazon RDS)
Npgsql v4.1.2
.NET Core 3
React frontend
database encoding UTF8, collation en_US.UTF-8, character type en_US.UTF-8
Any ideas what to do?
Write a batch job to convert all existing data to one canonical form (e.g. NFC), update the database, and somehow guard further inserts so that nothing but that form gets in? (A sketch of this is below, after the edit.)
Make PostgreSQL's ILIKE ignore the difference in representation?
Edit: I found a good page on Unicode normalization in PostgreSQL: https://www.2ndquadrant.com/en/blog/unicode-normalization-in-postgresql-13/
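For what it's worth: if the mixed representations really are just different Unicode normalization forms (precomposed NFC vs. decomposed NFD, as the 'C3 A4' vs. '61 cc 88' example suggests), the batch job of option 1 could look something like this Python sketch, using the standard unicodedata module (the helper name is mine):

import unicodedata

# NFC ("composed") is the usual canonical form: 'ä' is the single code
# point U+00E4 (UTF-8 bytes C3 A4); NFD decomposes it into 'a' plus the
# combining diaeresis U+0308 (bytes 61 CC 88).
nfd = "a\u0308"                        # decomposed 'ä'
nfc = unicodedata.normalize("NFC", nfd)
assert nfc == "\u00e4"                 # the precomposed 'ä'
assert nfd != nfc                      # equal to the eye, unequal to ILIKE

def to_nfc(value):
    """Batch-job helper: normalize one column value to NFC."""
    return value if value is None else unicodedata.normalize("NFC", value)

On PostgreSQL 13 and later, the normalize() function described in the linked article can do the same server-side; the search term needs the same treatment before comparison.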
I'm using an SSIS package to extract data from a Postgres database, but I'm getting the following error on one of the tables.
Character with byte sequence 0xef 0xbf 0xbd in encoding "UTF8" has no equivalent in encoding "WIN1252"
I have no idea how to resolve it. I changed all the columns in the SQL table to NVARCHAR(MAX), but still no use. Please provide a solution.
The full Unicode character set (as encoded in UTF-8) contains well over a hundred thousand different characters. WIN1252 contains at most 256. Your data contains characters that cannot be represented in WIN1252.
You either need to export to a more capable character encoding, remove the "awkward" characters from the source database, or do some (lossy) translation with SSIS itself (I believe "Character Map transformation" is what you want to search for).
I would recommend first, though, spending an hour or so reading up on Unicode, its UTF encodings, and their relationship to the ISO and Windows character sets. That way you will understand which of the above to choose.
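Incidentally, the byte sequence in the error, 0xEF 0xBF 0xBD, is the UTF-8 encoding of U+FFFD, the Unicode replacement character, which usually means the data was already mangled by an earlier bad conversion; it genuinely has no WIN1252 equivalent. A small Python sketch of the lossy-translation idea (illustrating the principle, not SSIS itself):

text = "price \ufffd 100"    # contains U+FFFD, the replacement character

# A strict conversion fails, just like the SSIS export does:
try:
    text.encode("cp1252")
except UnicodeEncodeError as exc:
    print(exc)    # 'charmap' codec can't encode character '\ufffd' ...

# A lossy translation substitutes '?' for anything WIN1252 cannot hold:
print(text.encode("cp1252", errors="replace"))    # b'price ? 100'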
I have a postgresql database I would like to convert to UTF-8.
The problem is that it is currently SQL_ASCII, so hasn't been doing any kind of encoding conversion on its input, and as such has ended up with data of a mix of encoding types in the tables. One row might contain values encoded as UTF-8, another might be ISO-8859-x, or Windows-125x, etc.
This has made performing a dump of the database, and converting it to UTF-8 with the intention of importing it into a fresh UTF-8 database, difficult. If the data were all of one encoding type, I could just run the dump file through iconv, but I don't think that approach works here.
Is the problem fundamentally down to knowing how each piece of data is encoded? Where that is not known, can it be worked out, or even guessed? Ideally I'd love a script which would take a file, any file, and spit out valid UTF-8.
This is exactly the problem that Encoding::FixLatin was written to solve*.
If you install the Perl module then you'll also get the fix_latin command-line utility which you can use like this:
pg_restore -O dump_file | fix_latin | psql -d database
Read the 'Limitations' section of the documentation to understand how it works.
[*] Note I'm assuming that when you say ISO-8859-x you mean ISO-8859-1, and when you say Windows-125x you mean Windows-1252, because the mix of ASCII, UTF-8, Latin-1 and WinLatin-1 is a common case. But if you really do have a mixture of eastern and western encodings then, sorry, you're screwed :-(
It is impossible without some knowledge of the data first. Do you know if it is a text message or people's names or places? In some particular language?
You can try to decode a line of the dump with each candidate encoding and apply some heuristic; for example, run an automatic spell checker and choose the encoding that generates the fewest errors, the most known words, etc.
You can use, for example, aspell list -l en (en for English, pl for Polish, fr for French, etc.) to get a list of misspelled words. Then you can choose the encoding which generates the fewest of them. You'd need to install the corresponding dictionary package, for example "aspell-en" on my Fedora 13 Linux system.
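A rough Python sketch of that heuristic, assuming aspell and the relevant dictionary are installed (the candidate list is illustrative, and the aspell invocation follows the command above):

import subprocess

CANDIDATES = ["utf-8", "iso-8859-1", "cp1252"]

def misspelled_words(text, lang="en"):
    """Count the words aspell flags as unknown; fewer means a better decode."""
    result = subprocess.run(
        ["aspell", "list", "-l", lang],
        input=text.encode("utf-8"),
        capture_output=True,
    )
    return len(result.stdout.splitlines())

def guess_encoding(raw, lang="en"):
    scores = {}
    for enc in CANDIDATES:
        try:
            scores[enc] = misspelled_words(raw.decode(enc), lang)
        except UnicodeDecodeError:
            continue    # this encoding cannot even decode the bytes
    return min(scores, key=scores.get)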
I've seen exactly this problem myself, actually. The short answer: there's no straightforward algorithm. But there is some hope.
First, in my experience, the data tends to be:
99% ASCII
0.9% UTF-8
0.1% other, 75% of which is Windows-1252.
So let's use that. You'll want to analyze your own dataset, to see if it follows this pattern. (I am in America, so this is typical. I imagine a DB containing data based in Europe might not be so lucky, and something further east even less so.)
First, almost every encoding out there today contains ASCII as a subset: UTF-8 does, ISO-8859-1 does, etc. Thus, if a field contains only octets within the range [0, 0x7F] (i.e., ASCII characters), then it reads the same in ASCII/UTF-8/ISO-8859-1/etc. If you're dealing with American English, this will probably take care of 99% of your data.
On to what's left.
UTF-8 has some nice properties: a character is either a single ASCII byte, or a multi-byte sequence in which every byte after the first matches 10xxxxxx in binary. So: attempt to run your remaining fields through a UTF-8 decoder (one that will choke if you give it garbage). In my experience, the fields it doesn't choke on are probably valid UTF-8. (It is possible to get a false positive here: a tricky ISO-8859-1 field might also be valid UTF-8.)
Last, if it's not ASCII, and it doesn't decode as UTF-8, Windows-1252 seems to be the next good choice to try. Almost everything is valid Windows-1252 though, so it's hard to get failures here.
You might do this:
Attempt to decode as ASCII. If successful, assume ASCII.
Attempt to decode as UTF-8. If successful, assume UTF-8.
Attempt to decode as Windows-1252. This almost never fails, so it serves as the fallback.
For the UTF-8 and Windows-1252 guesses, output the table's PK and the decoded text to a text file (converting the Windows-1252 to UTF-8 before outputting). Have a human look it over and see if anything seems out of place. If there's not too much non-ASCII data (and like I said, ASCII tends to dominate if you're in America...), then a human could look over the whole thing.
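A minimal Python sketch of that cascade (the Latin-1 fallback at the end is my addition, covering the few byte values that even Windows-1252 leaves unassigned):

def guess_decode(raw):
    """Try ASCII, then UTF-8, then Windows-1252; return (encoding, text)."""
    for encoding in ("ascii", "utf-8", "cp1252"):
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # cp1252 leaves a few bytes unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D),
    # so fall back to Latin-1, which maps all 256 byte values.
    return "iso-8859-1", raw.decode("iso-8859-1")

encoding, text = guess_decode(b"caf\xc3\xa9")    # valid UTF-8
assert (encoding, text) == ("utf-8", "café")
encoding, text = guess_decode(b"caf\xe9")        # not UTF-8; Windows-1252 guess
assert (encoding, text) == ("cp1252", "café")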
Also, if you have some idea about what your data looks like, you could restrict decodings to certain characters. For example, if a field decodes as valid UTF-8 text, but contains a "©", and the field is a person's name, then it was probably a false positive, and should be looked at more closely.
Lastly, be aware that when you change to a UTF-8 database, whatever has been inserting this garbage data in the past is probably still there: you'll need to track down this system and teach it character encoding.
I resolved it using these commands:
1) Export
pg_dump --username=postgres --encoding=ISO88591 database -f database.sql
and then
2) Import
psql -U postgres -d database < database.sql
These commands helped me solve the problem of converting SQL_ASCII to UTF-8.
As well as CHAR (CHARACTER) and VARCHAR (CHARACTER VARYING), SQL offers an NCHAR (NATIONAL CHARACTER) and NVARCHAR (NATIONAL CHARACTER VARYING) type. In some databases, this is the better datatype to use for character (non-binary) strings:
In SQL Server, NCHAR is stored as UTF-16LE and is the only way to reliably store non-ASCII characters, CHAR being limited to a single-byte code page;
In Oracle, NVARCHAR may be stored as UTF-16 or UTF-8 rather than a single-byte character set;
But in MySQL, NVARCHAR is simply VARCHAR, so it makes no difference; either type can be stored with UTF-8 or any other character set.
So, what does NATIONAL actually conceptually mean, if anything? The vendors' docs only tell you about what character sets their own DBMSs use, rather than the actual rationale. Meanwhile the SQL92 standard explains the feature even less helpfully, stating only that NATIONAL CHARACTER is stored in an implementation-defined character set. As opposed to a mere CHARACTER, which is stored in an implementation-defined character set. Which might be a different implementation-defined character set. Or not.
Thanks, ANSI. Thansi.
Should one use NVARCHAR for all character (non-binary) storage purposes? Are there currently-popular DBMSs in which it will do something undesirable, or which just don't recognise the keyword (or N'' literals)?
"NATIONAL" in this case means characters specific to different nationalities. Far east languages especially have so many characters that one byte is not enough space to distinguish them all. So if you have an english(ascii)-only app or an english-only field, you can get away using the older CHAR and VARCHAR types, which only allow one byte per character.
That said, most of the time you should use NCHAR/NVARCHAR. Even if you don't think you need to support (or potentially support) multiple languages in your data, even English-only apps need to be able to sensibly handle security attacks that use foreign-language characters.
In my opinion, about the only place where the older CHAR/VARCHAR types are still preferred is for frequently referenced, ASCII-only internal codes and data on platforms like SQL Server that support the distinction; data that would be the equivalent of an enum in a client language like C++ or C#.
Meanwhile the SQL92 standard explains the feature even less helpfully, stating only that NATIONAL CHARACTER is stored in an implementation-defined character set. As opposed to a mere CHARACTER, which is stored in an implementation-defined character set. Which might be a different implementation-defined character set. Or not.
Coincidentally, this is the same "distinction" the C++ standard makes between char and wchar_t: a relic of the Dark Ages of character encoding, when every language/OS combination had its own character set.
Should one use NVARCHAR for all character (non-binary) storage purposes?
It is not important whether the declared type of your column is VARCHAR or NVARCHAR. But it is important to use Unicode (whether UTF-8, UTF-16, or UTF-32) for all character storage purposes.
Are there currently-popular DBMSs in which it will do something undesirable?
Yes: In MS SQL Server, using NCHAR makes your (English) data take up twice as much space. Unfortunately, UTF-8 isn't supported yet.
EDIT: SQL Server 2019 finally introduced UTF-8 support.
In Oracle, the database character set can be a multi-byte character set, so you can store all manner of characters in there... but you need to understand and define the length of the columns appropriately (in either BYTES or CHARACTERS).
NVARCHAR gives you the option of keeping a single-byte database character set (which reduces the potential for confusion between BYTE- and CHARACTER-sized columns) and using NVARCHAR as the multi-byte type.
Since I predominantly work with English data, I'd go with a multi-byte character set (UTF-8 mostly) as the database character set and ignore NVARCHAR. If I inherited an old database which was in a single-byte character set and was too big to convert, I might use NVARCHAR. But I'd prefer not to.
We are migrating our Sybase database, which has utf8 encoding, to a server with iso_1 (ISO-8859-1) encoding. We are only using char and varchar for our strings. Will a backup and restore truncate any strings? I was thinking that char and varchar hold just single-byte characters.
Any characters outside the ASCII range will likely get mangled or corrupted when you store UTF-8 data as ISO-8859-1, because UTF-8 stores characters outside the ASCII range as multiple bytes. I would rather set up the target server to use UTF-8 encoding, since that is the encoding of today and the future.
You CANNOT migrate Unicode data to ISO-8859-1 data; well over 99.5% of Unicode characters cannot be represented in it. If you happen to have only Latin-1 characters in your data, then the migration is a no-op; otherwise it is undefined whether your migration tool will choke, report success but corrupt your data, preserve what it can and insert wrong characters for the impossible-to-represent ones, or silently omit some characters...
Remember that Unicode contains tens of thousands of characters, while ISO-8859-1 contains only 256. What you are trying to do can have many outcomes, but "everything works correctly" is not one of them.
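If you have to go through with the migration anyway, it is worth auditing beforehand which values will survive. A quick Python sketch of such a check (how you iterate over your rows is up to you):

def representable_in_iso_8859_1(value):
    """True if every character of the string exists in ISO-8859-1."""
    try:
        value.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

assert representable_in_iso_8859_1("déjà vu")     # Latin-1 covers this
assert not representable_in_iso_8859_1("łódź")    # ł and ź are not in Latin-1
assert not representable_in_iso_8859_1("€")       # euro sign is not in ISO-8859-1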