SSIS error "UTF8" has no equivalant in encoding "WIN1252" - sql

I'm using an SSIS package to extract data from a Postgres database, but I'm getting the following error for one of the tables.
Character with byte sequence 0xef 0xbf 0xbd in encoding "UTF8" has no
equivalent in encoding "WIN1252"
I have no idea how to resolve it. I changed all the columns in the SQL Server table to NVARCHAR(MAX), but it made no difference. Please provide a solution.

The full Unicode character set (as encoded in UTF-8) contains tens of thousands of different characters. WIN1252 contains 256. Your data contains characters that cannot be represented in WIN1252.
You either need to export to a more useful character encoding, remove the "awkward" characters from the source database, or do some (lossy) translation within SSIS itself (I believe "character map translation" is what you want to search for).
I would recommend first spending an hour or so googling around the subject of Unicode, its UTF encodings and their relationship to the ISO and WIN character sets. That way you will understand which of the above to choose.
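As a starting point for finding the offending rows: the byte sequence 0xef 0xbf 0xbd in the error is the UTF-8 encoding of U+FFFD, the replacement character, so a query along these lines can locate them on the Postgres side. This is only a minimal sketch; the table and column names are placeholders:

    -- chr(65533) is U+FFFD REPLACEMENT CHARACTER in a UTF8 database
    SELECT id, my_column
    FROM my_table
    WHERE my_column LIKE '%' || chr(65533) || '%';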

Related

Postgresql searching with non-ascii character using ILIKE

I have encountered an interesting problem with character encodings in PostgreSQL. The problem occurs when searching a table with the ILIKE operator and using non-ASCII characters ('ä', 'ö', ...) in the search term. It seems that the search results differ depending on the encoding of the search term (as these non-ASCII characters can be encoded in many different ways).
I know for a fact that some tables (within 'character varying' columns) contain non-ASCII characters in UTF-8 encoding and also in some other encoding (maybe latin-1, not sure). For example, the 'ä' character is sometimes present as the correct UTF-8 representation 'C3 A4' but sometimes like this: '61 cc 88'. This is probably because I imported some legacy data into the database that was not in UTF-8. It is not a problem most of the time, as the data is presented correctly in the web application UI anyway. Only the search is the issue, i.e. I cannot figure out a way to make the search find all the relevant entries in the database, as the results vary depending on the search term's encoding (basically the problem is that the search does not match the legacy data correctly, as its non-ASCII characters are mixed up).
Here are the facts on the application:
PostgreSQL 11.5 (via Amazon RDS)
Npgsql v4.1.2
.NET Core 3
React frontend
database encoding UTF8, collation en_US.UTF-8, character type en_US.UTF-8
Any ideas what to do?
Write a batch job to convert all data to UTF-8, update the database, and somehow guard further inserts so that nothing but UTF-8 gets in?
Make PostgreSQL ILIKE ignore the character encoding?
Edit: Found a good page on Unicode normalization in PostgreSQL: https://www.2ndquadrant.com/en/blog/unicode-normalization-in-postgresql-13/
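For what it's worth, '61 cc 88' is 'a' followed by U+0308 COMBINING DIAERESIS, i.e. the decomposed (NFD) form of 'ä'; it is valid UTF-8, just a different normalization form than the precomposed 'C3 A4' (NFC). A minimal sketch of the normalization approach from the linked post, assuming PostgreSQL 13+ (the normalize() function is not available in 11.5) and placeholder table/column names:

    -- Normalize both sides to NFC so precomposed and decomposed forms match
    SELECT *
    FROM customers
    WHERE normalize(name, NFC) ILIKE '%' || normalize('ä', NFC) || '%';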

S3 -> Redshift cannot handle UTF8

We have a file in S3 that is loaded into Redshift via the COPY command. The import is failing because a VARCHAR(20) value contains an Ä which is being translated into ".." during the COPY command and is now too long for the 20 characters.
I have verified that the data is correct in S3, but the COPY command does not understand the UTF-8 characters during the import. Has anyone found a solution for this?
tl;dr
The byte length for your VARCHAR column just needs to be larger.
Detail
Multi-byte (UTF-8) characters are supported in the VARCHAR data type; however, the length that is provided is in bytes, NOT characters.
AWS documentation for Multibyte Character Load Errors states the following:
VARCHAR columns accept multibyte UTF-8 characters, to a maximum of four bytes.
Therefore if you want the character Ä to be allowed, then you need to allow 2 bytes for this character, instead of 1 byte.
AWS documentation for VARCHAR or CHARACTER VARYING states the following:
... so a VARCHAR(120) column consists of a maximum of 120 single-byte characters, 60 two-byte characters, 40 three-byte characters, or 30 four-byte characters.
For a list of UTF-8 characters and their byte lengths, this is a good reference:
Complete Character List for UTF-8
Detailed information for the Unicode Character 'LATIN CAPITAL LETTER A WITH DIAERESIS' (U+00C4) can be found here.
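A minimal sketch of the fix this answer describes, with placeholder table and column names (Redshift measures VARCHAR length in bytes, so an 'Ä' at 2 bytes in UTF-8 needs 2 of them):

    -- Widen the column so multi-byte UTF-8 characters fit within the byte limit
    ALTER TABLE my_table ALTER COLUMN my_col TYPE VARCHAR(40);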
Please check the link below:
http://docs.aws.amazon.com/redshift/latest/dg/multi-byte-character-load-errors.html
You should use ACCEPTINVCHARS in your COPY command. Details here:
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html#acceptinvchars
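A sketch of what that COPY might look like; the table name, bucket, and IAM role are placeholders, and ACCEPTINVCHARS replaces any invalid UTF-8 bytes with the given character instead of failing the load:

    COPY my_table
    FROM 's3://my-bucket/my-file.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    CSV
    ACCEPTINVCHARS AS '?';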
I had a similar experience in which some characters like Ä were not copied correctly when loading mysqldump data into our Redshift cluster. It was because the encoding of the mysqldump file was latin1, which is the default character set of MySQL. It's better to check the character encoding of the files before running COPY. If your files are not encoded in UTF-8, you have to re-encode them first.
Using "ACCEPTINVCHARS ESCAPE" in the COPY command solved the issue for us, with minor data alteration.
You need to increase the size of your VARCHAR column. Check the stl_load_errors table to see the actual field value length for the failed rows and increase the size accordingly.
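A quick sketch of the kind of diagnostic query meant here; stl_load_errors is a Redshift system table, and the column list below is illustrative:

    SELECT starttime, filename, colname, type, col_length,
           raw_field_value, err_reason
    FROM stl_load_errors
    ORDER BY starttime DESC
    LIMIT 20;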
EDIT: Just realized this is a very old post; anyway, leaving it here in case someone needs it.

HSQLDB - How is the "VARCHAR" type storage handled?

My question must sound quite stupid to some of you, but I've been unable to find the answer directly on the HSQLDB website, on Google, or here (I may have missed something, but I don't think so, as there isn't much on the web regarding HSQLDB compared to other well-known databases).
To explain in more detail, my background is more of an Oracle DB background... I'm starting with HSQLDB and I wondered, as we can't use type declarations such as:
"mycolumn VARCHAR(25 CHAR)"
"mycolumn VARCHAR(25 BYTE)"
how storage is managed in HSQLDB, since I have to use "mycolumn VARCHAR(25)" instead of the previous declarations. I would be glad if anyone has a good description or a link explaining how characters are stored, to avoid storage issues with special characters, for example.
Thanks in advance!
Antoine
HSQLDB uses the Unicode character set with UTF-16 encoding. Therefore all possible characters can be stored in a CHAR, VARCHAR or CLOB column. The declared size of a VARCHAR column refers to the maximum number of UTF-16 characters allowed.
The physical storage of VARCHAR data on disk is similar to UTF-8 and takes one byte for each Latin character but more than one for other characters. The user does not see this encoding, and its only significance is the amount of disk space used for long VARCHAR data.
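A small sketch illustrating the point, with a placeholder table name: the declared length counts characters, not bytes, so multi-byte characters do not eat into the limit.

    CREATE TABLE t (mycolumn VARCHAR(25));
    -- 23 characters, many of them multi-byte on disk, still fit in VARCHAR(25)
    INSERT INTO t VALUES ('ÄÖÜäöüßéèêëàâçñ€ЖЩдля漢字');
    SELECT CHAR_LENGTH(mycolumn) FROM t;  -- returns 23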

best character set and collation for European based website

I am going to be building an application which will be used by people all over Europe. I need to know which collation and character set would be best suited for user-inputted data, or whether I should make a separate table for each language. An article or something explaining this would be great.
Thanks :)
Character set: without doubt, UTF-8. Collation: I am not sure there is a good answer to that, but you might want to read this report.
Unicode is a very large character set including nearly all characters from nearly all languages.
There are a number of ways to store Unicode text as a sequence of bytes - these ways are called encodings. All Unicode encodings (well, all complete Unicode encodings) can store all Unicode text as a sequence of bytes, in some format - but the number of bytes that any given piece of text takes will depend on the encoding used.
UTF-8 is a Unicode encoding that is optimized for English and other languages which use very few characters outside the Latin alphabet. UTF-16 is a Unicode encoding which is possibly more appropriate for text in a variety of European languages. Java and .NET store all text in-memory (the String class) as UTF-16 encoded Unicode.
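If the database happens to be MySQL (the question doesn't say), a sketch of the kind of declaration these answers point toward; utf8mb4 covers the full Unicode range and a general Unicode collation sorts reasonably across European languages:

    CREATE TABLE user_input (
        id INT PRIMARY KEY AUTO_INCREMENT,
        content TEXT
    ) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;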

Restoring a utf8 encoded database to an iso 1 server - effect on strings

We are doing a migration of our Sybase database, which has UTF-8 encoding, to a server with ISO-1 encoding. We are only using CHAR and VARCHAR for our strings. Will doing a backup and restore truncate any strings? I was thinking that CHAR and VARCHAR are just single-byte characters.
Any characters outside the ASCII range will likely get malformed/corrupted when you save UTF-8 data as ISO-8859-1. UTF-8 stores characters outside the ASCII range in multiple bytes. I would rather set up the target table to use UTF-8 encoding, since that's the encoding of today and the future.
You CANNOT migrate Unicode data to ISO-8859 data. 99.5% of Unicode characters cannot be represented in ISO-8859. If you happen to only have Latin-1 characters in your data, then it's a no-op; otherwise it is undefined whether your migration tool will choke, whether it will report success but corrupt your data, whether it will preserve what's possible and insert wrong characters for the impossible-to-represent ones, or whether it will omit some characters entirely.
Remember that Unicode contains tens of thousands of characters, and ISO-8859-1 only 256. What you are trying to do can have many outcomes, but "everything works correctly" is not one of them.