S3 -> Redshift cannot handle UTF8 - amazon-s3

We have a file in S3 that is loaded into Redshift via the COPY command. The import is failing because a VARCHAR(20) value contains an Ä, which is being translated into .. during the COPY command and now exceeds the 20-character limit.
I have verified that the data is correct in S3, but the COPY command does not understand the UTF-8 characters during import. Has anyone found a solution for this?

tl;dr
The byte length for your VARCHAR column just needs to be larger.
Detail
Multi-byte (UTF-8) characters are supported in the VARCHAR data type; however, the length that you provide is in bytes, NOT characters.
AWS documentation for Multibyte Character Load Errors states the following:
VARCHAR columns accept multibyte UTF-8 characters, to a maximum of four bytes.
Therefore if you want the character Ä to be allowed, then you need to allow 2 bytes for this character, instead of 1 byte.
AWS documentation for VARCHAR or CHARACTER VARYING states the following:
... so a VARCHAR(120) column consists of a maximum of 120 single-byte characters, 60 two-byte characters, 40 three-byte characters, or 30 four-byte characters.
For a list of UTF-8 characters and their byte lengths, this is a good reference:
Complete Character List for UTF-8
Detailed information for the Unicode Character 'LATIN CAPITAL LETTER A WITH DIAERESIS' (U+00C4) can be found here.
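As a rough sketch of the fix (the table and column names here are made up), widening the column gives the multi-byte value room to fit; Redshift allows increasing a VARCHAR column's size in place:
-- The original column was VARCHAR(20), i.e. 20 bytes.
-- 'Ä' takes 2 bytes in UTF-8, so a 20-character value containing it no longer fits.
-- Sizing for the worst case (4 bytes per character) is the safe choice.
ALTER TABLE my_schema.my_table
    ALTER COLUMN my_varchar_col TYPE VARCHAR(80);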

Please check the link below:
http://docs.aws.amazon.com/redshift/latest/dg/multi-byte-character-load-errors.html
You should use ACCEPTINVCHARS in your COPY command. Details here:
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html#acceptinvchars
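A hedged sketch of such a COPY (the table, bucket, and IAM role are placeholders):
COPY my_table
FROM 's3://my-bucket/path/to/file.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV
ACCEPTINVCHARS AS '?';
-- ACCEPTINVCHARS loads rows containing invalid UTF-8 bytes by replacing each
-- invalid byte with the given character instead of failing the load;
-- STL_REPLACEMENTS records what was substituted.
Note that this only helps when the file really contains invalid byte sequences; for valid UTF-8 that simply exceeds the declared byte length, widening the column is still needed.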

I had a similar experience in which some characters, like Ä, were not copied correctly when loading mysqldump data into our Redshift cluster. It was because the encoding of the mysqldump was latin1, which is the default character set of MySQL. It's better to check the character encoding of the files before running COPY; if your files are not UTF-8, you have to re-encode them as UTF-8 first.
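If the source is MySQL, a quick sanity check before dumping might look like this (purely illustrative):
-- Show which character sets the server, database, and client are using.
SHOW VARIABLES LIKE 'character_set%';
-- If the dump was produced as latin1, re-export it as UTF-8
-- (for example with mysqldump --default-character-set=utf8mb4)
-- or convert the file before running COPY.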

Using "ACCEPTINVCHARS ESCAPE" in the copy command solved the issue for us with minor data alteration.

You need to increase the size of your VARCHAR column. Check the stl_load_errors table to see the actual field value length for the failed rows, and increase the size accordingly.
EDIT: just realized this is a very old post; anyway, in case someone needs it...
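A sketch of that check against stl_load_errors (column list trimmed to the fields that matter here):
-- Most recent load errors: which column failed, its declared length,
-- and the raw value that caused the failure.
SELECT starttime, filename, colname, type, col_length,
       raw_field_value, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;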

Related

Text format in Databricks (Azure) to CSV

I have special characters and foreign-language text in my data set.
When I run a SQL query (SELECT * FROM table1), the results are fine; I get them in the required format.
E.g. output of the SQL query:
1 | 请回复邮件
2 | Don’t know
When the same data is exported to a CSV on my local machine, the text changes to weird symbols.
Exported data:
1 | 请回å¤é‚®
2 | Don’t know
How do I get the same format to CSV as in SQL?
Ensure that UTF-8 encoding is used instead of ISO-8859-1/Windows-1252 encoding in the browser and editor.
It's better to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci going forward. (A revised version of the Unicode collation is in the works.)
utf8mb4 is a superset of utf8, as it can support 4-byte utf8 codes, which are required by Emoji and some Chinese characters.
"UTF-8" outside of MySQL refers to all size encodings, thus it's basically the same as MySQL's utf8mb4, not utf8.
The data can't be trusted when viewed with a tool or with SELECT: too many of these clients, particularly browsers, try to compensate for incorrect encodings by displaying proper text even though the database contents are messed up. So pick a table and column with some non-English content and check the actual bytes:
SELECT col, HEX(col) FROM tbl WHERE ...
For correctly stored UTF-8, the HEX will be:
20 (in any language) for a space
4x, 5x, 6x, or 7x for English
Cxyy for accented letters in much of Western Europe
Dxyy for Cyrillic, Hebrew, and Farsi/Arabic
Exyyzz for most of Asia
F0yyzzww for Emoji and some Chinese characters
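For example, against a hypothetical table holding the rows shown above, a correctly stored 请 should hex-dump as E8AFB7 (an Exyyzz pattern):
-- id and col are placeholder names; inspect the raw bytes of the stored text.
SELECT id, col, HEX(col)
FROM tbl
WHERE id IN (1, 2);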
There are a few fixes listed here: Fixes for various cases.
After downloading the data from Databricks, open the CSV in Notepad, choose Save As, and select UTF-8 under the Encoding option.

SSIS error: "UTF8" has no equivalent in encoding "WIN1252"

I'm using an SSIS package to extract data from a Postgres database, but I'm getting the following error for one of the tables:
Character with byte sequence 0xef 0xbf 0xbd in encoding "UTF8" has no equivalent in encoding "WIN1252"
I have no idea how to resolve it. I changed all the columns in the SQL table to NVARCHAR(MAX), but it made no difference. Please suggest a solution.
The full Unicode character set (as encoded in UTF-8) contains tens of thousands of different characters; WIN1252 contains 256. Your data contains characters that cannot be represented in WIN1252.
You either need to export to a more useful character encoding, remove the "awkward" characters from the source database, or do some (lossy) translation within SSIS itself (I believe "Character Map transformation" is what you want to search for).
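The byte sequence 0xEF 0xBF 0xBD is U+FFFD, the Unicode replacement character, which usually means the text was already mangled before it reached Postgres. A hedged sketch for locating the affected rows (table and column names are placeholders):
-- chr(65533) is U+FFFD when the server encoding is UTF8.
SELECT id, my_col
FROM my_table
WHERE my_col LIKE '%' || chr(65533) || '%';
Once you have found those rows, you can decide whether to clean them up or to export with a UTF-8 target encoding instead of WIN1252.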
I would recommend first spending an hour or so googling the subject of Unicode, its UTF encodings, and their relationship to the ISO and WIN character sets. That way you will understand which of the above to choose.

HSQLDB - How is the "VARCHAR" type storage handled?

My question may seem quite stupid to some of you, but I've been unable to find the answer directly on the HSQLDB website, on Google, or here (maybe I've missed something, but I don't think so, as there isn't much on the web about HSQLDB compared to other well-known databases).
To explain in more detail, my background is more of an Oracle DB background... I'm starting with HSQLDB and I wondered, since we can't use type declarations such as:
"mycolumn VARCHAR(25 CHAR)"
"mycolumn VARCHAR(25 BYTE)"
how storage is managed in HSQLDB, as I have to use "mycolumn VARCHAR(25)" instead. I would be glad if anyone has a good description or a link explaining how characters are stored, to avoid storage issues with special characters, for example.
Thanks in advance!
Antoine
HSQLDB uses the Unicode character set with UTF-16 encoding. Therefore all possible characters can be stored in a CHAR, VARCHAR or CLOB column. The declaration size of a VARCHAR column refers to the maximum number of UTF-16 characters allowed.
The physical storage of VARCHAR data on disk is similar to UTF-8 and takes one byte per Latin character but more than one byte for other characters. The user does not see this encoding; its only significance is the amount of disk space used for long VARCHAR data.
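A small sketch (table and column names are arbitrary) showing that the declared length counts characters, not bytes:
-- VARCHAR(5) in HSQLDB means 5 characters, however many bytes each one
-- happens to need on disk.
CREATE TABLE t (mycolumn VARCHAR(5));
INSERT INTO t VALUES ('ÄÄÄÄÄ');    -- five characters: accepted
-- INSERT INTO t VALUES ('ABCDEF'); -- six characters: rejected as too long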

Restoring a utf8 encoded database to an iso 1 server - effect on strings

We are migrating our Sybase database, which uses UTF-8 encoding, to a server with ISO-1 (ISO 8859-1) encoding. We only use CHAR and VARCHAR for our strings. Will a backup and restore truncate any strings? I was thinking that CHAR and VARCHAR are just single-byte characters.
Any characters outside the ASCII range will likely get malformed or corrupted when you store UTF-8 data as ISO-8859-1, because UTF-8 encodes those characters with multiple bytes. I would rather set up the target to use UTF-8 encoding, since that's the encoding of today and the future.
You CANNOT migrate Unicode data to ISO-8859 data. 99.5% of Unicode characters cannot be represented in ISO-8859. If you happen to only have Latin-1 characters in your data, then it's a no-op; otherwise it is undefined whether your migration tool will choke, whether it will report success but corrupt your data, whether it will preserve what's possible and insert wrong characters for impossible-to-represent characters, whether it will omit some characters...
Remember that Unicode contains tens of thousands of characters, and ISO-8859-1 only 256. What you are trying to do can have many outcomes, but "everything works correctly" is not one of them.

How much difference does BLOB or TEXT make in comparison with VARCHAR()?

If I don't know the length of a text entry (e.g. a blog post, description or other long text), what's the best way to store it in MYSQL?
TEXT is the most appropriate type for text of unknown size. VARCHAR is limited to 65,535 characters as of MySQL 5.0.3 (255 characters in earlier versions), so if you can safely assume your text will fit, it can be a better choice.
BLOB is for binary data, so unless you expect your text to be in binary format it is the least suitable column type.
For more information, refer to the MySQL documentation on string column types.
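As a quick illustration (the schema is made up), a blog-post table would typically pair a bounded VARCHAR with an unbounded TEXT column:
CREATE TABLE posts (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL,   -- known, bounded length
    body  TEXT NOT NULL            -- unknown length: blog post, description, etc.
) CHARACTER SET utf8mb4;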
Use TEXT if you want it treated as a character string, with a character set.
Use BLOB if you want it treated as a binary string, without a character set.
I recommend using TEXT.