How to import a UTF8 table in another encoding (WIN1251, SQL_ASCII) with COPY? - sql

Background: Hello, I have seen many questions about encoding in Postgres, but none answered this one.
I have a UTF8 table and I am using COPY to export that table to CSV, and I need to run COPY with different encodings such as WIN1251 and SQL_ASCII.
Problem: When the table contains characters that are not supported in WIN1251/SQL_ASCII, I get the classic error
character with byte sequence 0xe7 0xb0 0xab in encoding "UTF8" has no equivalent in encoding "WIN1251"
I tried SET client_encoding, convert and convert_to, without success.
Main question: Is there any way to do this without an error using SQL?

There is simply no way to convert 簫 into Windows-1251, so you can forget about that.
If you set the client encoding to SQL_ASCII, you will be able to load the data into an SQL_ASCII database, but that is of little use, since the database does not recognize it as a character, only as three meaningless bytes above 127.
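The byte sequence from the error message can be inspected outside the database. The following is a minimal Python sketch, not PostgreSQL code: Python's `cp1251` codec is assumed to behave like WIN1251 for these characters, and `unrepresentable` is a hypothetical helper name. It decodes the three bytes and lists which characters of a string would fail the conversion.

```python
def unrepresentable(text: str, encoding: str = "cp1251") -> list:
    """Return the distinct characters of `text` that `encoding` cannot encode."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad


# The bytes from the error message, decoded as UTF-8.
char = b"\xe7\xb0\xab".decode("utf-8")
print(f"U+{ord(char):04X}")  # code point of the offending character

# Cyrillic text fits in cp1251; the CJK character does not.
print(unrepresentable("Привет " + char))
```

Rows flagged by a check like this would have to be fixed, filtered out, or transliterated before an export to WIN1251 can succeed.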

Related

Encoding issue in Postgres ERROR "UTF8" is it best to set encoding to UTF8 or to make the data WIN1252 compatible?

I created a table by importing a CSV file exported from an Excel spreadsheet. When I run the SELECT statement below, I get this error:
test=# SELECT * FROM dt_master;
ERROR: character with byte sequence 0xc2 0x9d in encoding "UTF8" has no equivalent in encoding "WIN1252"
I have read the solution posted in this Stack Overflow post and was able to overcome the issue by setting the encoding to UTF8, so for now I am still able to keep working with the data. My question, however, is whether setting the encoding to UTF8 actually solves the problem or is just a workaround that will create other problems down the road, in which case I would be better off removing the conflicting characters and making the data WIN1252 compliant.
Thank you
You have a weird character in your database (Unicode code point U+009D, a control character) that probably got there by mistake.
You have to set the client encoding to the encoding that your application expects; no other value will produce correct results, even if you get rid of the error. The error has a reason.
You have two choices:
Fix the data in the database. The character is very likely not what was intended.
Change the application to use LATIN1 or (better) UTF-8 internally and set the client encoding appropriately.
Using UTF-8 everywhere would have the advantage that you are safe from this kind of problem.
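To see what the two bytes in the error message stand for, here is a small Python sketch. It assumes that Python's `cp1252` and `latin-1` codecs agree with PostgreSQL's WIN1252 and LATIN1 for this character; it is an illustration, not what the server runs.

```python
import unicodedata

# The bytes from the error message, decoded as UTF-8.
char = b"\xc2\x9d".decode("utf-8")

# Code point and Unicode general category ("Cc" means control character).
print(f"U+{ord(char):04X}", unicodedata.category(char))

for codec in ("cp1252", "latin-1"):
    try:
        print(codec, char.encode(codec))
    except UnicodeEncodeError:
        print(codec, "has no mapping for this character")
```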

How to download a csv from a query but keeping the original encoding in pgadmin

I am Brazilian and I am working with files that are encoded in Windows-1252. When I execute queries the names look fine, but when I try to export the data to Excel using the CSV download, I run into an encoding problem and all the letters with accents are broken.
I want to know how to change the encoding or the collation of the CSV download for queries so that it has the same encoding as the imported data.
The code I used to import the data is
COPY base_ans_02 FROM 'C:\Users\ben201907_SP.csv' DELIMITER ','
CSV HEADER encoding 'windows-1252';
and one example of the error is
AMIL ASSISTÊNCIA MÉDICA INTERNACIONAL S.A.
If you inserted the data into your table using the WIN1252 encoding and that is not your client's default, you might also want to make sure the client knows which encoding it is going to deal with.
Just set the client encoding right before your COPY command and you should be fine:
SET CLIENT_ENCODING=WIN1252;
COPY base_ans_02 TO 'path_to_file' DELIMITER ',' CSV HEADER;
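The symptom described in the question (accented letters broken in Excel) is what happens when bytes written in one encoding are read in another. The Python sketch below illustrates this under the assumption, not confirmed by the question, that the download was written as UTF-8 and opened as Windows-1252; Python's `cp1252` codec stands in for WIN1252.

```python
name = "AMIL ASSISTÊNCIA MÉDICA INTERNACIONAL S.A."

utf8_bytes = name.encode("utf-8")      # what a UTF-8 export would write
cp1252_bytes = name.encode("cp1252")   # what a WIN1252 export would write

# Reading UTF-8 bytes as Windows-1252 garbles the accented letters.
print(utf8_bytes.decode("cp1252"))

# Reading the bytes with the encoding they were written in round-trips cleanly.
print(cp1252_bytes.decode("cp1252"))
```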

How to get text bytes used by a string in Hive?

I have some data in a Hive 1.2.1 table. I have to get the raw bytes of a specific column. The column data is raw HTML in multiple languages. To get the length in characters, I can use a simple query like the one below
select baseurl, LENGTH(content) from clss limit 30;
The above query is fine for character length, but for text other than English the value is incorrect. An Arabic character is stored as Unicode, which is why its length differs: some characters take two bytes and some a single byte.
Is there any built-in function that returns the number of bytes in a text instead of the number of characters?
Function character_length(string str) was added in Jira HIVE-15979, which lists fix version 2.3.0. If you cannot upgrade your Hive (and this is quite risky), try downloading the UDF source code and building it, then add the jar and create a temporary function.
Download code: GenericUDFCharacterLength.java
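The difference between character length and byte length that the question is about can be shown in a few lines of Python. This is an illustration of the concept only, not Hive code; it assumes the text is stored as UTF-8, and `char_and_byte_length` is a hypothetical helper name.

```python
def char_and_byte_length(text: str, encoding: str = "utf-8") -> tuple:
    """Return (number of characters, number of bytes when encoded)."""
    return len(text), len(text.encode(encoding))


english = "hello"
arabic = "\u0645\u0631\u062d\u0628\u0627"  # five Arabic letters

print(char_and_byte_length(english))  # ASCII letters take one byte each in UTF-8
print(char_and_byte_length(arabic))   # these Arabic letters take two bytes each
```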

Illegal xml parsing import to sql mac roman

I have an XML file that says its encoding is UTF-8. When I use OPENXML to import the data into SQL, I always get "XML parsing: line xxxxxx, character xx, illegal xml character".
Right now I can go to each line and replace the character with a legal one, and the import goes well. Sometimes there may be more than 5 Mac Roman characters and it becomes tedious to replace them. I am currently using Notepad++ and there is probably a way to do this there.
Can anyone suggest whether anything can be done at the SQL level, or does the file have to be checked before it is run in SQL?
So far, most of the characters found are x95, x92, x96, xbc, xbd, xb0.
Thanks.
In your question, you did not specify whether the illegal characters you had to remove were Unicode or not, or whether the file was really expected to contain UTF-8 characters. Unlike ASCII, UTF-8 has byte combinations that are illegal, so if you declare the text file to be encoded in UTF-8, you might not be able to read it successfully to the end (such a thing could never happen with ASCII).
So it is possible that by removal of <?xml version="1.0" encoding="UTF-8"?> you just declared some non-unicode encoding of your file (instead of previously declared UTF-8), so reading the data passed. You did not have many foreign characters like ľťčý in the file, did you? Normally, it is a must that you check what happened to those after the import. It might happen that your import passes without error, but city name Čadca becomes äadca and somebody will thank your company for rendering his address unreadable.
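Checking the file before the import can be automated. The following is a minimal Python sketch with a hypothetical helper, `invalid_utf8_lines`, that reports the first invalid byte of every line that is not valid UTF-8. Decoding the stray byte as `cp1252` is only an assumption about what it was meant to be; the question calls the characters Mac Roman, in which case the codec name would need to change.

```python
def invalid_utf8_lines(data: bytes) -> list:
    """Return (line_number, byte_offset, offending_byte) for the first
    invalid byte of each line that cannot be decoded as UTF-8."""
    problems = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as exc:
            problems.append((line_no, exc.start, line[exc.start:exc.start + 1]))
    return problems


# Made-up sample data with two stray single bytes.
sample = b"<row>ok</row>\n<row>bullet \x95 here</row>\n<row>quote \x92</row>\n"

for line_no, offset, byte in invalid_utf8_lines(sample):
    # cp1252 is an assumed source encoding for the stray byte.
    print(line_no, offset, byte, "->", byte.decode("cp1252"))
```

After any such repair, the imported text still needs to be checked by eye, for the reason given above: an import that passes without error can still contain wrong characters.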

How to declare a SQL INSERT Statement with a Unicode letter [duplicate]

I have a SQL statement that contains a Unicode-specific character. The character is ę in the Polish word Przesunięcie. Please look at the following SQL INSERT statement:
INSERT INTO res_bundle_props (res_bundle_id, value, name)
VALUES(2, 'Przesunięcie przystanku', 'category.test');
I work with a Postgres database. How can I insert the Polish word with the Unicode letter?
Find what are the server and client encodings:
show server_encoding;
server_encoding
-----------------
UTF8
show client_encoding;
client_encoding
-----------------
UTF8
Then set the client to the same encoding as the server:
set client_encoding = 'UTF8';
SET
No special syntax is required so long as:
Your server_encoding includes those characters (if it's utf-8 it does);
Your client_encoding includes those characters;
Your client_encoding correctly matches the encoding of the bytes you're actually sending
The latter is the one that often trips people up. They think they can just change client_encoding with a SET client_encoding statement and it'll do some kind of magical conversion. That is not the case. client_encoding tells PostgreSQL "this is the encoding of the data you will receive from the client, and the encoding that the client expects to receive from you".
Setting client_encoding to utf-8 doesn't make the client actually send UTF-8. That depends on the client. Nor do you have to send utf-8; that string can also be represented in iso-8859-2, iso-8859-4 and iso-8859-10 among other encodings.
What's crucial is that you tell the server the encoding of the data you're sending. As it happens that string is the same in all three of the encodings mentioned, with the ę encoded as 0xea... but in utf-8 that'd be the two bytes 0xc4 0x99. If you send utf-8 to the server and tell it that it's iso-8859-2 the server can't tell you're wrong and will interpret it as Ä followed by a control character in iso-8859-2.
So... really, it depends on things like the system's default encoding, the encoding of any files/streams you're reading data from, etc. You have two options:
Set client_encoding appropriately for the data you're working with and the default display locale of the system. This is easiest for simple cases, but harder when dealing with multiple different encodings in input or output.
Set client_encoding to utf-8 (or the same as server_encoding) and make sure that you always convert all input data into the encoding you set client_encoding to before sending it. You must also convert all data you receive from Pg back.
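The byte-level behaviour described in this answer can be reproduced with Python's codecs, which are assumed here to match the PostgreSQL encodings of the same names. The sketch shows the single byte used for ę in the three ISO-8859 variants, the two bytes used in UTF-8, and what the word turns into when UTF-8 bytes are declared as ISO-8859-2.

```python
word = "Przesunięcie przystanku"

# The same single byte represents ę in these three ISO-8859 variants.
for codec in ("iso-8859-2", "iso-8859-4", "iso-8859-10"):
    print(codec, "ę".encode(codec))

# In UTF-8 the same letter takes two bytes.
utf8 = "ę".encode("utf-8")
print("utf-8", utf8)

# UTF-8 bytes declared as ISO-8859-2 decode without any error,
# but the result is different text.
misread = word.encode("utf-8").decode("iso-8859-2")
print(repr(misread))
```

No exception is raised in the last step, which is the point of the answer: the receiving side cannot detect a wrongly declared encoding here, so the declaration has to match the bytes actually sent.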