Encoding issue in Postgres (ERROR involving "UTF8"): is it best to set the encoding to UTF8 or to make the data WIN1252 compatible?

I created a table by importing a CSV file from an Excel spreadsheet. When I try to run the SELECT statement below, I get the following error.
test=# SELECT * FROM dt_master;
ERROR: character with byte sequence 0xc2 0x9d in encoding "UTF8" has no equivalent in encoding "WIN1252"
I have read the solution posted in this Stack Overflow post and was able to overcome the issue by setting the encoding to UTF8, so up to that point I am still able to keep working with the data. My question, however, is whether setting the encoding to UTF8 actually solves the problem, or whether it is just a workaround that will create other problems down the road, in which case I would be better off removing the conflicting characters and making the data WIN1252 compliant.
Thank you

You have a weird character in your database (Unicode code point 9D, a control character) that probably got there by mistake.
You have to set the client encoding to the encoding that your application expects; no other value will produce correct results, even if you get rid of the error. The error has a reason.
You have two choices:
Fix the data in the database. The character is very likely not what was intended. (A cleanup sketch follows below.)
Change the application to use LATIN1 or (better) UTF-8 internally and set the client encoding appropriately.
Using UTF-8 everywhere would have the advantage that you are safe from this kind of problem.
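If you choose the first option, here is a minimal cleanup sketch, assuming the server encoding is UTF8 (so chr(157) yields U+009D) and that the bad character sits in a text column; the column name col is hypothetical:
SELECT * FROM dt_master WHERE col LIKE '%' || chr(157) || '%';  -- locate rows containing the stray U+009D
UPDATE dt_master SET col = replace(col, chr(157), '');          -- strip the control character
Run the SELECT first to confirm nothing meaningful is being removed.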

Related

How to import a UTF8 table in another encoding (WIN1251, SQL_ASCII) with COPY()?

Prehistory: Hello, I have seen many questions about encoding in Postgres, but none of them answered this one.
I have a UTF8 table, and I'm using COPY to export that table to CSV. I need to run the COPY with different encodings, such as WIN1251 and SQL_ASCII.
Problem: when the table contains characters that are not supported in WIN1251/SQL_ASCII, I get the classic error
character with byte sequence 0xe7 0xb0 0xab in encoding "UTF8" has no equivalent in encoding "WIN1251"
I tried using "set client_encoding / convert / convert_to" with no success.
Main question: is there any way to do this without an error, using SQL?
There is simply no way to convert 簫 into WIN1251, so you can forget about that.
If you set the client encoding to SQL_ASCII, you will be able to load the data into an SQL_ASCII database, but that is of little use, since the database will not recognize it as a character, just as three meaningless bytes above 127.
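For completeness: COPY itself accepts a target encoding, and the export succeeds only if every character is representable in it. If losing the unconvertible characters is acceptable, here is a rough sketch; my_table and col are placeholder names, and the character class (printable ASCII plus the Cyrillic block) is only an approximation of the WIN1251 repertoire:
COPY my_table TO '/tmp/out.csv' WITH (FORMAT csv, ENCODING 'WIN1251');  -- fails if unconvertible characters remain
-- lossy alternative: substitute anything outside the approximation with '?'
COPY (
    SELECT regexp_replace(col, '[^\x20-\x7e\u0400-\u04ff]', '?', 'g') AS col
    FROM my_table
) TO '/tmp/out.csv' WITH (FORMAT csv, ENCODING 'WIN1251');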

Illegal xml parsing import to sql mac roman

I have an XML file that says its encoding is UTF-8. When I use OPENXML to import the data into SQL, I always get "XML parsing: line xxxxxx, character xx, illegal xml character".
Right now I can go to each line and replace the offending character with a legal one, and it works. Sometimes there may be more than five Mac Roman characters, and it becomes tedious to replace them all. I am currently using Notepad++, and there is probably a better way to do this.
Can anyone suggest whether anything can be done at the SQL level, or does the file have to be checked before it is run through SQL?
So far, most of the characters found are x95, x92, x96, xbc, xbd, xb0.
Thanks.
In your question, you did not specify whether the illegal characters you had to remove were Unicode or not, or whether the file was really expected to contain UTF-8 characters. Unlike ASCII, UTF-8 has byte combinations that are illegal, so if you declare the text file to be encoded in UTF-8, you might not be able to read it successfully to the end (such a thing could never happen with ASCII).
So it is possible that by removing <?xml version="1.0" encoding="UTF-8"?> you effectively declared some non-Unicode encoding for your file (instead of the previously declared UTF-8), so reading the data passed. You did not have many foreign characters like ľťčý in the file, did you? Normally, it is a must to check what happened to those after the import. It might happen that your import passes without error, but the city name Čadca becomes äadca, and somebody will thank your company for rendering his address unreadable.
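If you do want to handle it at the SQL level, one possibility is to scrub the known-bad characters before the text reaches OPENXML. A rough T-SQL sketch, with entirely hypothetical names, assuming the Windows-1252 bytes you listed (0x92, 0x95, 0x96) arrive in the string as those same code points:
DECLARE @xmlText NVARCHAR(MAX) =
    N'<root><q>spouse' + NCHAR(146) + N's</q></root>';
-- map the troublesome code points onto plain ASCII before parsing
SET @xmlText = REPLACE(@xmlText, NCHAR(146), N'''');  -- 0x92: curly apostrophe -> '
SET @xmlText = REPLACE(@xmlText, NCHAR(149), N'*');   -- 0x95: bullet
SET @xmlText = REPLACE(@xmlText, NCHAR(150), N'-');   -- 0x96: en dash
-- then hand the scrubbed text to OPENXML as usual
DECLARE @doc INT;
EXEC sp_xml_preparedocument @doc OUTPUT, @xmlText;
SELECT q FROM OPENXML(@doc, '/root', 2) WITH (q NVARCHAR(200));
EXEC sp_xml_removedocument @doc;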

How to declare a SQL INSERT Statement with a Unicode letter [duplicate]

This question already has an answer here:
Can not insert German characters in Postgres
(1 answer)
Closed 9 years ago.
I have a SQL statement which contains a Unicode-specific character. The character is ę in the Polish word Przesunięcie. Please look at the following SQL INSERT statement:
INSERT INTO res_bundle_props (res_bundle_id, value, name)
VALUES(2, 'Przesunięcie przystanku', 'category.test');
I work with a Postgres database. How can I insert the Polish word with the Unicode letter?
Find out what the server and client encodings are:
show server_encoding;
server_encoding
-----------------
UTF8
show client_encoding;
client_encoding
-----------------
UTF8
Then set the client to the same encoding as the server:
set client_encoding = 'UTF8';
SET
No special syntax is required so long as:
Your server_encoding includes those characters (if it's utf-8 it does);
Your client_encoding includes those characters;
Your client_encoding correctly matches the encoding of the bytes you're actually sending
The latter is the one that often trips people up. They think they can just change client_encoding with a SET client_encoding statement and it'll do some kind of magical conversion. That is not the case. client_encoding tells PostgreSQL "this is the encoding of the data you will receive from the client, and the encoding that the client expects to receive from you".
Setting client_encoding to utf-8 doesn't make the client actually send UTF-8. That depends on the client. Nor do you have to send utf-8; that string can also be represented in iso-8859-2, iso-8859-4 and iso-8859-10 among other encodings.
What's crucial is that you tell the server the encoding of the data you're sending. As it happens, that string is the same in all three of the encodings mentioned, with the ę encoded as the single byte 0xea... but in utf-8 it would be the two bytes 0xc4 0x99. If you send utf-8 to the server and tell it that it's iso-8859-2, the server can't tell you're wrong and will interpret those two bytes as two iso-8859-2 characters (Ä followed by a control character) rather than ę.
So... really, it depends on things like the system's default encoding, the encoding of any files/streams you're reading data from, etc. You have two options:
Set client_encoding appropriately for the data you're working with and the default display locale of the system. This is easiest for simple cases, but harder when dealing with multiple different encodings in input or output (a sketch follows after this list).
Set client_encoding to utf-8 (or the same as server_encoding) and make sure that you always convert all input data into the encoding you set client_encoding to before sending it. You must also convert all data you receive from Pg back.
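A hedged sketch of the first option, assuming your terminal or script file really is ISO-8859-2 (LATIN2) encoded:
SET client_encoding = 'LATIN2';            -- declare what the client actually sends
INSERT INTO res_bundle_props (res_bundle_id, value, name)
VALUES (2, 'Przesunięcie przystanku', 'category.test');
RESET client_encoding;                     -- back to the session default
The server converts from the declared client encoding to its UTF8 storage encoding on the way in; nothing about the INSERT syntax itself changes.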

Rails utf-8 problem

Hi there, I'm new to Ruby (and Rails) and having some problems when using Swedish letters in strings. In my action I create an instance variable like this:
@title = "Välkommen"
And I get the following error:
invalid multibyte char (US-ASCII)
syntax error, unexpected $end, expecting keyword_end
#title = "Välkommen"
^
What's happening?
EDIT: If I add:
# coding: utf-8
at the top of my controller, it works. Why is that, and how can I solve this "issue"?
See Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
To quote the part that answers this question concisely:
The Single Most Important Fact About Encodings
If you completely forget everything I just explained, please remember
one extremely important fact. It does not make sense to have a string
without knowing what encoding it uses. You can no longer stick your
head in the sand and pretend that "plain" text is ASCII.
This is why you must tell Ruby what encoding is used in your file. Since the encoding is not marked in some sort of metadata associated with your file, some software assumes ASCII until it knows better. Ruby 1.9 probably does so until it reaches your comment, at which point it stops and restarts reading the file, now decoding it as utf-8.
Obviously, if you used some other Unicode encoding or some more local encoding for your ruby file, you would need to change the comment to indicate the correct encoding.
The "magic comment" in Ruby 1.9 (on which Rails 3 is based) tells the interpreter what encoding to expect. It is important because in Ruby 1.9, every string has an encoding. Prior to 1.9, every string was just a sequence of bytes.
A very good description of the issue is in James Gray's series of blog posts on Ruby and Unicode. The one that is exactly relevant to your question is http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings (but see the others because they are very good).
The important line from the article:
The first is the main rule of source Encodings: source files receive a US-ASCII Encoding, unless you say otherwise.
There are several places that can cause problems with utf-8 encoding, but a few tricks solve most of them:
1. Make sure that every file in your project is utf-8 based (if you are using RadRails, this is simple to accomplish: mark your project, select Properties, and in the "text-file-encoding" box select "other: utf-8").
2. Be sure to put your special "å, ä, ö" characters into your files again, or you'll get a MySQL error, because the conversion will have changed your "å, ä, ö" into a "square" (unknown character).
3. In your database.yml, set the encoding for each server environment (in this example "development" with MySQL):
development:
  adapter: mysql
  encoding: utf8
4. Set a before filter in your application controller (application.rb):
class ApplicationController < ActionController::Base
  before_filter :set_charset

  def set_charset
    headers["Content-Type"] = "text/html; charset=utf-8"
  end
end
5. Be sure to set the encoding to utf-8 in your MySQL database (I've only used MySQL, so I don't know about other databases) for every table. If you use MySQL Administrator you can do it like this: edit the table, go to the "Table Options" tab, change the charset to "utf8" and the collation to "utf8_general_ci" (a plain-SQL equivalent is sketched below).
(Courtesy: kombatsanta)
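For reference, a plain-SQL equivalent of the MySQL Administrator steps in point 5 (the table name is a placeholder):
ALTER TABLE my_table CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;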

Replace character in SQL results

This is from an Oracle SQL query. It has these weird skinny rectangle shapes in the database in places where apostrophes should be. (I wish we could paste screenshots in here.)
It looks like this when I copy and paste the results:
spouse�s
Is there a way to write a SQL SELECT statement that searches for this character in the field and replaces it with an apostrophe in the results?
Edit: I need to change only the results in a SELECT statement for reporting purposes; I can't change the database.
I ran this
select dump('�') from dual;
which returned
Typ=96 Len=3: 239,191,189
This seems to work so far
select translate('What is your spouse�s first name?', '�', '''') from dual;
but this doesn't work
select translate(Fieldname, '�', '''') from TableName
Select FN from TN
What is your spouse�s first name?
SELECT DUMP(FN, 1016) from TN
Typ=1 Len=33 CharacterSet=US7ASCII: 57,68,61,74,20,69,73,20,79,6f,75,72,20,73,70,6f,75,73,65,92,73,20,66,69,72,73,74,20,6e,61,6d,65,3f
EDIT:
So I have established that it is the 0x92 character. I can't get the DB updated, so I'm trying this code
SELECT REGEX_REPLACE(FN,"\0092","\0027") FROM TN
and I'm getting ORA-00904: "Regex_Replace": invalid identifier
This seems to be a problem with your charset configuration. Check your NLS_LANG and other NLS_xxx environment/registry values. You have to check the Oracle server, your client, and the client of whoever inserted that data.
Try to DUMP the value. You can do it with a SELECT as simple as:
SELECT DUMP(the_column)
FROM xxx
WHERE xxx
UPDATE: Before you try to replace anything, look for the root of the problem. If this happened because of a charset problem, you can end up with seriously corrupted data.
UPDATE 2: Answering the comments. The problem may not be on the database server side; it may be on the client side. If so, it can be a translation issue in the server-to/from-client communication, i.e. a bad server-client configuration. For instance, if the server is defined with the UTF8 charset and your client uses US7ASCII, then all accented characters will appear as ?.
Alternatively, if the server is defined with the UTF8 charset and your client is also UTF8, but the application is not able to display UTF-8 characters, then the problem is on the application side.
UPDATE 3: On your examples:
select translate('What is your spouse�s first name?', '�', '''') works because the � you pasted on both sides is exactly the same character.
select translate(Fieldname, '�', '''') does not work because the � is not what is stored in the database; it is the character your client receives, perhaps because some translation occurs between the data in the table and what is shown to you.
Next step: look at the DUMP syntax and try to extract the codes for the mysterious character from the table itself, not by pasting �.
I would say there's a good chance the character is a single-tick "smart quote" (I hate the name). The smart quotes are characters 0x91-0x94 in the Windows-1252 encoding, or Unicode U+2018, U+2019, U+201C, and U+201D.
I'm going to propose a front-end application-based, client-side approach to the problem:
I suspect that this problem has more to do with a mismatch between the font you are using to display the word spouse�s and the character �. That icon appears when you try to display a character whose glyph is missing from the Unicode font in use.
The Oracle database will dutifully return whatever characters were INSERTed into its columns. It is up to you and your application to interpret how the data will look in the font you are displaying it with, so I suggest investigating what this mysterious � character is that is replacing your apostrophes. Start by using FerranB's recommended DUMP().
Try running the following query to get the character code:
SELECT DUMP(<column with weird character>, 1016)
FROM <your table>
WHERE <column with weird character> like '%spouse%';
If that doesn't grab your actual text from the database, you'll need to modify the WHERE clause to actually grab the offending column.
Once you've found the code for the character, you could replace it using the REGEXP_REPLACE() built-in function: determine the raw code of the character and substitute the ASCII / C0 Controls and Basic Latin character 0x0027 ('), using code similar to this:
UPDATE <table>
SET <column with offending character>
    = REGEXP_REPLACE(<column with offending character>,
                     '<character code of �>',
                     '''')
WHERE REGEXP_LIKE(<column with offending character>, '<character code of �>');
If you aren't familiar with Unicode and different ways of character encoding, I recommend reading Joel's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). I wasn't until I read that article.
EDIT: If you're seeing 0x92, there's likely a charset mismatch here:
0x92 in CP-1252 (the default Windows code page) is a right single quotation mark, the "smart" apostrophe. This code isn't a valid ASCII character, and it isn't valid in ISO-8859-1 either. So probably either the database is in CP-1252 encoding (I don't find that likely), or a database connection that spoke CP-1252 inserted it, or somehow the apostrophe got converted to 0x92. The database is returning values that are valid in CP-1252 (or some other charset where 0x92 is valid), but your db client connection isn't expecting CP-1252. Hence the weird question mark.
And FerranB is likely right. I would talk with your DBA or some other admin about this to get the issue straightened out. If you can't, I would try either doing the update above (seems like you can't), or doing this:
INSERT INTO <table> (<normal table columns>, ..., <column with offending character>)
SELECT <all normal columns>,
       REGEXP_REPLACE(<column with offending character>,
                      CHR(146),   -- 0x92, the offending character
                      '''')       -- the ASCII/ISO-8859-1 apostrophe
FROM <table>
WHERE REGEXP_LIKE(<column with offending character>, CHR(146));
DELETE FROM <table> WHERE REGEXP_LIKE(<column with offending character>, CHR(146));
Before you do this you need to understand what actually happened. It looks to me as if someone inserted non-ASCII strings into the database, for example Unicode or UTF-8. Before you fix it, be very sure that this is actually a bug; the apostrophe comes in many forms, not just "'".
TRANSLATE() is a useful function for replacing or eliminating known single character codes.
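For example, a hedged sketch for the case above, mapping the Windows-1252 right single quote (byte 0x92, i.e. CHR(146)) to a plain apostrophe in the query results only:
SELECT TRANSLATE(FN, CHR(146), '''') FROM TN;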