Character encoding issues in MySQL

In my database we have fields where the data is not readable. I now know why it happened but I don't know how to fix it.
I found a way to get the info back from the database:
SELECT id,
name
FROM projects
WHERE LENGTH(name) != CHAR_LENGTH(name);
One of the rows returned shows:
id | name
-------------------------
1008 | Cajón el Diablo
This should be:
id | name
-------------------------
1008 | Cajón el Diablo
Can somebody help me figure out how to fix this problem? How can I convert this using SQL? And if SQL can't do it, can Python?

Your MySQL data is most likely UTF-8 encoded.
The tool or client you are viewing the data with is either:
- not talking to the MySQL server in UTF-8 (SET NAMES utf8), or
- outputting UTF-8 characters in an environment whose encoding is different from UTF-8 (e.g. a web page encoded in ISO-8859-1).
You need to either specify the correct character set when connecting to the MySQL database, or convert the incoming characters so they can be output correctly.
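For example, a minimal sketch of the first option, assuming your client library honors session character-set commands (utf8mb4 shown; plain utf8 also works on older servers):
-- Tell the server this session sends and expects utf8mb4
SET NAMES utf8mb4;
-- Verify what the session is now using
SHOW VARIABLES LIKE 'character_set%';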
For more information, you would have to tell us what collation your database and tables are in, and what you are using to look at the data.
If you want to get into the basics of this, this is very good reading: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
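And if inspection shows the rows are genuinely double-encoded (UTF-8 bytes that were re-encoded as if they were latin1), a common MySQL repair pattern is a binary round-trip. This is a sketch only, reusing the question's projects table; eyeball the repaired column with a SELECT before attempting any UPDATE:
-- Reinterpret the stored text as latin1 bytes, then re-read those bytes as utf8mb4
SELECT id,
       name,
       CONVERT(BINARY CONVERT(name USING latin1) USING utf8mb4) AS repaired
FROM projects
WHERE LENGTH(name) != CHAR_LENGTH(name);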

Related

Text format in Databricks (Azure) to CSV

I have special characters and foreign-language text in my data set.
When I run a SQL query (SELECT * FROM table1), the results are fine; I get them in the required format.
E.g. output of the SQL query:
1 | 请回复邮件
2 | Don’t know
When the same data is exported to a CSV file on my local machine, the text changes to weird symbols.
Exported data:
1 | 请回å¤é‚®
2 | Don’t know
How do I get the same format to CSV as in SQL?
Ensure that UTF-8 encoding is used instead of ISO-8859-1/Windows-1252 encoding in the browser and editor.
It's better to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci in the future. (A revised version of the Unicode collation is in the works.)
utf8mb4 is a superset of utf8, as it can support 4-byte utf8 codes, which are required by Emoji and some Chinese characters.
"UTF-8" outside of MySQL refers to all size encodings, thus it's basically the same as MySQL's utf8mb4, not utf8.
The data can't be trusted whether viewed with a tool or with SELECT. Too many of these clients, particularly browsers, try to compensate for incorrect encodings by displaying proper text even though the database is messed up. So pick a table and column with some non-English content and work from there.
SELECT col, HEX(col) FROM tbl WHERE ...;
For correctly stored UTF-8, the HEX will be (worked example after this list):
- 20 for a space (in any language)
- 4x, 5x, 6x, or 7x for English
- Cxyy for accented letters in much of Western Europe
- Dxyy for Cyrillic, Hebrew, and Farsi/Arabic
- Exyyzz for most of Asia
- F0yyzzww for emoji and some Chinese characters
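As a worked example of the Cxyy rule (hypothetical table and column names; adapt them): correctly stored UTF-8 for 'Cajón' hexes to 43 61 6A C3 B3 6E, where C3B3 is the two-byte encoding of ó:
SELECT name, HEX(name) FROM tbl WHERE name = 'Cajón';
-- expected: 43616AC3B36E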
There are a few fixes, depending on the case: see "Fixes for various Cases".
After downloading the data from Databricks, open the CSV in Notepad, choose Save As, and select UTF-8 under the Encoding option.

Why am I getting results when querying a column with LIKE '%text%' while the data is actually in another column?

With Firebird 2.5.8 and a table with a dozen blob fields, I see this weird behavior when querying this way:
SELECT *
FROM TABLE
WHERE BLOBFIELD4 LIKE '%SOMETEXT%'
and I get results even though SOMETEXT is actually in a different column and not in BLOBFIELD4 (this happens with every blob column).
What am I missing?
Thanks for the data. I made a few quick tests using the latest IB Expert with Firebird 2.5.5 (what I had on hand).
It seems that you actually have much more data than you might think you have.
First of all, it is a bad, dangerous practice to keep text data in columns marked as CHARSET NONE! Make sure that your columns are marked with some reasonable charset, like Windows-1250 or UTF8. Also make sure that the very CONNECTION of all your applications (including development tools) to the database server has an explicitly defined character set that suits your textual data.
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
Or, if you want those BLOBs to be seen as binary, then explicitly create them as SUB_TYPE BINARY, not SUB_TYPE TEXT.
However, here is the simple script to run on your database.
alter table comm
add NF_VC VARCHAR(4000) CHARACTER SET UTF8,
add NF_BL BLOB SUB_TYPE 1 SEGMENT SIZE 4096 CHARACTER SET UTF8
then
update comm
set nf_vc = '**' || com1 || '**'
then
update comm
set nf_bl = '##' || nf_vc || '##'
Notice that I intentionally force Firebird to do the conversion BLOB -> VARCHAR -> BLOB, just to be on the safe side.
Now check some data.
select id_comm, nf_vc
from comm where
nf_vc containing 'f4le dans 2 ans'
and
select id_comm, nf_bl
from comm where
nf_bl containing 'f4le dans 2 ans'
What do you see now?
In the first screenshot we see the mystery itself: the row is selected, but we cannot see the search pattern, "f4le dans 2 ans", in it.
BUT !!!
Can you see the marks, the double asterisks, the ** ?
Yes, you can, at the beginning! But you cannot see them at the end!
That means you DO NOT see the whole text, but only some first part of it!
In the second screenshot you see the very same row, ID=854392, but re-converted back to BLOB and additionally marked with ## at both ends.
Can you see the marks on both start and end?
Can you see your search pattern?
Yes and yes - if you look at the grid row (white).
No and no, if you look at the tooltip (yellow).
So, again, the data you search for DOES exist. You just fail to see it for some reason.
Now, what might be a typical reason for a string not being displayed completely?
It can be a zero-valued byte (or several such bytes, a UNICODE codepoint), the way the C language marks the end of a string, a convention that is widely used in Windows and in many libraries and programs. Or maybe some other unusual value (EOF, EOT, -1, etc.) that makes the programs you use falsely detect the end of the text where it has not actually ended yet.
Look at the two screenshots again: where do the lines start to differ? It is after \viewkind4 ... \par} and before pard. Notice the weird anomaly: that pard should start with a backslash (\) to be a valid RTF command. But it is instead preceded by something invisible, something blank. What can it be?...
Let us go back to your original query in your comments.
Also, it is bad practice to put important details into comments! They are hard to find there for anyone who was not following the story from the very start, and the more comments are added, the harder it gets. The proper avenue would have been to EDIT the question, adding the new data into the question body, and then adding a comment (for notification's sake) saying the question was edited. Please add new data that way in the future.
select id_comm, COM1
from comm where
COM1 containing 'f4le dans 2 ans'
At first glance our fishing ended with nothing: we see text that does not contain your pattern, ending at that very \par}.
But is it so? Switch into binary view, and....
Voila! What is there just before the found-lost-found-again pard? That very ZERO BYTE I talked about earlier.
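If you want to hunt for such zero bytes directly in SQL rather than in a binary viewer, a sketch along these lines may work; POSITION and ASCII_CHAR are Firebird 2.1+ built-ins, but do verify on your 2.5.8 server that they behave on text BLOBs as expected:
-- find rows whose BLOB contains an embedded NUL byte
select id_comm
from comm
where position(ascii_char(0) in com1) > 0;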
So, what happened, to wrap it up:
Firebird is correct: the data is found because the data really is there, in the BLOBs.
Your applications reading the data are not correct. Confused by the zero byte, they show you only part of the data, not all of it.
Your application writing the data might also be incorrect. Or the data itself.
How did that zero byte end up there? Why was the RTF structure corrupt, lacking the backslash before pard? Was the data size you passed to the server when inserting larger than it should have been, passing some garbage after the meaningful data? Or was the data size correct, but the data contents corrupt before inserting?
Something is fishy there. I do not think the RTF specification explicitly prohibits a zero byte, but having one is very untypical, because it triggers bugs like this in way too many applications and libraries.
P.S. The design of a table with MANY columns of BLOB type seems poor.
"Wide" tables often lead to problems in future development and maintenance.
While it is not the essence of your question, please do think about remaking this table into a narrow one, saving your data as a number of one-BLOB rows.
It will cost you some fixed extra work now, but will probably save you from snowballing problems in the future.

I need generic PL/SQL code to convert special characters into standard ASCII characters, e.g. ÜNLÜ -> UNLU, JÓNÁS -> JONAS, etc.

I tried writing a script, but the special characters get changed to question marks.
Thanks to @Wernfried Domscheit for pointing out the flaws in my answer that might cause it not to work for you. I have now edited my answer to address those issues.
Firstly, in order to see and enter the accented characters, you need to have your client system working in a character set that supports these characters. US ASCII 7-bit does not support accented characters. (Explanation here.)
UTF-8 is now by far the most popular character set on the internet and is becoming more popular in commercial systems, because it does support just about every character system on the planet. Other character sets that support accented characters include the Windows-12xx family and the ISO-8859 family. If you can tell us more about the client system (Windows? Mac? UNIX?) and the application you are using to access the database, we can be more specific.
I can reproduce the symptoms of your problem and solve it in my case.
First of all I check the server characterset:
select * from nls_database_parameters where parameter like '%CHARACTERSET%';
PARAMETER VALUE
---------------------- ----------
NLS_CHARACTERSET AL32UTF8
NLS_NCHAR_CHARACTERSET AL16UTF16
So varchar2 columns will be encoded in UTF-8 on my server.
I'm running Oracle on Linux with $LANG=en_US.UTF-8 on my client. I can confuse the client by setting $NLS_LANG (for the client) to use an ISO-8859 character set:
$ export NLS_LANG=ENGLISH_AMERICA.WE8ISO8859P1
Then in SQL*Plus I select a varchar2 column:
select word from test;
and the result is:
WORD
--------------------------------
�B�D�FGH�J
The question marks (actually the "unknown character" replacement glyph) highlight the mismatch between the characters I selected and what I told the client to expect.
If, at the operating system prompt, I set $NLS_LANG to match the client set-up, like this:
$ export NLS_LANG=ENGLISH_AMERICA.AL32UTF8
and run exactly the same query on the same data in SQL*Plus:
select word from test;
the result is:
WORD
--------------------------------
ÁBÇDÉFGHÍJ
If your server is storing accented characters correctly then it must be using a character set that supports them (examples above). Your client also needs to support a character set that can handle accented characters, and your NLS_LANG setting needs to match what the client can support. How you do that will depend on what client system you are using.
When you have a client that can display, and allow you to enter, accented characters, then you can solve your original problem. You don't need a PL/SQL function to do the conversion, you simply use the Oracle translate function, like this:
select word, translate(word, 'ÁÇÉÍ', 'ACEI') as no_accents from test;
WORD NO_ACCENTS
---------- ----------
ÁBÇDÉFGHÍJ ABCDEFGHIJ
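If listing every accented letter in translate becomes unwieldy, a more generic alternative is Oracle's convert function targeting the 7-bit US7ASCII character set, which maps many Western European accented letters to their base letters. This is a sketch only; the exact substitutions depend on your database character set and version, so verify the output on your data:
-- map accented characters to their closest US7ASCII equivalents
select word, convert(word, 'US7ASCII') as no_accents from test;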

Storing and returning emojis

What's the simplest way to write and, then, read Emoji symbols in Oracle table?
Currently I have this situation:
The iOS client passes percent-encoded emojis: One%20more%20time%20%F0%9F%98%81%F0%9F%98%94%F0%9F%98%8C%F0%9F%98%92. For example, %F0%9F%98%81 means 😁;
The column type is NVARCHAR2(2000), so when I view the saved text via Oracle SQL Developer it looks like: One more time ????????.
This seems more a client problem than a database problem. Certain iOS programs are capable of interpreting that string and showing an image instead of the raw string.
SQL Developer does not do that.
As long as the data stored in the database is the same as the data retrieved from the database, you have no problem.
In the end, we do BASE64 encoding/decoding of the text. It's suitable for small texts.
In MySQL the character set needs to support 4-byte characters (utf8mb4, not plain utf8) to be able to save emojis; I assume Oracle would need the same kind of change.
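A quick way to check, on the Oracle side, whether the national character set can represent emoji at all: AL16UTF16 stores supplementary characters as surrogate pairs, so UNISTR with the pair for U+1F601 should come back as 😁. A sketch, assuming an Oracle client configured for Unicode display:
-- \D83D\DE01 is the UTF-16 surrogate pair for U+1F601 (grinning face)
select unistr('\D83D\DE01') as emoji from dual;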

Informix 7.3 isql insert statement - text/blob/clob field insert error

Is there a way around this?
I am trying to insert some data into a table whose structure is:
Column name   Type       Nulls
crs_no        char(12)   no
cat           char(4)    no
pr_cat        char(1)    yes
pr_sch        char(1)    yes
abstr         text       yes
The type of the last field reads 'text', but when trying to insert into this table, I get this error:
insert into crsabstr_rec values ("COMS110","UG09","Y","Y","CHEESE");
617: A blob data type must be supplied within this context.
Error in line 1
Near character position 66
So this field is apparently some sort of blob, but it won't take inserts (or updates). Normally these records are inserted through a GUI form, and C code handles the insertions.
There are no blob (BYTE or TEXT) literals in Informix Dynamic Server (IDS) - nor for CLOB or BLOB types in IDS 9.00 and later. It is an ongoing source of frustration to me; I've had the feature request in the system for years, but it never reaches the pain threshold internally that means it gets fixed -- other things get given a higher priority.
Nevertheless, it bites people all the time.
In IDS 7.3 (which you should aim to upgrade from - it goes out of service in September 2009 after a decade or so), you are pretty much stuck with using C to get the data into a TEXT field of the database. You have to use the approved C type 'loc_t' to store the information about the BYTE or TEXT data, and pass that to the server.
If you need examples in ESQL/C, look at the International Informix User Group web site, and especially the Software Repository. Amongst other things, you'll find the original SQLCMD program (Microsoft's program of the same name is a Johnny-Come-Lately) in source form. It also includes a set of programs that I dub 'vignettes'; they manipulate blobs in various ways, and are designed to show how to use 'loc_t' structures in various scenarios.
In isql:
LOAD FROM "desc.txt" INSERT INTO crsabstr_rec;
3 row(s) loaded.
desc.txt is a pipe-delimited (|) text file, and the number of fields in the text file has to match the number of fields in the table.
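For illustration, here is what one row of desc.txt might look like for this table, reusing the values from the failed INSERT above; note that Informix's LOAD format terminates every field, including the last, with the delimiter:
COMS110|UG09|Y|Y|CHEESE|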