MySQL Convert latin1 data to UTF8 - sql

I imported some data using LOAD DATA INFILE into a MySQL database. The table itself and its columns use the UTF8 character set, but the default character set of the database is latin1. Because of that default, and because I used LOAD DATA INFILE without specifying a character set, MySQL interpreted the file as latin1 even though the data in the file was UTF8. Now I have a bunch of badly encoded data in my UTF8 column. I found this article, which addresses a similar problem ("UTF8 inserted in cp1251"), but my problem is "Latin1 inserted in UTF8". I've tried editing the queries there to convert the latin1 data to UTF8, but can't get it to work: the data either comes out the same or even more mangled than before. Just as an example, the word Québec is showing as QuÃ©bec.
[ADDITIONAL INFO]
When Selecting the data wrapped in HEX(), Québec has the value 5175C383C2A9626563.
The Create Table (shortened) of this table is.
CREATE TABLE MyDBName.`MyTableName`
(
`ID` INT NOT NULL AUTO_INCREMENT,
.......
`City` CHAR(32) NULL,
.......
`)) ENGINE InnoDB CHARACTER SET utf8;

I've had cases like this in old WordPress installations, where the data itself was already UTF-8 inside a latin1 database (due to the WP default charset). This means the data did not really need converting; only the database and table definitions did.
In my experience things get messed up when doing the dump, because as I understand it MySQL will use the client's default character set, which in many cases is now UTF-8.
Therefore it is very important to export with the same character set the database is declared in. For a latin1 database holding UTF-8 data:
$ mysqldump --default-character-set=latin1 --databases wordpress > m.sql
Then replace the Latin1 references within the exported dump before reimporting to a new database in UTF-8. Sort of:
$ replace "CHARSET=latin1" "CHARSET=utf8" \
"SET NAMES latin1" "SET NAMES utf8" < m.sql > m2.sql
In my case this link was of great help.
Commented here in spanish.

Though this is probably no longer relevant for the OP, I happened to find a solution in the MySQL documentation for ALTER TABLE. I post it here for future reference:
Warning
The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
The reason this works is that there is no conversion when you convert to or from BLOB columns.
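The difference between converting and relabelling can be seen at the byte level. The following is a minimal Python sketch, not MySQL itself: the function names are made up for illustration, and Python's "latin-1" codec stands in for MySQL's latin1, which is assumed to behave the same for the characters used here.

```python
def convert_charset(stored: bytes, declared: str, target: str) -> bytes:
    """Decode with the declared charset, then re-encode: the bytes change."""
    return stored.decode(declared).encode(target)


def relabel(stored: bytes) -> bytes:
    """BLOB-style round trip: the bytes stay untouched, only the label changes."""
    return bytes(stored)


# UTF-8 bytes sitting in a column that is declared latin1.
stored = "Québec".encode("utf-8")

converted = convert_charset(stored, "latin-1", "utf-8")
relabelled = relabel(stored)

print(converted.hex().upper())      # 5175C383C2A9626563
print(converted.decode("utf-8"))    # QuÃ©bec
print(relabelled.decode("utf-8"))   # Québec
```

The converted bytes match the HEX() value quoted in the question, which suggests the same kind of double encoding happened there.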

LOAD DATA INFILE lets you specify, via its CHARACTER SET clause, the encoding the file is in:
http://dev.mysql.com/doc/refman/5.1/en/load-data.html

I wrote http://code.google.com/p/mysqlutf8convertor/ to convert a Latin database to a UTF-8 database. It changes all tables and fields to UTF-8.

Converting latin1 to UTF8 is not what you want to do; you kind of need the opposite.
If what really happened was this:
UTF-8 strings were interpreted as Latin-1 and transcoded to UTF-8, mangling them.
You are now, or could be, reading UTF-8 strings with no further interpretation.
What you must do now is:
Read the "UTF-8" with no transcode.
Convert it to Latin-1. Now you should actually have the original UTF-8.
Now put it in your "UTF-8" column with no further conversion.
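The three steps above can be sketched in Python. This is only an illustration of the byte manipulation, under the assumption that every mangled character falls in the ISO-8859-1 range; the function name is hypothetical and nothing here talks to MySQL.

```python
def repair_double_encoded(text: str) -> str:
    """Undo one round of 'UTF-8 read as Latin-1, then stored as UTF-8'."""
    # Step 2: converting to Latin-1 yields the original UTF-8 bytes.
    original_bytes = text.encode("latin-1")
    # Step 3: interpret those bytes as UTF-8 with no further conversion.
    return original_bytes.decode("utf-8")


# Step 1: read the stored bytes as UTF-8 with no transcoding.
# The hex value is the HEX() output quoted in the question.
mangled = bytes.fromhex("5175C383C2A9626563").decode("utf-8")

print(mangled)                         # QuÃ©bec
print(repair_double_encoded(mangled))  # Québec
```

Text that was never double-encoded usually raises UnicodeEncodeError or UnicodeDecodeError here instead of being silently changed, which makes it a reasonable check before rewriting rows. In MySQL the same three steps are often written as CONVERT(CAST(CONVERT(City USING latin1) AS BINARY) USING utf8); that expression is not taken from the answers above, so try it on a copy of the table first.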

I recently completed a shell script that automates the conversion process. It can also be configured with custom filters for any text you wish to replace or remove, for example stripping HTML characters. Table whitelists and blacklists are also possible. You can download it at SourceForge: https://sourceforge.net/projects/mysqltr/

Try this:
1) Dump your DB
mysqldump --default-character-set=latin1 -u username -p databasename > dump.sql
2) Open dump.sql in a text editor and replace all occurrences of "SET NAMES latin1" with "SET NAMES utf8"
3) Create a new database and restore your dumpfile
cat dump.sql | mysql -u root -p newdbname
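If the dump is too large for a text editor, step 2 can be scripted. Below is a rough Python sketch that works on raw bytes so the data itself is never re-encoded; the function name and the sample dump content are invented for illustration.

```python
def relabel_dump(dump: bytes, old: bytes = b"latin1", new: bytes = b"utf8") -> bytes:
    """Rewrite the SET NAMES statements in a dump without touching the data bytes."""
    return dump.replace(b"SET NAMES " + old, b"SET NAMES " + new)


# Made-up two-line dump: a SET NAMES header and one row holding UTF-8 bytes.
sample = b"/*!40101 SET NAMES latin1 */;\nINSERT INTO t VALUES ('Qu\xc3\xa9bec');\n"
fixed = relabel_dump(sample)

print(fixed.decode("utf-8"))
```

To use it on a real file, read the file in binary mode ("rb") and write the result in binary mode ("wb"). Note that a plain replace also changes the string if it happens to appear inside row data.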

Related

SQL encoding after restore from backup (mariadb, unix)

I need to restore a SQL table from daily backup but there are problems with encoding. Backup is made by virtualmin, encoding set to "default". Texts are in French language, so with accents...
Here is the dump of the webmin backup file:
For the table (wordpress table, interesting fields are:)
I need to insert a part of this table into the live table (after some deletion of lines..). So the table is already created with
Default collation UTF8mb4_unicode_ci
When I import the table lines into the table, text is not "converted" into the right charset. For example the French "é" shows up as "Ã©". And so on.
I tried a few things, such as adding SET commands for utf8mb4 before the INSERT, but nothing works; the encoding is never right. Text in the database itself shows "Ã©" instead of "é", and of course the same when displayed in a browser.
Any suggestion? Thank you!

How to load special characters (non-English letters) in SQL Loader

Some of my developer_id values are in a foreign language (special characters). I googled how to handle those characters, and what people suggested was using
NVARCHAR2()
or use:
INSERT INTO table_name VALUES (N'你好');
However, I used NVARCHAR2() in stage and on all the tables but still doesn't work for me (the original datatype for developer_id was VARCHAR2()). Also, the insert statement with N at the beginning is not working for SQL Loader I think.
What should I do?
Here is where the problem shows:
Here is my ctl. file
Here is the datatype for all the data in the flat file:
The character set for the flat file is UTF-8. I thought I have successfully solved this problem by changing my Encoding when pre-loading the data to stage table, but the same problem still shows up when I finished importing my data to stage.

pgAdmin4: Importing a CSV

I am trying to import a CSV using pgAdmin4. I created the table using the query,
CREATE TABLE i210_2017_02_18
(
PROBE_ID character varying(255),
SAMPLE_DATE timestamp without time zone,
LAT numeric,
LON numeric,
HEADING integer,
SPEED integer,
PROBE_DATA_PROVIDER character varying(255),
SYSTEM_DATE timestamp without time zone
)
The header and first line of my CSV are:
PROBE_ID,SAMPLE_DATE,LAT,LON,HEADING,SPEED,PROBE_DATA_PROVIDER,SYSTEM_DATE
841625st,2017-02-18 00:58:19,34.11968,-117.80855,91.0,9.0,FLEET53,2017-02-18 00:58:58
When I try to use the import dialogue, the process fails with Error Code 1:
ERROR: invalid input syntax for type timestamp: "SAMPLE_DATE"
CONTEXT: COPY i210_2017_02_18, line 1, column sample_date: "SAMPLE_DATE"
Nothing seems wrong to me - any ideas?
According to your table structure, this import will fail in the columns HEADING and SPEED, since their values have decimals and you declared them as INTEGER. Either remove the decimals or change the column type to e.g. NUMERIC.
Having said that, just try this from pgAdmin (considering that file and database are in the same server):
COPY i210_2017_02_18 FROM '/home/jones/file.csv' CSV HEADER;
In case you're dealing with a remote server, try this using psql from your console:
$ cat file.csv | psql yourdb -c "COPY i210_2017_02_18 FROM STDIN CSV HEADER;"
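Both problems, the header line being read as data and the decimal values in the INTEGER columns, can be reproduced outside the database. The sketch below uses Python converters as stand-ins for the column types in the question's CREATE TABLE; they are not PostgreSQL's parsers, and the function name is hypothetical.

```python
import csv
import io
from datetime import datetime
from decimal import Decimal


def parse_timestamp(value: str) -> datetime:
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


# Python stand-ins for the column types declared in the question.
COLUMNS = [
    ("PROBE_ID", str),
    ("SAMPLE_DATE", parse_timestamp),
    ("LAT", Decimal),
    ("LON", Decimal),
    ("HEADING", int),
    ("SPEED", int),
    ("PROBE_DATA_PROVIDER", str),
    ("SYSTEM_DATE", parse_timestamp),
]

# Header and first data line, copied from the question.
SAMPLE = (
    "PROBE_ID,SAMPLE_DATE,LAT,LON,HEADING,SPEED,PROBE_DATA_PROVIDER,SYSTEM_DATE\n"
    "841625st,2017-02-18 00:58:19,34.11968,-117.80855,91.0,9.0,FLEET53,2017-02-18 00:58:58\n"
)


def first_bad_field(csv_text: str, header: bool):
    """Return (file line, column, value) of the first field that fails to parse."""
    rows = csv.reader(io.StringIO(csv_text))
    first_line = 1
    if header:
        next(rows)  # skip the header line, like the HEADER option of COPY
        first_line = 2
    for line_no, row in enumerate(rows, start=first_line):
        for (name, convert), value in zip(COLUMNS, row):
            try:
                convert(value)
            except (ValueError, ArithmeticError):
                return line_no, name, value
    return None


print(first_bad_field(SAMPLE, header=False))  # (1, 'SAMPLE_DATE', 'SAMPLE_DATE')
print(first_bad_field(SAMPLE, header=True))   # (2, 'HEADING', '91.0')
```

Without the header option the first failure is the same one reported in the error message; with it, the next failure is the HEADING value.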
You can also check this answer.
In case you really want to stick to the pgAdmin import tool, which I discourage, just select the Header option and the proper Delimiter:
Have you set the Header option to TRUE? That should work.
Step 1: Create a table.
You can use a query or the dashboard to create the table.
Step 2: Create the same number of columns as the CSV file has.
I would recommend creating the columns using the dashboard.
Step 3: Click on your table_name in pgAdmin; you will see an option for import/export.
Step 4: Provide the path of your CSV file, and remember to choose comma as the delimiter.

Registered Symbol not getting inserted as-is in table

I am working on Oracle 10gR2.
The character set for DB is as below:
NLS_NCHAR_CHARACTERSET AL16UTF16
NLS_CHARACTERSET AL32UTF8.
I am getting data to be processed in TXT files. The first step in processing this data is creating external tables based on these flat files. One of the fields (and the columns in DB) in the flat file has String data, which contains ® (registered symbol). This character is visible in the txt file, but when I check the external table, the character is saved as �
I have modified the encoding of the IDE to UTF-8, where I am seeing the output of the query.
The data type for the column is: COL NVARCHAR2(1000)
Please suggest as to what could be causing this?
Generally this is caused by an incorrect setting of the NLS_LANG environment variable. NLS_LANG must tell Oracle the encoding you are using for your data. If NLS_LANG is unset, Oracle assumes ASCII text (and your symbol is non-ASCII).
If your data is UTF-8, try:
NLS_LANG=.AL32UTF8
For windows/iso try
NLS_LANG=.WE8ISO8859P15
You NEED to determine the encoding of your text file first. Use a hex editor to determine whether the (R) symbol is UTF-8 or not.
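As an alternative to a hex editor, a few lines of Python can show the bytes and test whether they are valid UTF-8. This is a small sketch; the helper name is made up, and a file that passes the check is only probably UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The registered symbol in the two encodings mentioned above.
utf8_bytes = "®".encode("utf-8")
latin9_bytes = "®".encode("iso-8859-15")

print(utf8_bytes.hex(), is_valid_utf8(utf8_bytes))      # c2ae True
print(latin9_bytes.hex(), is_valid_utf8(latin9_bytes))  # ae False
```

For a real file, read it in binary mode and pass the bytes to the function; a lone AE byte points to a single-byte encoding, while C2 AE points to UTF-8.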

bcp and backspace (^H) delimiter

I need to parse a flat file that uses the backspace (^H) character as the delimiter between fields, and insert the data into SQL Server 2005 tables. I tried to use the bcp utility along with a format file, but I wasn't able to specify backspace as the delimiter.
The default delimiter is tab (\t). There are several other delimiters as well, but none to specify backspace. If anyone has any ideas, please help.
I also need to export data from a SQL Server table to a fixed-length flat file. I tried to use a non-XML format file, but it always asks for a delimiter. How can I create a flat file using bcp without any delimiter between the fields?
All above are character files.
This is an ugly workaround, but you could always find a character that does not occur in the flat file, replace every backspace in the file with it, and then use that character as the column terminator (using bcp -t with that character).
Sorry that I'm almost 11 years late on this, hopefully you've already solved your problem but you can use the hexadecimal representation of the backspace character 0x08 to parse your input file and properly delimit your fields which are separated with a backspace character.
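To illustrate what splitting on 0x08 looks like, here is a short Python sketch. It runs outside bcp entirely, and the sample record and function name are invented.

```python
BACKSPACE = "\x08"  # the ^H control character, hexadecimal 0x08


def split_record(line: str) -> list[str]:
    """Split one line of the flat file into fields on the backspace character."""
    return line.rstrip("\r\n").split(BACKSPACE)


# Invented record with three backspace-separated fields.
record = "1001\x08Alice\x08Paris\n"

print(split_record(record))  # ['1001', 'Alice', 'Paris']
```

The same idea could be used to pre-process the file into a tab-delimited one before loading, provided the data contains no tabs.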