Hive Supporting Unicode Characters - hive

Is there any SerDe available to support Hive tables with Unicode characters? We might have files in UTF-8, UTF-16, or UTF-32. In other words, we are looking to support different languages such as Japanese and Chinese in Hive tables, and we should be able to load data in those languages into Hive tables.

Hive reads and writes UTF-8 text files by default.
For other character sets, specify the encoding through the SerDe when creating the table:
hive> CREATE TABLE mytable (column_name data_type) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ("serialization.encoding" = 'FORMAT');
Alternatively, the files themselves can be converted to UTF-8 using iconv, but it only supports files smaller than 16 GB.
Syntax:
iconv -f <from-encoding> -t <to-encoding> inputfile > outputfile
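As a concrete sketch (the table name, columns, and the SJIS encoding below are assumptions for illustration, not taken from the question), a Shift_JIS-encoded text file could be read directly by pointing the SerDe at that encoding:
-- Hypothetical table reading Shift_JIS text files without converting them first
CREATE TABLE sjis_sales (id INT, customer_name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ("serialization.encoding" = 'SJIS')
STORED AS TEXTFILE;
The SerDe should then decode the rows using that character set when the table is queried.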

Related

Japanese ANSI character in CSV file

I have a CSV file generated from a Japanese source system. The Japanese characters are shown as below: ¬¼ˆã—Ê튔Ž®‰ïŽÐ ‘åã‰c‹ÆŠ. I have changed the file type to UTF-8 and also updated the ETL setting to incorporate that, but that works on new data only.
How can I change the existing data in my table, which shows characters like ‘åã‰c‹ÆŠ?
Is it possible to get the original Japanese characters back using SQL functions? I am using SQL Server as the database.
Thanks in advance.

How can I use a SerDe to build generic file ingestion into Hive?

I need to build generic file ingestion into Hive. The files are very large (2 GB+) and can be fixed-width or comma-separated, in ASCII or EBCDIC. After trying various techniques using Talend, I am looking into SerDes. If I ingest the files as-is and use a schema file (containing ordinal position, column name, type, and length), can I create a custom SerDe to deserialize any input file into Hive rows? How performant would it be?
Since asking this question, I found that I could use a custom COBOL SerDe.
I am also looking at the regex SerDe for positional files.
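For the regex route, here is a minimal sketch of the built-in RegexSerDe applied to a fixed-width (positional) file; the table name, column widths, and location are assumptions, and note that RegexSerDe requires all columns to be STRING:
-- Hypothetical fixed-width layout: 10-char name, 5-char quantity, 8-char price
CREATE EXTERNAL TABLE fixed_width_demo (name STRING, qty STRING, price STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.{10})(.{5})(.{8}).*")
STORED AS TEXTFILE
LOCATION '/data/fixed_width_demo';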

Hive ORC File Format

When we create an ORC table in Hive, we can see that the data is compressed and not directly readable in HDFS. So how is Hive able to convert that compressed data into the readable format that is shown to us when we fire a simple SELECT * query on that table?
Thanks for suggestions!
By using the ORC SerDe while creating the table. You have to provide the package name for the SerDe class:
ROW FORMAT SERDE '<serde class name>'.
What a SerDe does is deserialize data in a particular format into objects that Hive can process, and then serialize those objects to store them back in HDFS.
Hive uses a SerDe (Serializer/Deserializer) to do that. When you create the table you mention the file format, e.g. in your case it's ORC ("STORED AS ORC"), right? Hive uses the ORC library (a JAR file) internally to convert the data into a readable format. To learn more about Hive internals, search for "Hive SerDe" and you will see how the data is converted to objects and vice versa.
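For example, a minimal ORC table (the names here are hypothetical) looks like the sketch below; the SELECT works on the compressed files because Hive routes the read through the ORC SerDe/reader:
-- Data files under this table are stored compressed in the ORC format
CREATE TABLE orders_orc (order_id INT, customer STRING)
STORED AS ORC;
-- Hive decompresses and deserializes the ORC data transparently
SELECT * FROM orders_orc;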

Registered Symbol not getting inserted as-is in table

I am working on Oracle 10gR2.
The character set for DB is as below:
NLS_NCHAR_CHARACTERSET AL16UTF16
NLS_CHARACTERSET AL32UTF8.
I am getting data to be processed in TXT files. The first step in processing this data is creating external tables based on these flat files. One of the fields (and the corresponding column in the DB) in the flat file contains string data which includes ® (the registered symbol). This character is visible in the txt file, but when I check the external table, the character is saved as �.
I have set the encoding of the IDE in which I view the query output to UTF-8.
The data type for the column is: COL NVARCHAR2(1000)
Please suggest what could be causing this.
Generally this is caused by an incorrect setting of the NLS_LANG environment variable. NLS_LANG must tell Oracle the encoding you are using for your data. If NLS_LANG is unset, Oracle assumes ASCII text (and your symbol is non-ASCII).
If your data is UTF-8, try:
NLS_LANG=.AL32UTF8
For Windows/ISO, try:
NLS_LANG=.WE8ISO8859P15
You NEED to determine the encoding of your text file first. Use a hex editor to determine whether the ® symbol is encoded as UTF-8 or not.
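One way to check from SQL rather than a hex editor is Oracle's DUMP function; this is a sketch with a hypothetical table and column name. In UTF-8 the ® symbol is the two bytes C2 AE, while in ISO-8859-1/15 it is the single byte AE:
-- Format 1016 prints the byte values in hex together with the column's character set
SELECT col_with_symbol, DUMP(col_with_symbol, 1016) FROM my_external_table WHERE ROWNUM <= 10;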

MySQL Convert latin1 data to UTF8

I imported some data using LOAD DATA INFILE into a MySQL database. The table itself and the columns use the UTF8 character set, but the default character set of the database is latin1. Because the default character set of the database is latin1, and I used LOAD DATA INFILE without specifying a character set, it interpreted the file as latin1, even though the data in the file was UTF8. Now I have a bunch of badly encoded data in my UTF8 column. I found this article which seems to address a similar problem, which is "UTF8 inserted in cp1251", but my problem is "latin1 inserted in UTF8". I've tried editing the queries there to convert the latin1 data to UTF8, but can't get it to work. Either the data comes out the same, or even more mangled than before. Just as an example, the word Québec is showing as QuÃ©bec.
[ADDITIONAL INFO]
When selecting the data wrapped in HEX(), the Québec value comes out as 5175C383C2A9626563.
The CREATE TABLE statement (shortened) for this table is:
CREATE TABLE MyDBName.`MyTableName`
(
`ID` INT NOT NULL AUTO_INCREMENT,
.......
`City` CHAR(32) NULL,
.......
) ENGINE=InnoDB CHARACTER SET utf8;
I've had cases like this in old WordPress installations where the problem was that the data itself was already UTF-8 within a latin1 database (due to the WP default charset). This means there was no real need to convert the data, only the database and table formats.
In my experience things get messed up when doing the dump, as I understand MySQL will use the client's default character set, which in many cases is now UTF-8.
Therefore, making sure that you export with the same encoding the data is in is very important. In the case of a latin1 database holding UTF-8 data:
$ mysqldump --default-character-set=latin1 --databases wordpress > m.sql
Then replace the latin1 references within the exported dump before reimporting it into a new database in UTF-8. Something like:
$ replace "CHARSET=latin1" "CHARSET=utf8" \
"SET NAMES latin1" "SET NAMES utf8" < m.sql > m2.sql
In my case this link was of great help.
Commented here in Spanish.
Though it is probably no longer relevant for the OP, I happened to find a solution in the MySQL documentation for ALTER TABLE. I post it here for future reference:
Warning
The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
The reason this works is that there is no conversion when you convert to or from BLOB columns.
LOAD DATA INFILE allows you to specify the encoding the file is supposed to be in:
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
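A sketch of a reload using the CHARACTER SET clause (the file path and field/line terminators are assumptions; only the table name is taken from the question):
-- Tell MySQL the input file is UTF-8 so it is not reinterpreted as latin1
LOAD DATA INFILE '/tmp/cities.csv'
INTO TABLE MyTableName
CHARACTER SET utf8
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';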
I wrote http://code.google.com/p/mysqlutf8convertor/ to convert a latin1 database to a UTF-8 database. It changes all tables and fields to UTF-8.
Converting latin1 to UTF8 is not what you want to do; you kind of need the opposite.
If what really happened was this:
UTF-8 strings were interpreted as Latin-1 and transcoded to UTF-8, mangling them.
You are now, or could be, reading those UTF-8 strings back with no further interpretation.
What you must do now is:
Read the "UTF-8" with no transcode.
Convert it to Latin-1. Now you should actually have the original UTF-8.
Now put it in your "UTF-8" column with no further conversion.
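A common SQL idiom that implements these three steps in place is the nested CONVERT/CAST below. The table and column names are taken from the question, but treat this as a sketch and test it on a copy first, since it assumes every row in the column is double-encoded:
-- Step 2: CONVERT(... USING latin1) recovers the original UTF-8 bytes
-- Step 3: CAST(... AS BINARY) drops the charset label, then CONVERT(... USING utf8) stores them as real UTF-8
UPDATE MyTableName
SET City = CONVERT(CAST(CONVERT(City USING latin1) AS BINARY) USING utf8);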
I recently completed a shell script that automates the conversion process. It is also configurable, so you can write custom filters for any text you wish to replace or remove, for example stripping HTML characters. Table whitelists and blacklists are also possible. You can download it at SourceForge: https://sourceforge.net/projects/mysqltr/
Try this:
1) Dump your DB
mysqldump --default-character-set=latin1 -u username -p databasename > dump.sql
2) Open dump.sql in a text editor and replace all occurrences of "SET NAMES latin1" with "SET NAMES utf8"
3) Create a new database and restore your dumpfile
cat dump.sql | mysql -u root -p newdbname