How to select with a UTF-8 BOM in Hive?

I have a table with data like:
A    B
a1   b1
a2   b2
I want to execute SQL like:
select A,B from test where A = 'a1'
but the stored value a1 has a UTF-8 BOM in front of it, so I cannot get the row (a1,b1) back.
I do not have the power to change the data, so how do I write the SQL, i.e. where A = '???'

Try issuing
ALTER TABLE test SET SERDEPROPERTIES ('serialization.encoding'='UTF-8');
before your SELECT statement.
Or you can produce a new table test2 like this:
CREATE TABLE test2
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.encoding'='UTF-8')
AS SELECT * FROM test;
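If you cannot change the table metadata at all, another option is to strip the BOM inside the query itself. A minimal sketch, assuming the BOM survives decoding as the single character U+FEFF at the start of the field:
-- '\\uFEFF' reaches the Java regex engine as \uFEFF, the BOM code point
SELECT A, B
FROM test
WHERE regexp_replace(A, '^\\uFEFF', '') = 'a1';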

Related

Hive is not handling integer values properly when loading text data into table

I was loading some text data containing int columns into Apache Hive. It was storing NULL values in unexpected places, so I ran some tests:
create table testdata (c1 INT, c2 FLOAT) row format delimited fields terminated by ',' stored as textfile;
load data local inpath "testdata.csv" overwrite into table testdata;
select * from testdata;
testdata.csv contains this data:
1,1.0
1, 1.0
1 ,1.0
1 , 1.0
As you can see, the dataset contains some extra whitespace around the numbers. This causes Hive to store NULL in the integer column, while the float column is parsed correctly.
Select query output:
1 1.0
NULL 1.0
NULL 1.0
NULL 1.0
Why is this happening, and how do I handle these cases correctly?
You cannot do it in one step. Hive's text serde rejects surrounding whitespace when parsing integers and stores NULL on failure, while its float parsing (which evidently follows Java's Float.parseFloat, and that trims whitespace) tolerates it; that is why only the int column comes back NULL.
First load the data as strings into a staging table, then load the final table from the staging table, removing the whitespace.
Create and load the staging table like below.
create table stgtestdata (c1 string, c2 string) row format delimited fields terminated by ',' stored as textfile;
load data local inpath "testdata.csv" overwrite into table stgtestdata;
Then use INSERT to load the final table, trimming the whitespace and casting to the proper types, like below:
INSERT OVERWRITE TABLE testdata
SELECT
  CAST(TRIM(c1) AS INT)   AS c1,
  CAST(TRIM(c2) AS FLOAT) AS c2
FROM stgtestdata;
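To see the parsing behavior in isolation, here is a quick check (a sketch; it assumes your Hive version allows SELECT without a FROM clause, available since 0.13):
-- the untrimmed cast fails and yields NULL; the trimmed one succeeds
SELECT CAST(' 1' AS INT)       AS untrimmed,
       CAST(TRIM(' 1') AS INT) AS trimmed;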

Can we load a text file separated by :: into a Hive table?

Is there a way to load a simple text file whose fields are separated by "::" into a Hive table, other than replacing those "::" with "," and then loading it?
Replacing the "::" with "," is quick when the text file is small, but what if it contains millions of records?
Try creating the Hive table using the RegexSerDe.
Example:
I had a file with the below text in it:
i::90
w::99
Create Hive table:
hive> create external table default.i
(Id STRING,
Name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '(.*?)::(.*)')
STORED AS TEXTFILE;
Select from Hive table:
hive> select * from i;
+-------+---------+
| i.id  | i.name  |
+-------+---------+
| i     | 90      |
| w     | 99      |
+-------+---------+
If you want to skip the header line, use the below syntax:
hive> create external table default.i
(Id STRING,
Name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '(.*?)::(.*)')
STORED AS TEXTFILE
tblproperties ('skip.header.line.count'='1');
UPDATE:
Check whether there are any older files in your table location; if some files are there, delete them (if you don't want them).
1. Create the Hive table as:
create external table <db_name>.<table_name>
(col1 STRING,
col2 STRING,
col3 string,
col4 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '(.*?)::(.*?)::(.*?)::(.*)')
STORED AS TEXTFILE;
2. Then run:
load data local inpath '<source_path>' overwrite into table <db_name>.<table_name>;
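An alternative to the regex approach is the MultiDelimitSerDe, which handles multi-character delimiters directly. A sketch, assuming hive-contrib is on the classpath and the table name i_multidelim is illustrative (in Hive 4.x the class moved to org.apache.hadoop.hive.serde2.MultiDelimitSerDe):
create external table default.i_multidelim (
  id   STRING,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ('field.delim'='::')
STORED AS TEXTFILE;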

How can I create an external table using TEXTFILE with Presto?

I have a CSV file in the HDFS directory /user/bzhang/filefortable:
123,1
And I use the following to create an external table with Presto in Hive:
create table hive.testschema.au1 (count bigint, matched bigint) with (format='TEXTFILE', external_location='hdfs://192.168.0.115:9000/user/bzhang/filefortable');
But when I run select * from au1, I got
presto:testschema> select * from au1;
count | matched
-------+---------
NULL | NULL
I changed the comma to a TAB as the delimiter, but it still returns NULL. But if I modify the csv to
123
with only 1 column, the select * from au1 gives me:
presto:testschema> select * from au1;
count | matched
-------+---------
123 | NULL
So am I wrong about the file format, or is it something else?
I suppose the field delimiter of the table is '\u0001', which is Hive's default for text tables.
Either change the ',' in the file to '\u0001', or change the table's field delimiter to ',', and check whether your problem is solved.
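A concrete way to do the latter is to declare the delimiter on the Hive side and then query the table from Presto. A sketch reusing the location from the question (it assumes your Presto hive catalog maps testschema to a Hive database of the same name):
CREATE EXTERNAL TABLE testschema.au1 (
  `count`  BIGINT,  -- backticks guard against a clash with the built-in function name
  matched  BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://192.168.0.115:9000/user/bzhang/filefortable';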

How to output a table as a parquet file in spark-sql, not spark-shell?

It is easy to read a table from a CSV file using spark-sql:
CREATE TABLE MyTable (
X STRING,
Y STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\,",
"quoteChar" = "\""
)
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'input.csv' INTO TABLE MyTable;
But how can I output this result as Parquet file?
PS: I know how to do that in spark-shell, but it is not what I'm looking for.
You have to create a table in Hive with the schema of your results, stored as Parquet. After getting the results you can export them into the Parquet-format table like this:
set hive.insert.into.external.tables = true;
create external table mytable_parq (<use your source table's DDL>) stored as parquet location '/hadoop/mytable';
insert into mytable_parq select * from mytable;
or
insert overwrite directory '/hadoop/mytable' STORED AS PARQUET select * from MyTable ;
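Concretely, with the two-column schema from the question (a sketch; the table name and HDFS location are illustrative):
CREATE EXTERNAL TABLE MyTable_parq (
  X STRING,
  Y STRING
)
STORED AS PARQUET
LOCATION '/hadoop/mytable_parq';

INSERT INTO TABLE MyTable_parq SELECT * FROM MyTable;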

Writing columns having NULL as some string using OpenCSVSerde - HIVE

I'm using 'org.apache.hadoop.hive.serde2.OpenCSVSerde' to write hive table data.
CREATE TABLE testtable ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "'"
)
STORED AS TEXTFILE LOCATION '<location>' AS
select * from foo;
So, if table 'foo' has empty strings in it, e.g. '1','2','', the empty strings are written as-is to the textfile; the data in the textfile reads '1','2',''.
But if 'foo' contains NULL values, e.g. '1','2',NULL, the NULL value is not written to the textfile.
The data in the textfile reads '1','2',
How do I make sure the NULLs are properly written to the textfile using the CSV serde, either as empty strings or as some other string, say "nullstring"?
I also tried this:
CREATE TABLE testtable ROW FORMAT SERDE
....
....
STORED AS TEXTFILE LOCATION '<location>'
TBLPROPERTIES ('serialization.null.format'='')
AS select * from foo;
This should presumably write the NULLs as empty strings, but it doesn't even do that.
Please guide me on how to write NULLs to the CSV files.
Will I have to check for NULL values in the SELECT query itself (with NVL or something) and replace them?
OpenCSVSerde ignores the 'serialization.null.format' property. You can handle NULL values using the steps below:
1. CREATE TABLE testtable
(
name string,
title string,
birth_year string
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "'"
)
STORED AS TEXTFILE;
2. load data into testtable
3. CREATE TABLE testtable1
(
name string,
title string,
birth_year string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES('serialization.null.format'='');
4. INSERT OVERWRITE TABLE testtable1 SELECT * FROM testtable;
The NULLs are then serialized as empty strings in testtable1's underlying files.
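Alternatively, as you guessed, you can substitute a sentinel in the SELECT itself so the CSV serde never sees a NULL. A sketch against the example columns above (testtable2 and 'nullstring' are illustrative names):
CREATE TABLE testtable2 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "'"
)
STORED AS TEXTFILE AS
SELECT
  COALESCE(name, 'nullstring')       AS name,       -- replaces NULL before serialization
  COALESCE(title, 'nullstring')      AS title,
  COALESCE(birth_year, 'nullstring') AS birth_year
FROM testtable;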