Hive XML SerDe: table is empty

I want to store XML data in a Hive table. Here is the XML data:
<servicestatuslist>
<recordcount>1266</recordcount>
<servicestatus id="435680">
<status_text>/: 61%used(9714MB/15975MB) (<80%) : OK</status_text>
<display_name>/ Disk Usage</display_name>
<host_name>zabbix.vshodc.com</host_name>
</servicestatus>
</servicestatuslist>
I have added the SerDe jar file to the class path:
hive> add jar /home/cloudera/HiveJars/hivexmlserde-1.0.5.1.jar ;
Added /home/cloudera/HiveJars/hivexmlserde-1.0.5.1.jar to class path
Added resource: /home/cloudera/HiveJars/hivexmlserde-1.0.5.1.jar
I have written the following Hive DDL using the XML SerDe:
create table xml_AIR(id STRING, status_text STRING, display_name STRING, host_name STRING)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties(
"column.xpath.id"="/servicestatus/#id",
"column.xpath.status_text"="/servicestatus/status_text/text()",
"column.xpath.display_name"="/servicestatus/display_name/text()",
"column.xpath.host_name"="/servicestatus/host_name/text()"
)
stored as
inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/cloudera/input/air.xml'
tblproperties(
"xmlinput.start"="<servicestatus",
"xmlinput.end"="</servicestatus>"
);
OK
Time taken: 1.609 seconds
When I issue a SELECT, it doesn't show the table's data:
hive> select * from xml_AIR;
OK
Time taken: 3.0 seconds
What's wrong with the above code? Please help.

I ran into the same problem when dealing with the XML SerDe. After some struggle, I fixed it by using a separate "LOAD DATA" statement and leaving the "LOCATION" clause out of the "CREATE" statement.
The following is my XML data:
<record customer_id="0000-JTALA">
<income>200000</income>
<demographics>
<gender>F</gender>
<agecat>1</agecat>
<edcat>1</edcat>
<jobcat>2</jobcat>
<empcat>2</empcat>
<retire>0</retire>
<jobsat>1</jobsat>
<marital>1</marital>
<spousedcat>1</spousedcat>
<residecat>4</residecat>
<homeown>0</homeown>
<hometype>2</hometype>
<addresscat>2</addresscat>
</demographics>
<financial>
<income>18</income>
<creddebt>1.003392</creddebt>
<othdebt>2.740608</othdebt>
<default>0</default>
</financial>
</record>
CREATE TABLE Statement:
CREATE TABLE xml_bank6(customer_id STRING, income BIGINT, demographics map<string,string>, financial map<string,string>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.customer_id"="/record/#customer_id",
"column.xpath.income"="/record/income/text()",
"column.xpath.demographics"="/record/demographics/*",
"column.xpath.financial"="/record/financial/*"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<record customer",
"xmlinput.end"="</record>"
);
CREATE Query Result:
OK
Time taken: 0.925 seconds
hive>
For the above CREATE statement, I used the following "LOAD DATA" statement to load the data from an XML file into the table:
hive> load data local inpath '/home/mahesh/hive_input_datasets/XMLdata/XMLdatafile.xml' overwrite into table xml_bank6;
LOAD Query Result:
Copying data from file:/home/mahesh/hive_input_datasets/XMLdata/XMLdatafile.xml
Copying file: file:/home/mahesh/hive_input_datasets/XMLdata/XMLdatafile.xml
Loading data to table default.xml_bank6
Table default.xml_bank6 stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 500, raw_data_size: 0]
OK
Time taken: 0.879 seconds
hive>
And finally,
SELECT Query and Result:
hive> select * from xml_bank6;
OK
0000-JTALA 200000 {"empcat":"2","jobcat":"2","residecat":"4","retire":"0","hometype":"2","addresscat":"2","homeown":"0","spousedcat":"1","gender":"F","jobsat":"1","edcat":"1","marital":"1","agecat":"1"} {"default":"0","income":"18","othdebt":"2.740608","creddebt":"1.003392"}
Time taken: 0.149 seconds, Fetched: 1 row(s)
hive>
For the table in your question, I would also suggest setting "xmlinput.start" to "<servicestatus id" instead of "<servicestatus", because the XML start tag follows the pattern <servicestatus id="some data">. I believe this will be helpful for you.
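Putting both suggestions together, here is a rough sketch of how your table could be created and loaded this way (no LOCATION clause; the file is loaded afterwards from the HDFS path mentioned in your question):
-- Same SerDe definition as in the question, but without LOCATION and with the
-- start pattern including the id attribute.
create table xml_AIR(id STRING, status_text STRING, display_name STRING, host_name STRING)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties(
"column.xpath.id"="/servicestatus/@id",
"column.xpath.status_text"="/servicestatus/status_text/text()",
"column.xpath.display_name"="/servicestatus/display_name/text()",
"column.xpath.host_name"="/servicestatus/host_name/text()"
)
stored as
inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
tblproperties(
"xmlinput.start"="<servicestatus id",
"xmlinput.end"="</servicestatus>"
);

-- LOAD DATA (without LOCAL) moves the HDFS file into the table's warehouse directory.
load data inpath '/user/cloudera/input/air.xml' overwrite into table xml_AIR;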

Well, the code looks good. As per the example in this link, it should work for you.
By the way, there is a typo in the code you have provided: in the table definition, status_test STRING should be status_text STRING, or vice versa.

The entire XML file should be a single line (i.e. no newlines in the XML). A simple Unix command to strip newlines is tr '\n\r' ' ' < source.xml > processed.xml.
https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources

According to the Hive DDL documentation, the LOCATION clause expects an hdfs_path, i.e. a directory. Hence, try specifying only the directory, not the full path to your XML file. Note that with the LOAD-after-CREATE approach you do not get an external table, which might be an interesting approach in some cases.
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/TruncateTable

For LOCATION, give a directory only instead of a file:
create table xml_AIR(id STRING, status_text STRING, display_name STRING, host_name STRING)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties(
"column.xpath.id"="/servicestatus/#id",
"column.xpath.status_text"="/servicestatus/status_text/text()",
"column.xpath.display_name"="/servicestatus/display_name/text()",
"column.xpath.host_name"="/servicestatus/host_name/text()"
)
stored as
inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/cloudera/input'
tblproperties(
"xmlinput.start"="<servicestatus",
"xmlinput.end"="</servicestatus>"
);

Related

Snowflake to Hive data movement with partition

We have a requirement to move data from Snowflake to Hive. I am able to unload data from Snowflake to AWS S3 and run an MSCK REPAIR on the Hive side.
But all records come back as NULL in Hive. What could be the reason? Is there anything wrong here?
To check that the Parquet is created correctly, I read the Parquet file using Spark. I am able to read the Parquet file.
##Snowflake
create or replace stage dev_zone.DAILY_LOG url= 's3://myc-mlb-alpha-us-east-1-drg-322t232/hive/rs_hive_008_test1' storage_integration = DEV_HIVE_INTEGRATION file_format = (type = 'parquet') ENCRYPTION = (TYPE = 'AWS_SSE_S3');
copy into @dev_zone.DAILY_LOG from (select * from dev_zone.DAILY_LOG limit 100) partition by ('as_on_date=' ||as_on_date);
##Hive
CREATE EXTERNAL TABLE dev_zone.DAILY_LOG(
dim_id decimal(38,0),
card_type string,
type string,
cntry string)
PARTITIONED BY (
as_on_date date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://myc-mlb-alpha-us-east-1-drg-322t232/hive/rs_hive_008_test1'
What I missed was adding header = true:
copy into @dev_zone.DAILY_LOG from (select * from dev_zone.DAILY_LOG limit 100) partition by ('as_on_date=' ||as_on_date) header = true;
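After re-running the unload with header = true, the new partition directories still need to be registered on the Hive side (the question already mentions using MSCK REPAIR for this); a quick sanity check might look like the following sketch:
-- pick up the as_on_date=... directories written by the Snowflake unload
MSCK REPAIR TABLE dev_zone.DAILY_LOG;

-- spot-check that the columns are no longer NULL
SELECT dim_id, card_type, type, cntry, as_on_date
FROM dev_zone.DAILY_LOG
LIMIT 10;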

Hive: Cannot insert arrays and maps from file in hive table

Here is the schema of the table I have:
CREATE DATABASE IF NOT EXISTS mydb;
USE mydb;
CREATE TABLE IF NOT EXISTS mytab (
idcol string,
arrcol array<string>,
mapcol map<string,string>
)
PARTITIONED BY (data_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
Now all I want to do is insert a single row into this table. I have that row in a PSV file as
123|["a","b"]|{"1":"a","2":"b"}
Here is how I try to load the data:
USE mydb;
LOAD DATA INPATH '/path/to/file' INTO TABLE mytab PARTITION (data_date='2019-02-02');
The query succeeds, but when I look at the results with
hive -e "use mydb; select * from mytab where data_date='2019-02-02';"
I get:
hive> select * from mytab;
OK
123 ["[\"a\",\"b\"]"] {"{\"1\":\"a\",\"2\":\"b\"}":null} 2019-02-02
Time taken: 2.39 seconds, Fetched: 1 row(s)
So it looks like LOAD did some transformation on the data. It kept the string value fine, but had some issues with the array and the map.
How can I properly insert arrays and maps?
I also tried the following as input
123|array("a","b")|{"1":"a","2":"b"}
The load succeeded, but when I queried the data, I got
root@0d2b0044b4c1:/opt# hive -e "use mydb;select * from mytab;"
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
OK
Time taken: 6.096 seconds
OK
123 ["array(\"a\",\"b\")"] {"{\"1\":\"a\",\"2\":\"b\"}":null} 1554090675
Time taken: 3.266 seconds, Fetched: 1 row(s)
UPDATE
Thanks a lot @pedram bashiri for your answer. I created the external table and was able to populate it. However, everything gets populated as a string:
hive> drop table if exists extab;
OK
Time taken: 0.01 seconds
hive> create external table extab(idcol string,arrcol array<string>,mapcol map<string,string>, data_date string)
> row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> with serdeproperties (
> "separatorChar" = "|",
> "quoteChar" = "\"",
> "escapeChar" = "\\"
> )
> stored as textfile
> location '/tmp/serdes/';
OK
Time taken: 0.078 seconds
hive> desc extab;
OK
idcol string from deserializer
arrcol string from deserializer
mapcol string from deserializer
data_date string from deserializer
Time taken: 0.059 seconds, Fetched: 4 row(s)
hive> select * from extab;
OK
123 ["a","b"] {"1":"a","2":"b"} 2019
Time taken: 0.153 seconds, Fetched: 1 row(s)
hive>
Here is what is stored in HDFS:
root@0d2b0044b4c1:/opt# hadoop fs -ls -R /tmp/serdes/
-rw-r--r-- 1 root root 37 2019-04-04 22:06 /tmp/serdes/x.psv
root@0d2b0044b4c1:/opt# hadoop fs -cat /tmp/serdes/x.psv
123|["a","b"]|{"1":"a","2":"b"}|2019
root@0d2b0044b4c1:/opt#
I also tried
create external table extab(idcol string,arrcol array<string>,mapcol map<string,string>, data_date string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
"separatorChar" = "|"
)
stored as textfile
location '/tmp/serdes/';
but still, everything gets stored as a string, so when I try to insert I get a type mismatch.
So after a lot of digging, I figured it out:
CREATE TABLE IF NOT EXISTS mytab (
idcol string,
arrcol array<string>,
mapcol map<string,string>
)
PARTITIONED BY (data_date string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY '='
STORED AS TEXTFILE;
Then I can just load the following
123|a,b|1=a,2=b|2019
root@0d2b0044b4c1:/opt# hive -e "use mydb; LOAD DATA INPATH '/path/to/file' INTO TABLE mytab PARTITION (data_date='2019');"
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
OK
Time taken: 6.456 seconds
Loading data to table mydb.mytab partition (data_date=2019)
OK
Time taken: 1.912 seconds
root@0d2b0044b4c1:/opt# hive -e "use mydb; select * from mytab;"
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
OK
Time taken: 6.843 seconds
OK
123 ["a","b"] {"1":"a","2":"b"} 2019
root@0d2b0044b4c1:/opt#
which is exactly what I needed.
Use OpenCSVSerde to create an external table based on your PSV file and call it mytab_external. Specify SERDEPROPERTIES like
with serdeproperties (
"separatorChar" = "|",
"quoteChar" = """,
"escapeChar" = "\\"
)
And then simply do
INSERT INTO mytab
SELECT * FROM mytab_external;
https://community.hortonworks.com/articles/8313/apache-hive-csv-serde-example.html
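One caveat, which the update above ran into: OpenCSVSerde hands every column back as a string, so the INSERT ... SELECT needs an explicit conversion into the array and map types. A rough sketch of that conversion, assuming the bracketed/braced input format shown earlier, that mytab_external also carries a data_date column (as in the extab example), and using Hive's regexp_replace, split and str_to_map built-ins:
-- allow writing the data_date partition dynamically
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE mytab PARTITION (data_date)
SELECT
  idcol,
  -- '["a","b"]' -> 'a,b' -> array('a','b')
  split(regexp_replace(arrcol, '[\\[\\]"]', ''), ','),
  -- '{"1":"a","2":"b"}' -> '1:a,2:b' -> map('1':'a','2':'b')
  str_to_map(regexp_replace(mapcol, '[{}"]', ''), ',', ':'),
  data_date
FROM mytab_external;
This only holds as long as the values themselves contain no commas or colons; for anything messier the delimiter-based table the asker ended up with is the cleaner route.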

Amazon Athena query with partitions

I created a table with partitions: first by year, then month, then day.
Question: I want to get the data for 12/2017 and for 03/2018. How can I do this?
What I think I should do:
where (year='2017' and month='12') and ( year ='2018' and month='03')
Is that correct? Won't Amazon Athena get confused because of the AND operator and return data for 12/2017, 03/2018, 03/2017 and 12/2018?
PS: I can't test this; I only have a free account.
Thanks.
Anyway, I tried it on a small set of data and found that Amazon Athena does take the parentheses into account.
My test is as follows.
The DDL of the table as generated:
CREATE EXTERNAL TABLE `manyands`(
`years` int COMMENT 'from deserializer',
`months` int COMMENT 'from deserializer',
`days` int COMMENT 'from deserializer')
PARTITIONED BY (
`year` string,
`month` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://mybucket/'
My test data set:
My tests:
1- SELECT * FROM "atlasdatabase"."manyands" where month='1';
I got (in CSV format):
"years","months","days","year","month"
"2017","1","21","2017","1"
"2018","1","81","2018","1"
2- SELECT * FROM "atlasdatabase"."manyands" where month='1' and year='2017';
"years","months","days","year","month"
"2017","1","21","2017","1"
3- SELECT * FROM "atlasdatabase"."manyands" where (month='1' and year='2018') and (month='3' and year='2017') ;
empty (zero records returned)
4- SELECT * FROM "atlasdatabase"."manyands" where (month='1' and year='2018') or (month='3' ) ;
"years","months","days","year","month"
"2018","1","81","2018","1"
"2017","3","73","2017","3"
"2018","3","73","2018","3"
Conclusion: use the OR operator between the different partition conditions.
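Applied to the original question (12/2017 together with 03/2018), the filter would therefore look something like the sketch below; adjust the month literals to match how your partition values are actually formatted (e.g. '3' versus '03'):
SELECT *
FROM "atlasdatabase"."manyands"   -- table from the test above; replace with your own
WHERE (year = '2017' AND month = '12')
   OR (year = '2018' AND month = '03');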

Pig - reading Hive table stored as Avro

I have created a Hive table stored in the Avro file format. I am trying to load the same Hive table using the Pig commands below:
pig -useHCatalog;
hive_avro = LOAD 'hive_avro_table' using org.apache.hive.hcatalog.pig.HCatLoader();
I am getting " failed to read from hive_avro_table " error when I tried to display "hive_avro" using DUMP command.
Please help me how to resolve this issue. Thanks in advance
create table hivecomplex
(name string,
phones array<INT>,
deductions map<string,float>,
address struct<street:string,zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '$'
MAP KEYS TERMINATED BY '#'
STORED AS AVRO
;
hive> select * from hivecomplex;
OK
John [650,999,9999] {"pf":500.0} {"street":"pleasantville","zip":88888}
Time taken: 0.078 seconds, Fetched: 1 row(s)
Now for the Pig side:
pig -useHCatalog;
a = LOAD 'hivecomplex' USING org.apache.hive.hcatalog.pig.HCatLoader();
dump a;
ne.util.MapRedUtil - Total input paths to process : 1
(John,{(650),(999),(9999)},[pf#500.0],(pleasantville,88888))

java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow

I am trying to get compression to work.
The original table is defined as:
create external table orig_table (col1 String ...... coln String)
.
.
.
partitioned by (pdate string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ( "separatorChar" = "|")
STORED AS TEXTFILE location '/user/path/to/table/';
The table orig_table has about 10 partitions with 100 rows each.
To compress it, I have created a similar table, with the only modification being TEXTFILE changed to ORCFILE:
create external table orig_table_orc (col1 String ...... coln String)
.
.
.
partitioned by (pdate string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ( "separatorChar" = "|")
STORED AS ORCFILE location '/user/path/to/table/';
Trying to copy the records across by:
set hive.exec.dynamic.partition.mode=nonstrict;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;
[have tried with other codecs as well, with same error]
set mapred.output.compression.type=RECORD;
insert overwrite table zip_test.orig_table_orc partition(pdate) select * from default.orig_table;
The error I get is:
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"col1":value ... "coln":value}
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:503)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:176)
... 8 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:81)
at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:689)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95)
at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:493)
... 9 more
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 3 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
The same thing works if I make the Hive table a SEQUENCEFILE, just not with ORC. Any workaround? I have seen a couple of questions with the same error, but in a Java program rather than HiveQL.
Gaah! ORC is nothing like CSV!!!
Explaining what you did wrong would take a couple of hours and a good many book excerpts about Hadoop and about DB technology in general, so the short answer is: ROW FORMAT and SERDE do not make sense for a columnar format. And since you are populating that table from within Hive, it's not an EXTERNAL table but a "managed" one, IMHO:
create table orig_table_orc
(col1 String ...... coln String)
partitioned by (pdate string)
stored as Orc
location '/where/ever/you/want'
TblProperties ("orc.compress"="ZLIB");
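With the table defined like this, compression is handled by the "orc.compress" table property, so the mapred.output.compress settings from the question should not be needed. A sketch of the copy step, reusing the dynamic-partition setting already shown in the question (database qualifiers omitted):
-- write each pdate partition dynamically from the CSV-backed source table
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table orig_table_orc partition (pdate)
select * from orig_table;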