Hive table from HBase with a column containing Avro - hive

I was able to create an external Hive table with just one column containing Avro data stored in HBase, using the following query:
CREATE EXTERNAL TABLE test_hbase_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key,familyTest:columnTest",
"familyTest.columnTest.serialization.type" = "avro",
"familyTest.columnTest.avro.schema.url" = "hdfs://path/person.avsc")
TBLPROPERTIES (
"hbase.table.name" = "otherTest",
"hbase.mapred.output.outputtable" = "hbase_avro_table",
"hbase.struct.autogenerate"="true");
What I would like to do is create a table with the same Avro column plus other columns containing strings or integers, but I was not able to do that and did not find any example. Can anyone help me? Thank you
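For what it's worth, here is a sketch of one shape such a table could take, with the Avro column declared explicitly as a matching struct alongside plain columns. This is an untested assumption, not a confirmed solution: the struct shape is guessed from a hypothetical person.avsc, and familyTest:city and familyTest:score are hypothetical HBase columns:
CREATE EXTERNAL TABLE test_hbase_avro_mixed (
key string,
-- struct shape assumed to match the (hypothetical) fields of person.avsc
person struct<name:string, age:int>,
city string,
score int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
-- only the Avro column carries serialization properties; the others stay plain
"hbase.columns.mapping" = ":key,familyTest:columnTest,familyTest:city,familyTest:score",
"familyTest.columnTest.serialization.type" = "avro",
"familyTest.columnTest.avro.schema.url" = "hdfs://path/person.avsc")
TBLPROPERTIES (
"hbase.table.name" = "otherTest",
"hbase.mapred.output.outputtable" = "hbase_avro_table");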

Related

Hive table with Avro Schema

I have created a Hive external table with an Avro schema (complex types) and partition columns. After adding the required partition files, a select query returns NULL values for all the columns except the partition columns. The Avro schema contains array and struct types.
Here is the DDL,
CREATE EXTERNAL TABLE mytable PARTITIONED BY(date int, city string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/cloud/location'
TBLPROPERTIES ('avro.schema.url'='cloud location for schema file');
I tried giving the schema file directly and also tried a schema literal in TBLPROPERTIES.
The select query returns NULL for all the columns.
Any suggestions for fixing this issue? Is anything missing in this scenario?
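One answer-style sketch, purely as an assumption about what might be missing: an external partitioned table only sees data once each partition is registered, and with the AvroSerDe an all-NULL result often means the registered locations or the reader schema don't line up with the files on disk. The partition values and directory layout below are hypothetical:
-- hypothetical partition values and layout, for illustration only
ALTER TABLE mytable ADD PARTITION (`date`=20200101, city='london')
LOCATION '/cloud/location/date=20200101/city=london';
-- or, if the directories already follow the key=value convention:
MSCK REPAIR TABLE mytable;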

Define nested items while creating table in Hive

I am trying to create an external Hive table using a CSV file as input.
Here is what my data looks like:
xxx|2021-08-14 07:10:41.080|[{"sub1","90"},{"sub2","95"}]
I am creating the table using the SQL below:
CREATE EXTERNAL TABLE mydb.mytable (
Name string,
Last_upd_timestamp timestamp,
subjects array<struct<sub_code:string,sub_marks:string>>)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('collection.delim'=',','field.delim'='|','serialization.format'='|')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://nameservice1/myinputfile';
When I try the above, the table is created with the subjects column looking like this:
[{"sub_code":"[{\"sub1\",\"90\"},{\"sub2\",\"95\"}]","sub_marks":null}]
Not sure what I am doing wrong here. I would highly appreciate it if someone could help me create the table with the expected output.
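For contrast, a sketch of input that LazySimpleSerDe can actually parse. The SerDe has no JSON support, so the {...} syntax in the third field cannot populate the struct; struct fields nested inside an array are split on the next-level (map-key) delimiter. The reformatted sample line, the '$' delimiter, and the table name mytable_delimited are assumptions:
-- assumed input line: xxx|2021-08-14 07:10:41.080|sub1$90,sub2$95
CREATE EXTERNAL TABLE mydb.mytable_delimited (
Name string,
Last_upd_timestamp timestamp,
subjects array<struct<sub_code:string,sub_marks:string>>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY '$' -- struct fields inside the array reuse this delimiter
STORED AS TEXTFILE
LOCATION 'hdfs://nameservice1/myinputfile';
With input shaped this way, subjects should come back as [{"sub_code":"sub1","sub_marks":"90"},{"sub_code":"sub2","sub_marks":"95"}].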

Parquet file generation with Hive

I'm trying to generate some Parquet files with Hive. To accomplish this, I loaded a regular Hive table from some .tbl files through this command in Hive:
CREATE TABLE REGION (
R_REGIONKEY BIGINT,
R_NAME STRING,
R_COMMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
location '/tmp/tpch-generate';
After this I just execute these two lines:
create table parquet_region LIKE region STORED AS PARQUET;
insert into parquet_region select * from region;
But when I check the output generated in HDFS, I don't find any .parquet files; instead I find files with names like 0000_0 to 0000_21, and the sum of their sizes is much bigger than the original .tbl file.
What am I doing wrong?
The INSERT statement doesn't create files with a .parquet extension, but these are the Parquet files.
You can use DESCRIBE FORMATTED <table> to show table information.
hive> DESCRIBE FORMATTED <table_name>
Additional note: you can also create a new table from a source table using the query below:
CREATE TABLE new_test STORED AS PARQUET AS SELECT * FROM source_table;
It will create the new table in Parquet format and copy the structure as well as the data.
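As a quick check, a hedged sketch of confirming the storage format from within Hive; for a Parquet table, the "# Storage Information" section of the output should list the Parquet SerDe and input format classes:
hive> DESCRIBE FORMATTED parquet_region;
-- expect, among the storage information:
-- SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
-- InputFormat:   org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat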

how to map hbase to hive?

I have an HBase table in the following format:
key : userId#country
column family: k
columns: date#visits, visits
How do I make a Hive table which looks like this:
userId, date, country, visits
I tried to fiddle my way around with column mapping, and so far I only managed to do this:
CREATE EXTERNAL TABLE hbase_table(key string, visits int)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '#'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,k:visits")
TBLPROPERTIES ("hbase.table.name" = "kpi");
I have been working on this for hours and haven't made much progress. Can someone point me in the right direction?
I found out how to map an HBase key onto a Hive row. It's not exactly what I want, but it helps:
CREATE EXTERNAL TABLE hbase_table(key struct<id:string, country:string>, visits int)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '#'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,k:visits")
TBLPROPERTIES ("hbase.table.name" = "kpi");
Is userId a column in your column family 'k'? If it is, then don't put ":key" inside the mapping; try "k:userId" instead.
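Building on the struct-keyed table above, a sketch of one way to get the full userId, date, country, visits layout. The assumption here (untested) is mapping the whole column family to a Hive map, so the date#visits qualifiers become map keys that can be split at query time; hbase_kpi is a hypothetical table name:
CREATE EXTERNAL TABLE hbase_kpi(key struct<id:string, country:string>, k map<string,int>)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '#'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,k:")
TBLPROPERTIES ("hbase.table.name" = "kpi");
-- split the date#visits qualifiers at query time
SELECT key.id AS userId,
split(q, '#')[0] AS `date`,
key.country AS country,
v AS visits
FROM hbase_kpi LATERAL VIEW explode(k) t AS q, v
WHERE q LIKE '%#%';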

I have a JSON file and I want to create a Hive external table over it but with more descriptive field names

I have a JSON file and I want to create a Hive external table over it, but with more descriptive field names. Basically, I want to map the less descriptive field names present in the JSON file to more descriptive fields in the Hive external table.
e.g.
{"field1":"data1","field2":100}
Hive Table:
Create External Table my_table (Name string, Id int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/path-to/my_table/';
Where Name points to field1 and Id points to field2.
Thanks!!
You can use this SerDe, which allows custom mappings between the JSON data and the Hive columns: https://github.com/rcongiu/Hive-JSON-Serde
See in particular this part: https://github.com/rcongiu/Hive-JSON-Serde#mapping-hive-keywords
So, in your case, you'd need to do something like:
CREATE EXTERNAL TABLE my_table(name STRING, id INT)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
"mapping.name" = "field1",
"mapping.id" = "field2" )
LOCATION '/path-to/my_table/';
Note that Hive column names are case-insensitive, while JSON attributes are case-sensitive.
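A quick usage sketch against the sample record from the question:
-- given a file under /path-to/my_table/ containing: {"field1":"data1","field2":100}
SELECT name, id FROM my_table;
-- returns: data1, 100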