How do I map an HBase column with no qualifier in Hive?

I want to map my HBase table to Hive; this is what I have:
CREATE EXTERNAL TABLE kutschke.bda01.twitter (
rowkey BIGINT,
userId BIGINT,
text STRING,
creationTime STRING,
isRetweet BOOLEAN,
retweetId BIGINT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key, user:id, text:, time:createdAt, retweet:isRetweet, retweet:retweetId')
TBLPROPERTIES ('hbase.table.name' = 'kutschke.bda01.twitter');
However, the 'text:' column doesn't get properly mapped because it has no qualifier. Instead I get the exception:
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException:
MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.hbase.HBaseSerDe:
hbase column family 'text' should be mapped to Map<? extends LazyPrimitive<?, ?>,?>,
that is the Key for the map should be of primitive type, but is mapped to string)
I think I understand the logic behind mapping the whole column family to a MAP, but is there a way to properly map the column with the empty qualifier? If not, how do I go about mapping the column family to a MAP, and how will I retrieve the column I actually want?

This can be done by typing the Hive column as the Hive native map type, like this:
CREATE TABLE hbase_table_1(value map<string,int>, row_key int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = "cf:,:key"
);
The field mapped to a whole column family comes back as a Hive map, which is rendered as a JSON-style string of qualifier/value pairs when selected.
More info here : https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-HiveMAPtoHBaseColumnFamily
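To read back the column that was stored with an empty qualifier, the map can then be indexed with an empty-string key. This is only a sketch based on the table above, assuming HBase exposes the unqualified cell under the key '':
-- The whole-CF map is keyed by qualifier, so a cell written with an empty
-- qualifier should show up under the empty-string key.
SELECT row_key, value['']
FROM hbase_table_1;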

Related

Why do array values appear in Impala but not Hive?

I have a column defined as an array in my table (Hive).
create external table rule (
  id string,
  names array<string>
)
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '|'
stored as parquet
location 'hdfs://folder';
Example of a value in names: Joe|Jimmy
When I query the table in Impala I retrieve the data, but in Hive I only get NULL. Why this behavior? I would even understand the reverse.
I found the answer: the data was written by a Spark job as a string instead of an array.
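A quick way to confirm (and work around) this from the Hive side is to declare the column as STRING, matching what Spark actually wrote, and split on the pipe yourself. This is only a sketch; the table name rule_raw is hypothetical:
-- Hypothetical check: read names as the STRING Spark wrote, then split
-- on '|' to recover the intended array.
CREATE EXTERNAL TABLE rule_raw (
  id string,
  names string
)
STORED AS PARQUET
LOCATION 'hdfs://folder';

SELECT id, split(names, '\\|') AS names
FROM rule_raw;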

Create a table in Hive and populate it with data

While trying to load data in a Hive table I encountered a behavior that looks strange to me. My data is made up of JSON objects loaded as records in a table called twitter_test containing a single column named "json".
Now I want to extract three fields from each JSON and build a new table called "my_twitter". I thus issue the command
CREATE TABLE my_twitter AS
SELECT regexp_replace(get_json_object(t.json, '$.body[0]'), '\n', '') AS text,
       get_json_object(t.json, '$.publishingdate[0]') AS created_at,
       get_json_object(t.json, '$.author_screen_name[0]') AS author
FROM twitter_test AS t;
The result is a table with three columns that contains no data. However, if I run the SELECT command alone it returns data as expected.
By trial and error I found out that I need to add LIMIT x at the end of the query for data to be inserted into the new table. The question is: why?
Furthermore, it seems strange that I need to know in advance the number x of rows returned by the SELECT statement for the CREATE to work correctly. Is there any workaround?
You could create a table on this JSON data using a JSON SerDe, which parses the JSON objects so that you can select each individual column directly.
Below is a sample Hive DDL for creating a JSON table using the JSON SerDe:
CREATE EXTERNAL TABLE `json_table`(
  A string,
  B string
)
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'PATH';
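With such a table in place, the fields can be selected directly and the CTAS no longer needs get_json_object. The sketch below uses hypothetical column names taken from the attributes in the original query; adjust the names and types to the real JSON structure:
-- Hypothetical columns named after the JSON attributes from the question.
CREATE EXTERNAL TABLE twitter_json (
  body string,
  publishingdate string,
  author_screen_name string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION 'PATH';

CREATE TABLE my_twitter AS
SELECT regexp_replace(body, '\n', '') AS text,
       publishingdate AS created_at,
       author_screen_name AS author
FROM twitter_json;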

Hive: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException

I have a Parquet file (created by Drill) that I'm trying to read in Hive as an external table. The data types are copied one-to-one (i.e. INTEGER -> INT, BIGINT -> BIGINT, DOUBLE -> DOUBLE, TIMESTAMP -> TIMESTAMP, CHARACTER VARYING -> STRING). There are no complex types.
Drill has no problem querying the file it created, but Hive does not like it:
CREATE EXTERNAL TABLE my_table
(
<col> <data_type>
)
STORED AS PARQUET
LOCATION '<hdfs_location>';
I can execute SELECT COUNT(*) FROM my_table and get the correct number of rows back, but when I ask for the first row it says:
Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.LongWritable (state=,code=0)
I'm not sure why it complains because I use integers and big integers, none of which I assume are stored as longs. Moreover, I would assume that an integer can be cast to a long. Is there a known workaround?
It's just because of your data.
I was facing the same issue: my data was in int format, but I had created the external table column as String.
Give the appropriate data types in the Hive CREATE statement. Hive does not support certain type names, e.g. long; use bigint instead.
Here is the two-step solution:
First, drop the table:
DROP TABLE IF EXISTS <TableName>;
Second, recreate the table, this time with 'bigint' instead of 'long':
CREATE EXTERNAL TABLE <TableName>
(
  <col> bigint
)
STORED AS PARQUET
LOCATION '<hdfs_location>';

How to access individual elements of a blob in DynamoDB using a Hive script?

I am transferring data from DynamoDB to S3 using a Hive script in AWS Data Pipeline. I am using a script like this:
CREATE EXTERNAL TABLE dynamodb_table (
  PROPERTIES STRING,
  EMAIL STRING,
  .............
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "${DYNAMODB_INPUT_TABLE}",
  "dynamodb.column.mapping" = "PROPERTIES:Properties,EMAIL:EmailId...."
);
CREATE EXTERNAL TABLE s3_table (
PROPERTIES STRING,
EMAIL STRING,
......
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION '${S3_OUTPUT_BUCKET}';
INSERT OVERWRITE TABLE s3_table SELECT * FROM dynamodb_table;
The Properties column in the DynamoDB table looks like this:
Properties : String
:{\"deal\":null,\"MinType\":null,\"discount\":null}
That is, it contains multiple attributes. I want each attribute in Properties to come out as a separate column (not just as a string in a single column). I want the output in this schema:
deal MinType discount EMAIL
How can I do this?
Is your Properties column in proper JSON format? If so, it looks like you can use get_json_object: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object
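For example, assuming Properties really holds valid JSON, the individual attributes could be pulled out like this (a sketch, not tested against the actual data):
-- Extract each attribute of the JSON string into its own column.
SELECT
  get_json_object(PROPERTIES, '$.deal')     AS deal,
  get_json_object(PROPERTIES, '$.MinType')  AS MinType,
  get_json_object(PROPERTIES, '$.discount') AS discount,
  EMAIL
FROM dynamodb_table;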

Hive MAP isn't reading input correctly

I am trying to create a table on this Mahout recommender system output data on S3:
703209355938578 [18519:1.5216354,18468:1.5127649,17962:1.5094717,18317:1.5075916]
828667482548563 [18070:1.0,18641:1.0,18632:1.0,18770:1.0,17814:1.0,18095:1.0]
1705358040772485 [18783:1.0,17944:1.0,18632:1.0,18770:1.0,18914:1.0,18386:1.0]
with this schema:
CREATE external table user_ad_reco (
userid bigint,
reco MAP<bigint , double>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
LOCATION
's3://xxxxx/data/RS/output/m05/';
but when I read the data back with Hive,
hive >
select * from user_ad_reco limit 10;
it gives output like this:
703209355938578 {18519:1.5216354,18468:1.5127649,17962:null}
828667482548563 {18070:1.0,18641:1.0,18632:1.0,18770:1.0,17814:null}
1705358040772485 {18783:1.0,17944:1.0,18632:1.0,18770:1.0,18914:null}
So the last key:value pair of the map input is missing from the output, and the last pair that does appear has a null value :(.
Can anyone help regarding this?
Reason for the nulls:
The input data format with brackets gives nulls. Because of the brackets, the row format is not read properly: the last map entry 1.5075916 is read as 1.5075916], which yields null due to the data type mismatch.
703209355938578 [ 18519:1.5216354,18468:1.5127649,17962:1.5094717,18317:1.5075916 ]
The input data format without brackets works cleanly (tested):
703209355938578 18519:1.5216354,18468:1.5127649,17962:1.5094717,18317:1.5075916
Thanks @ramisetty, I have done it in a somewhat indirect way: first I got rid of the two brackets [,] around the map string, then created the schema on the bracket-free string.
CREATE EXTERNAL TABLE user_ad_reco_serde (
userid STRING,
reco_map STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]+)\\s\\[([^]]+)]"
)
STORED AS TEXTFILE
LOCATION
's3://xxxxxx/data/RS/output/6m/2014-01-2014-05/';
CREATE external table user_ad_reco_plain(
userid bigint,
reco string)
LOCATION
's3://xxxxx/data/RS/output/6m_plain/2014-01-2014-05/';
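The step that moves the bracket-stripped data from the SerDe table into the plain table is not shown; presumably it is something along these lines (an assumed sketch):
-- Assumed intermediate step: copy the regex-extracted fields into the
-- plain table so the MAP table below can read them.
INSERT OVERWRITE TABLE user_ad_reco_plain
SELECT CAST(userid AS BIGINT), reco_map
FROM user_ad_reco_serde;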
CREATE external table user_ad_reco (
userid bigint,
reco MAP<bigint , double>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
LOCATION
's3://xxxxxx/data/RS/output/6m_plain/2014-01-2014-05/';
There might be a simpler way.
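One possibly simpler route (a sketch only, not tested against this data) would be to skip the RegexSerDe, read the raw line as a plain string column, and do the bracket stripping and map building in one query with regexp_replace and str_to_map. Note that str_to_map returns map<string,string>, so keys and values would still need casting, and the table user_ad_reco_raw below is hypothetical:
-- Hypothetical one-step alternative over a raw (userid BIGINT, reco STRING) table.
SELECT userid,
       str_to_map(regexp_replace(reco, '[\\[\\]]', ''), ',', ':') AS reco_map
FROM user_ad_reco_raw;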