How to split HBase row key into 2 columns in Hive table - hive

HBase Table
rowkey: 2020-02-02^ghfgewr3434555, cf:1 timestamp=1604405829275, value=true
rowkey: 2020-02-02^ghfgewr3434555, cf:2 timestamp=1604405829275, value=true
rowkey: 2020-02-02^ghfgewr3434555, cf:3 timestamp=1604405829275, value=false
rowkey: 2020-02-02^ghfgewr3434555, cf:4 timestamp=1604405829275, value=false
I want to transfer the HBase data into a Hive table like the one below.
Hive table
date       | Id             | cf:no | boolean
2020-02-02 | ghfgewr3434555 | 1     | true
2020-02-02 | ghfgewr3434555 | 2     | true
2020-02-02 | ghfgewr3434555 | 3     | false
2020-02-02 | ghfgewr3434555 | 4     | false

If you only want to transfer it so it can be queried, you can actually create a table in Hive mapped to that HBase table by specifying the properties:
CREATE TABLE foo(rowkey STRING, a STRING, b STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'bar');
Proper doc here:
https://blog.cloudera.com/hbase-via-hive-part-1/
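Adapted to the table in the question, a minimal sketch might look like the following (the column family, qualifiers and the #b binary suffix mirror the question and the solution below; the table names are placeholders), with the composite rowkey split at query time:
CREATE EXTERNAL TABLE hbase_raw(rowkey STRING, c1 BOOLEAN, c2 BOOLEAN, c3 BOOLEAN, c4 BOOLEAN)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:1#b,cf:2#b,cf:3#b,cf:4#b')
TBLPROPERTIES ('hbase.table.name' = 'your_hbase_table');

-- split the composite rowkey on ^ while querying
SELECT split(rowkey, '\\^')[0] AS `date`,
       split(rowkey, '\\^')[1] AS id,
       c1, c2, c3, c4
FROM hbase_raw;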

I have solved this problem using two tables/views. The first one just copies the data from the HBase table, and the second table/view splits the rowkey into 2 columns.
First Table query in Hive
CREATE EXTERNAL TABLE hbase_hive_table(
key string,
t1 boolean,
t2 boolean
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH
SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:1#b,cf:2#b")
TBLPROPERTIES ("hbase.table.name" = "hbase_table");
Second Table/View query in Hive
CREATE VIEW IF NOT EXISTS hbase_hive_view
AS SELECT
CONCTNS.rowkey[0] AS `date`,
CONCTNS.rowkey[1] AS req_id,
t1,
t2
FROM
(SELECT split(key,'\\^') AS rowkey, t1, t2 FROM hbase_hive_table)
CONCTNS;
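Querying the view then returns the rowkey already split, for example:
SELECT `date`, req_id, t1, t2
FROM hbase_hive_view
WHERE `date` = '2020-02-02';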

Related

How do I find unmatched records with a table that contains comma separated values

I am trying to check if the values from Table1 exist in Table2.
The thing is that the values are comma separated in Table1
Table 1
ID | TXT
1  | 129(a),P24
2  | P112
3  | P24,XX
4  | 135(a),135(b)
Table 2
ID
P24
P112
P129(a)
135(a)
135(b)
The following only works if the complete cell value exists in both tables:
SELECT Table1.ID, Table1.TXT
FROM Table1 LEFT JOIN Table2 ON Table1.[TXT] = Table2.[ID]
WHERE (((Table2.ID) Is Null));
MY QUESTION IS:
Is there a way to check each comma separated value and return those that do not exist in Table 2?
In the above example the value XX should end up in the result.
Not sure why you store your data that way (which is bad practice, as sos mentioned above), but you need to mimic a temp table like in SQL Server:
1. Select from Table1 and create separate TXT rows per ID.
2. Insert the results from step 1 into Table3.
3. Select from Table3 and join it to Table2.
4. Delete Table3.
Table3, the temp table
ID | TXT
1  | 129(a)
1  | P24
2  | P112
3  | P24
3  | XX
4  | 135(a)
4  | 135(b)
Here is some explanation MS Access database (2010) how to create temporary table/procedure/view from Query Designer
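Once Table3 is populated (for example via VBA or a series of append queries), step 3 is the same left-join pattern as in the question; a minimal sketch:
SELECT Table3.ID, Table3.TXT
FROM Table3 LEFT JOIN Table2 ON Table3.TXT = Table2.ID
WHERE Table2.ID Is Null;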

Data Loaded wrongly into Hive Partitioned table after adding a new column using ALTER

I already have a Hive partitioned table. I needed to add a new column to the table, so I used ALTER to add the column like below.
ALTER TABLE TABLE1 ADD COLUMNS(COLUMN6 STRING);
I have my final table load query like this:
INSERT OVERWRITE table Final table PARTITION(COLUMN4, COLUMN5)
select
stg.Column1,
stg.Column2,
stg.Column3,
stg.Column4,   -- partition column; field name: Code, sample value: YAHOO.COM
stg.Column5,   -- partition column; field name: Date, sample value: 2021-06-25
stg.Column6    -- new column; field name: reason, sample value: Adjustment
from (
select fee.* from (
select
fees.* ,
ROW_NUMBER() OVER (PARTITION BY fees.Column1 ORDER BY fees.Column3 DESC) as RNK
from Stage table fee
) fee
where RNK = 1
) stg
left join (
select Column1 from Final table
where Column5 in (select distinct Column5 from Stage table)   -- Column5 is the date
) TGT
on tgt.Column1 = stg.Column1   -- Column1 is the id
where tgt.Column1 is null
UNION
select
tgt.Column1,   -- id
tgt.Column2,
tgt.Column3,
tgt.Column4,   -- partition column
tgt.Column5,   -- partition column (date)
tgt.Column6    -- new column
from
Final Table TGT
WHERE TGT.Column5 in (select distinct Column5 from Stage table);
Now when my job ran today and I tried to query the final table, I got the below error:
Invalid partition value 'Adjustment' for DATE partition key: Code=2021-06-25/date=Adjustment
I can tell something went wrong around the partition columns, but I am unable to figure out what. Can someone help?
Partition columns should be the last ones in the select. When you add a new column, it is added as the last non-partition column, and the partition columns remain the last ones; they are not stored in the data files, only the metadata contains information about partitions. The order of all the other columns also matters and should match the table DDL; check it using DESCRIBE FORMATTED table_name.
INSERT OVERWRITE table Final table PARTITION(COLUMN4, COLUMN5)
select
stg.Column1,
stg.Column2,
stg.Column3,
stg.Column6,   -- new column: last among the non-partition columns
stg.Column4,   -- partition columns come last, in PARTITION clause order
stg.Column5
...
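As a side note, dynamic partition values are taken positionally from the trailing SELECT expressions, not matched by column name. A minimal self-contained sketch (table names here are hypothetical):
CREATE TABLE demo_part (id INT, reason STRING) PARTITIONED BY (site STRING, dt STRING);
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- the last two SELECT expressions populate site and dt, in PARTITION clause order
INSERT OVERWRITE TABLE demo_part PARTITION (site, dt)
SELECT id, reason, site, dt
FROM demo_stage;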

SQL : Compare table data and update the correct value in tables

I have the below data discrepancy in a few of the tables, which needs to be corrected using update queries in SQL.
The master table (take table A) contains 2 primary key values for the same product, like below:
PRRFNBR | PRNBR
XXXX    | 123
YYYY    | 123
And these reference keys are used in 2 tables like below:
Table B:
SUPRFNBR | SUSPRNBR
XXXX     | 234
Table C:
SEPRFNBR | SESUPRNBR
YYYY     | 435
Now I need to compare all these 3 tables and update SEPRFNBR in table C with the reference key available in table B (SUPRFNBR). For example, the reference key XXXX needs to be updated in table C when the same PRNBR has 2 primary key values in table A.
Your logic, as far as I understood it, does not need a reference to table A, since differing SUPRFNBR and SEPRFNBR values would already qualify a row for the update:
update c set SEPRFNBR = (select SUPRFNBR from b where b.SUPRNBR = c.SEPRNBR)
If for some (undescribed) reason the lookup to table A is necessary, it could be extended to something like
update c set SEPRFNBR = (select SUPRFNBR from b
                         where b.SUPRNBR = c.SEPRNBR
                           and (select count(*) from a
                                where b.SUPRNBR = a.PRNBR) > 1)
You may vary the solution depending on other side constraints you may have. This is just meant as a solution idea.
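Before running the update, a quick preview of which rows would change can help (a hedged sketch using the same table and column names as the update above):
select c.SEPRNBR, c.SEPRFNBR, b.SUPRFNBR
from c join b on b.SUPRNBR = c.SEPRNBR
where c.SEPRFNBR <> b.SUPRFNBR;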

hive auto increment after certain number

I have to insert data into a target table where all columns should be populated from different source tables, except the surrogate key column, which should be the maximum value in the target table plus an auto-increment value starting at 1. I can generate the auto-increment value by using the row_number() function, but how, in the same query, do I get the max value of the surrogate key from the target table? Is there any concept in Hive where I can select the max value of the surrogate key and save it in a temporary variable? Or is there any other simple way to achieve this result?
Here are two approaches which worked for me for the above problem (explained with examples).
Approach 1: getting the max and passing it to the Hive commands through a ${hiveconf} variable in a shell script
Approach 2: using row_sequence(), max() and join operations
My Environment:
hadoop-2.6.0
apache-hive-2.0.0-bin
Steps (note: steps 1 and 2 are common to both approaches; from step 3 onward they differ):
Step 1: create source and target tables
source
hive>create table source_table1(name string);
hive>create table source_table2(name string);
hive>create table source_table3(name string);
target
hive>create table target_table(id int, name string);
Step 2: load data into source tables
hive>load data local inpath 'source_table1.txt' into table source_table1;
hive>load data local inpath 'source_table2.txt' into table source_table2;
hive>load data local inpath 'source_table3.txt' into table source_table3;
Sample Input:
source_table1.txt
a
b
c
source_table2.txt
d
e
f
source_table3.txt
g
h
i
Approach 1:
Step 3: create a shell script hive_auto_increment.sh
#!/bin/sh
hive -e 'select max(id) from target_table' > max.txt
wait
value=`cat max.txt`
hive --hiveconf mx=$value -e "add jar /home/apache-hive-2.0.0-bin/lib/hive-contrib-2.0.0.jar;
create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
set mx;
set hiveconf:mx;
INSERT INTO TABLE target_table SELECT row_sequence(),name from source_table1;
INSERT INTO TABLE target_table SELECT (\${hiveconf:mx} +row_sequence()),name from source_table2;
INSERT INTO TABLE target_table SELECT (\${hiveconf:mx} +row_sequence()),name from source_table3;"
wait
hive -e "select * from target_table;"
Step 4: run the shell script
> bash hive_auto_increment.sh
Approach 2:
Step 3: Add Jar
hive>add jar /home/apache-hive-2.0.0-bin/lib/hive-contrib-2.0.0.jar;
Step 4: register row_sequence function with help of hive contrib jar
hive>create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
Step 5: load the source_table1 to target_table
hive>INSERT INTO TABLE target_table select row_sequence(),name from source_table1;
Step 6: load the other sources to target_table
hive>INSERT INTO TABLE target_table SELECT M.rowcount+row_sequence(),T.name from source_table2 T join (select max(id) as rowcount from target_table) M;
hive>INSERT INTO TABLE target_table SELECT M.rowcount+row_sequence(),T.name from source_table3 T join (select max(id) as rowcount from target_table) M;
output:
INFO : OK
+------------------+--------------------+
| target_table.id  | target_table.name  |
+------------------+--------------------+
| 1                | a                  |
| 2                | b                  |
| 3                | c                  |
| 4                | d                  |
| 5                | e                  |
| 6                | f                  |
| 7                | g                  |
| 8                | h                  |
| 9                | i                  |
+------------------+--------------------+
create table autoincrement1 (id int, name string);
insert into autoincrement1
select if(isnull(max(id)), 0, max(id)) + 1, 'sagar' from autoincrement1;
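On newer Hive versions the contrib row_sequence() UDF is not strictly needed; the same offset idea can be sketched with the built-in row_number() window function (a hedged sketch reusing the target_table and source_table2 names from the steps above):
INSERT INTO TABLE target_table
SELECT m.max_id + ROW_NUMBER() OVER (ORDER BY s.name) AS id,
       s.name
FROM source_table2 s
CROSS JOIN (SELECT COALESCE(MAX(id), 0) AS max_id FROM target_table) m;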

SQL Copy Multiple Column from Table 1 in DB A to Table2 in DB B

I have a table data_1 with 2 columns I would like to copy over to table sensor_reading. I would like the column tmp1 in data_1 to get copied over to reading in sensor_reading, and the column dt_s in data_1 to get copied over to reading_time in table sensor_reading.
The following is what I am trying, but I get "Update 0".
update sensor_reading
set reading = data_1.tmp1,
reading_time = data_1.dt_s,
sensor_id = 1
from data_1;
I think this is what you are trying to do if the data is not already in the table:
INSERT INTO sensor_reading (reading, reading_time, sensor_id)
SELECT tmp1, dt_s, 1
FROM data_1;
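If the rows already exist in sensor_reading and an UPDATE is really what is needed, the UPDATE ... FROM form has to join the two tables on some shared key; a hedged sketch, assuming a hypothetical device_id column present in both tables:
UPDATE sensor_reading AS s
SET reading = d.tmp1,
    reading_time = d.dt_s,
    sensor_id = 1
FROM data_1 AS d
WHERE d.device_id = s.device_id;  -- device_id is a hypothetical join key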