Hive: is a daily MSCK REPAIR needed if no new partition is added?

I have a Hive table that is partitioned on a column based on the year. Data is loaded into this table daily, and I don't have the option to run MSCK REPAIR after every load. Since the partition is by year, do I need to run MSCK REPAIR after a daily load if no new partition is added? I tried the following:
val data = Seq(Row("1","2020-05-11 15:17:57.188","2020"))
val schemaOrig = List( StructField("key",StringType,true)
,StructField("txn_ts",StringType,true)
,StructField("txn_dt",StringType,true))
val sourceDf = spark.createDataFrame(spark.sparkContext.parallelize(data),StructType(schemaOrig))
sourceDf.write.mode("overwrite").partitionBy("txn_dt").avro("/test_a")
HIVE EXTERNAL TABLE
create external table test_a(
rowkey string,
txn_ts string
)
partitioned by (order_entry_dt string)
stored as avro
location '/test_a';
msck repair table test_a;
select * from test_a;

I noticed that if no new partition is added, MSCK REPAIR is not needed:
msck repair table test_a;
select * from test_a;
+----------------+--------------------------+------------------------+--+
| test_a.rowkey | test_a.txn_ts | test_a.order_entry_dt |
+----------------+--------------------------+------------------------+--+
| 1 | 2020-05-11 15:17:57.188 | 2020 |
+----------------+--------------------------+------------------------+--+
Then I added one more row with the same partition value (2020):
val data = Seq(Row("2","2021-05-11 15:17:57.188","2020"))
val schemaOrig = List( StructField("rowkey",StringType,true)
,StructField("txn_ts",StringType,true)
,StructField("order_entry_dt",StringType,true))
val sourceDf = spark.createDataFrame(spark.sparkContext.parallelize(data),StructType(schemaOrig))
sourceDf.write.mode("append").partitionBy("order_entry_dt").avro("/test_a")
The Hive query returned 2 rows without running MSCK REPAIR:
select * from test_a;
+----------------+--------------------------+------------------------+--+
| test_a.rowkey | test_a.txn_ts | test_a.order_entry_dt |
+----------------+--------------------------+------------------------+--+
| 1 | 2020-05-11 15:17:57.188 | 2020 |
| 2 | 2021-05-11 15:17:57.188 | 2020 |
+----------------+--------------------------+------------------------+--+
Now I tried adding a NEW partition (2021) to see if the select query would return it without MSCK REPAIR:
val data = Seq(Row("3","2021-05-11 15:17:57.188","2021"))
val schemaOrig = List( StructField("rowkey",StringType,true)
,StructField("txn_ts",StringType,true)
,StructField("order_entry_dt",StringType,true))
val sourceDf = spark.createDataFrame(spark.sparkContext.parallelize(data),StructType(schemaOrig))
sourceDf.write.mode("append").partitionBy("order_entry_dt").avro("/test_a")
Without MSCK REPAIR, the query again returned only 2 rows instead of 3.
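This matches how the Hive metastore works: new files landing in an already-registered partition are picked up automatically at query time, but a brand-new partition directory has to be registered before Hive will see it. A minimal sketch of the two usual options after a load that creates a new partition (the 2021 value comes from the example above):
-- Option 1: discover any new partition directories under the table location
MSCK REPAIR TABLE test_a;
-- Option 2: register just the one partition the load created (cheaper than a full repair)
ALTER TABLE test_a ADD IF NOT EXISTS PARTITION (order_entry_dt='2021')
LOCATION '/test_a/order_entry_dt=2021';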

Related

Hive: merge or tag multiple rows based on neighboring rows

I have the following table and want to merge multiple rows based on neighboring rows.
INPUT (pairs of connected nodes, shown as an image in the original post)
EXPECTED OUTPUT (the connected nodes collected into arrays, shown as an image)
The logic is that since "abc" is connected to "abcd" in the first row and "abcd" is connected to "abcde" in the second row, and so on, "abc", "abcd", "abcde", and "abcdef" are all connected and should be put into one array. The same applies to the remaining rows. The number of connected neighboring rows is arbitrary.
The question is how to do that using a Hive script, without any UDF. Do I have to use Spark for this type of operation? Thanks very much.
One idea I had is to tag the rows first.
How can I do that using a Hive script only?
This is an example of a CONNECT BY (hierarchical) query, which is not supported in Hive or Spark, unlike DB2, Oracle, et al.
You can simulate such a query with Spark Scala, but it is far from handy. Putting a tag in first means the question becomes less relevant, imo.
Here is a work-around using a Hive script to get an intermediate table.
drop table if exists step1;
create table step1 STORED as orc as
with src as
(
select split(u.tmp,",")[0] as node_1, split(u.tmp,",")[1] as node_2
from
(select stack (7,
"abc,abcd",
"abcd,abcde",
"abcde,abcdef",
"bcd,bcde",
"bcde,bcdef",
"cdef,cdefg",
"def,defg"
) as tmp
) u
)
select node_1, node_2,
       if(node_2 = lead(node_1, 1) over (order by node_1), 1, 0) as tag,
       row_number() over (order by node_1) as row_num
from src;
drop table if exists step2;
create table step2 STORED as orc as
SELECT tag, row_number() over (ORDER BY tag) as row_num
FROM (
SELECT cast(v.tag as int) as tag
FROM (
SELECT
split(regexp_replace(repeat(concat(cast(key as string), ","), end_idx-start_idx), ",$",""), ",") as tags --repeat the row number by the number of rows
FROM (
SELECT COALESCE(lag(row_num, 1) over(ORDER BY row_num), 0) as start_idx, row_num as end_idx, row_number() over (ORDER BY row_num) as key
FROM step1 where tag=0
) a
) b
LATERAL VIEW explode(tags) v as tag
) c ;
drop table if exists step3;
create table step3 STORED as orc as
SELECT
a.node_1, a.node_2, b.tag
FROM step1 a
JOIN step2 b
ON a.row_num=b.row_num;
The final table looks like this:
select * from step3;
+---------------+---------------+------------+
| step3.node_1 | step3.node_2 | step3.tag |
+---------------+---------------+------------+
| abc | abcd | 1 |
| abcd | abcde | 1 |
| abcde | abcdef | 1 |
| bcd | bcde | 2 |
| bcde | bcdef | 2 |
| cdef | cdefg | 3 |
| def | defg | 4 |
+---------------+---------------+------------+
The third column (tag) can then be used to group the node pairs and collect the connected nodes.
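To finish the job, the pairs can be grouped by that tag. This is not part of the original answer, just a minimal sketch building on step3 (collect_set is a Hive built-in):
select tag, collect_set(node) as connected_nodes
from (
  select tag, node_1 as node from step3
  union all
  select tag, node_2 as node from step3
) n
group by tag;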

Merge into a BigQuery Partitioned Table via a Join Without Scanning Entire Table

Sample scenario..
I have "BigTable" with millions of rows and "TinyTable" with just a few rows. I need to merge some information from TinyTable into BigTable.
BigTable is partitioned by the column "date_time". My merge will join on date_time and ID.
I really only need the ID column to do the join, but I thought having the date_time column there as well would allow BQ to prune the partitions and look only at the dates necessary. Nope. It does a full scan on BigTable (billing me for Gigabytes of data)... even if TinyTable just has one value (i.e. from one date) in it.
BigTable
+---------------------------+---------+-------+
| date_time | ID | value |
+---------------------------+---------+-------+
| '2019-03-13 00:00:00 UTC' | 100 | .2345 |
| '2019-03-13 00:00:00 UTC' | 101 | .65 |
| '2019-03-14 00:00:00 UTC' | 102 | .648 |
| [+50 millions rows...] | | |
+---------------------------+---------+-------+
TinyTable
+---------------------------+---------+-------+
| date_time | ID | value |
+---------------------------+---------+-------+
| '2019-03-13 00:00:00 UTC' | 100 | .555 |
| '2019-03-14 00:00:00 UTC' | 102 | .666 |
| | | |
+---------------------------+---------+-------+
...
Uses 8 GB...
MERGE BigTable
USING TinyTable
ON BigTable.date_time = TinyTable.date_time and BigTable.id = TinyTable.id
WHEN MATCHED THEN
UPDATE SET date_time = TinyTable.date_time, value = TinyTable.value
WHEN NOT MATCHED THEN
INSERT (date_time, id , value) values (date_time, id , value);
Uses 8 GB...
update BigTable
set value = TinyTable.value
from
TinyTable where
BigTable.date_time = TinyTable.date_time
and
BigTable.id = TinyTable.id
Works as expected (only 12 MB) if I hard-code in a timestamp literal instead of using the value from the join (but not what I'm after)...
update BigTable
set value = TinyTable.value
from
TinyTable where
BigTable.date_time = '2019-03-13 00:00:00 UTC'
and
BigTable.id = TinyTable.id
I need to run something like this hundreds of times per day. As-is, it's not sustainable cost-wise. What am I missing?
Thanks!
With BigQuery scripting (currently in beta), there is a way to reduce the cost.
Basically, a scripting variable is defined to capture the dynamic part of the filter. Then, in the subsequent query, the scripting variable is used as a filter to prune the partitions to be scanned.
DECLARE date_filter ARRAY<DATETIME>
DEFAULT (SELECT ARRAY_AGG(DISTINCT d) FROM TinyTable);
update BigTable
set value = TinyTable.value
from
TinyTable where
BigTable.date_time in UNNEST(date_filter) --This prunes the partition to be scanned
AND
BigTable.date_time = TinyTable.date_time
and
BigTable.id = TinyTable.id;
Possible solution 1:
1. Get all data from the relevant partition and save it to a temporary table.
2. Run the update/merge statement against the temporary table.
3. Rewrite the partition with the temporary table's content.
For step 3 you can address a specific partition using the $ decorator: Dataset.BigTable$20190926
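Not from the original answer, just a minimal sketch of what solution 1 could look like in standard SQL; the table names, the hard-coded date, and the decorator/WRITE_TRUNCATE details in the comments are illustrative and depend on how you submit the job:
-- Steps 1 and 2: compute the new content of a single partition by joining TinyTable in.
-- Step 3: run this as a query job whose destination table is the partition decorator
--         Dataset.BigTable$20190313 with write disposition WRITE_TRUNCATE, so only that
--         partition is rewritten. (Brand-new ids that exist only in TinyTable would need
--         an extra UNION ALL, omitted here.)
SELECT b.date_time, b.id, COALESCE(t.value, b.value) AS value
FROM Dataset.BigTable AS b
LEFT JOIN Dataset.TinyTable AS t
  ON b.id = t.id AND b.date_time = t.date_time
WHERE b.date_time = TIMESTAMP '2019-03-13 00:00:00 UTC';  -- constant filter, so only this partition is scanned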
Possible solution 2:
You can schedule a Python script to run SQL queries like the last one. Google offers a nice client library. You can even run the queries in parallel using ThreadPoolExecutor from concurrent.futures, or any other threading library.

How to delete Hive table records?

How can I delete Hive table records? We have 100 records in the table and I need to delete only 10 of them.
When I use
dfs -rmr table_name
the whole table is deleted.
If there is any chance to delete them in HBase instead, we can send the data to HBase.
You cannot delete directly from a Hive table.
However, you can use a workaround and overwrite the Hive table with only the rows you want to keep:
insert overwrite table table_name
select * from table_name
where id not in (1,2,3,...);
You can't delete individual rows from (non-ACID) Hive tables, since the data is already written to files in HDFS. You can only drop partitions, which deletes the corresponding directories in HDFS. So the best practice is to partition the table if you expect to delete data in the future.
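For example, a minimal sketch (the table name my_table and the partition column dt are illustrative):
ALTER TABLE my_table DROP IF EXISTS PARTITION (dt='2020-01-01');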
To delete records in a table, you can use this SQL syntax from your Hive client (it requires an ACID transactional table):
DELETE FROM tablename [WHERE expression]
Try a WHERE clause with your key in an IN subquery:
DELETE FROM tablename where id in (select id from tablename limit 10);
Example:
I have an ACID transactional table in Hive:
select * from trans;
+-----+-------+--+
| id | name |
+-----+-------+--+
| 2 | hcc |
| 1 | hi |
| 3 | hdp |
+-----+-------+--+
Now I want to delete only the first record (id 2), so my delete statement would be:
delete from trans where id in (select id from trans limit 1);
Result:
select * from trans;
+-----+-------+--+
| id | name |
+-----+-------+--+
| 1 | hi |
| 3 | hdp |
+-----+-------+--+
So we have just deleted the first record. In the same way, you can specify limit 10 and Hive will delete the first 10 records.
You can add an ORDER BY or other clauses to the subquery if you need to delete the first 10 rows in a specific order (for example, ids 1 to 10).
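For reference (not part of the original answer): DELETE only works on ACID tables, so here is a minimal sketch of such a table definition, assuming a Hive version with transactions enabled (older versions also require ORC storage and bucketing, as shown):
create table trans (id int, name string)
clustered by (id) into 4 buckets
stored as orc
tblproperties ('transactional'='true');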

Hive insert overwrite directory stored as parquet produces NULL values

I'm trying to write some data into a directory, and then add that directory as a partition to a table.
create table test (key int, value int) partitioned by (dt int) stored as parquet location '/user/me/test';
insert overwrite directory '/user/me/test/dt=1' stored as parquet select 123, 456, 1;
alter table test add partition (dt=1);
select * from test;
This code sample is simple... but doesn't work. The select statement returns NULL, NULL, 1, but I need 123, 456, 1.
When I read the data with Impala, I get 123, 456, 1... which is what I expect.
Why? What is wrong?
If I remove the two "stored as parquet" clauses, everything works... but I want my data in Parquet!
PS: I want this construct for a partition switch, so that the data are not exposed to users while they are being calculated...
Identifying the issue
hive
create table test (key int, value int)
partitioned by (dt int)
stored as parquet location '/user/me/test'
;
insert overwrite directory '/user/me/test/dt=1'
stored as parquet
select 123, 456
;
alter table test add partition (dt=1)
;
select * from test
;
+----------+------------+---------+
| test.key | test.value | test.dt |
+----------+------------+---------+
| NULL | NULL | 1 |
+----------+------------+---------+
bash
parquet-tools cat hdfs://{fs.defaultFS}/user/me/test/dt=1/000000_0
_col0 = 123
_col1 = 456
The file written by insert overwrite directory carries the generic column names _col0 and _col1. Hive resolves Parquet columns by name, so it finds nothing called key or value and returns NULL, while Impala resolves them by position and reads the data correctly.
Verifying the issue
hive
alter table test change column `key` `_col0` int cascade;
alter table test change column `value` `_col1` int cascade;
select * from test
;
+------------+------------+---------+
| test._col0 | test._col1 | test.dt |
+------------+------------+---------+
| 123 | 456 | 1 |
+------------+------------+---------+
Suggested solution
Create an additional managed table, test_admin, on the same location, and do the insert through it:
create table test_admin (key int, value int)
partitioned by (dt int)
stored as parquet location '/user/me/test'
;
create external table test (key int, value int)
partitioned by (dt int)
stored as parquet
location '/user/me/test'
;
insert into test_admin partition (dt=1) select 123, 456
;
select * from test_admin
;
+-----------------+-------------------+----------------+
| test_admin.key  | test_admin.value  | test_admin.dt  |
+-----------------+-------------------+----------------+
| 123             | 456               | 1              |
+-----------------+-------------------+----------------+
select * from test
;
(empty result set)
alter table test add partition (dt=1)
;
select * from test
;
+----------+------------+---------+
| test.key | test.value | test.dt |
+----------+------------+---------+
| 123 | 456 | 1 |
+----------+------------+---------+

Parquet-backed Hive table: array column not queryable in Impala

Although Impala is much faster than Hive, we used Hive because it supports complex (nested) data types such as arrays and maps.
I noticed that Impala, as of CDH 5.5, now supports complex data types. Since it's also possible to run Hive UDFs in Impala, we can probably do everything we want in Impala, but much, much faster. That's great news!
As I scan through the documentation, I see that Impala expects data to be stored in Parquet format. My data, in its raw form, happens to be a two-column CSV where the first column is an ID, and the second column is a pipe-delimited array of strings, e.g.:
123,ASDFG|SDFGH|DFGHJ|FGHJK
234,QWERT|WERTY|ERTYU
A Hive table was created:
CREATE TABLE `id_member_of`(
`id` INT,
`member_of` ARRAY<STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
The raw data was loaded into the Hive table:
LOAD DATA LOCAL INPATH 'raw_data.csv' INTO TABLE id_member_of;
A Parquet version of the table was created:
CREATE TABLE `id_member_of_parquet` (
`id` STRING,
`member_of` ARRAY<STRING>)
STORED AS PARQUET;
The data from the CSV-backed table was inserted into the Parquet table:
INSERT INTO id_member_of_parquet SELECT id, member_of FROM id_member_of;
And the Parquet table is now queryable in Hive:
hive> select * from id_member_of_parquet;
123 ["ASDFG","SDFGH","DFGHJ","FGHJK"]
234 ["QWERT","WERTY","ERTYU"]
Strangely, when I query the same Parquet-backed table in Impala, it doesn't return the array column:
[hadoop01:21000] > invalidate metadata;
[hadoop01:21000] > select * from id_member_of_parquet;
+-----+
| id |
+-----+
| 123 |
| 234 |
+-----+
Question: What happened to the array column? Can you see what I'm doing wrong?
It turned out to be really simple: we can access the array by adding it to the FROM with a dot, e.g.
Query: select * from id_member_of_parquet, id_member_of_parquet.member_of
+-----+-------+
| id | item |
+-----+-------+
| 123 | ASDFG |
| 123 | SDFGH |
| 123 | DFGHJ |
| 123 | FGHJK |
| 234 | QWERT |
| 234 | WERTY |
| 234 | ERTYU |
+-----+-------+
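For completeness, the same idea with explicit aliases for the table and the nested collection (the aliases t and m are illustrative), which makes it easy to add filters:
select t.id, m.item
from id_member_of_parquet t, t.member_of m
where t.id = 123;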