Insert overwrite while another process writes to the same table - Hive

How will Hive behave if I use insert overwrite on a partitioned or non-partitioned external table while other processes are writing into the same table?
I am trying the following on a non-partitioned table:
insert overwrite table customer_master (select * from customer_master);
Other Process:
insert into table customer_master select a, b, c;

By default, transactions are turned off in Hive, so there is no conflict detection between different sessions updating the same data.
Statements in Hive are just MapReduce (or Spark, Tez, etc.) jobs, and they run independently. Since both statements operate on the same table and in the end write to the same directory, if the insert into job finishes before the insert overwrite job, its result will be overwritten, because the insert overwrite job cleans the directory before writing its own result.
To avoid this, use Hive transactions.
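A minimal sketch of what enabling transactions involves (session-level settings plus a transactional table; note that ACID tables must be managed rather than external, must be stored as ORC, and require bucketing on older Hive versions, so the external customer_master table would need to be migrated first - the table name and columns below are illustrative only):
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- Transactional (ACID) variant of the table; adjust columns and bucketing to your schema.
CREATE TABLE customer_master_acid (a string, b string, c string)
CLUSTERED BY (a) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');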

Related

Using ingestion-time based pseudo-field (_PARTITIONTIME) as partition while clustering

I'd like to cluster our ingestion-time partitioned tables without having to change the ETL scripts we use to update them. All of our tables are partitioned on the pseudo-field _PARTITIONTIME. Now, when I try to cluster a table with DML, I get the following error:
Invalid field name "_PARTITIONTIME". Field names are not allowed to start with the (case-insensitive) prefixes _PARTITION, TABLE, FILE and _ROW_TIMESTAMP
Here's what the DML-script looks like:
CREATE TABLE `table_target`
PARTITION BY DATE(_PARTITIONTIME)
CLUSTER BY a, b, c
AS
SELECT
*, _PARTITIONTIME
FROM
`table_source`
How should I go about this? Is there a way to keep the same pseudo-field as the partition field, should I re-work the partition field, or am I missing something here?
It is a known limitation that:
It is not possible to create an ingestion-time partitioned table from the result of a query. Instead, use a CREATE TABLE DDL statement to create the table, and then use an INSERT DML statement to insert data into it.
In your case, you need to use CREATE TABLE to create table_target with CLUSTER BY first, then migrate the data over.
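A rough sketch of that two-step flow (the column list is a placeholder for table_source's real schema; the PARTITION BY _PARTITIONDATE DDL form and writing the _PARTITIONTIME pseudo-column from DML are assumptions based on BigQuery's ingestion-time partitioning support, not something verified against your project):
-- Step 1: create the clustered, ingestion-time partitioned target table.
CREATE TABLE `table_target` (
  a STRING,
  b STRING,
  c STRING
  -- ... remaining columns of table_source
)
PARTITION BY _PARTITIONDATE
CLUSTER BY a, b, c;

-- Step 2: copy the data, carrying the original partition time over.
INSERT INTO `table_target` (_PARTITIONTIME, a, b, c)
SELECT _PARTITIONTIME, a, b, c
FROM `table_source`;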

Hive - Huge 10TB table repartitioning (Adding new partition columns)

Techies,
Background -
We have an existing 10 TB Hive table which has been range partitioned on column A. The business case has changed and now requires adding partition column B in addition to column A.
Problem statement -
Since the data on HDFS is huge and needs to be restructured to inherit the new partition column B, we are finding it difficult to copy the table to a backup and re-ingest it into the main table with a simple Impala INSERT OVERWRITE.
We want to explore whether there is an efficient way to add partition columns to such a huge table.
Alright!
If I understand your situation correctly, you have a table backed by 10 TB of data in HDFS, partitioned on column A, and you want to add a partition on column B as well.
So, if column B is going to be the sub-partition, the HDFS directory would look like user/hive/warehouse/database/table/colA/colB, or /colB/colA otherwise (considering it as a managed table).
Restructuring the HDFS directories manually won't be a great idea, because it would become a nightmare to scan the data in all the files and organize it into the corresponding folders.
Below is one way of doing it:
1. Create a new table with the new structure, i.e., with partitions on Col A and Col B.
CREATE TABLE NEWTABLE ( COLUMNS ... ) PARTITIONED BY ( COL_A INT, COL_B INT )
2.a. Insert data from the old table into the new table (created in step #1), like below:
INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE
But yes, if not handled properly, this step is going to consume a lot of resources during execution: HDFS space for storing the results as NEWTABLE data, and of course time.
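A hedged sketch of step 2.a with dynamic partitioning enabled (column names are placeholders; in Hive the partition columns must come last in the SELECT):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE NEWTABLE PARTITION (COL_A, COL_B)
SELECT col1, col2, COL_A, COL_B   -- partition columns last
FROM OLDTABLE;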
OR
2.b. If you think that HDFS will not have enough space to hold all the data, or you are facing a resource crunch, I would suggest doing this INSERT in batches, removing the old data after each INSERT operation.
INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE
WHERE COL_A='abc'
DELETE FROM OLDTABLE
WHERE COL_A='abc'
INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE
WHERE COL_A='def'
DELETE FROM OLDTABLE
WHERE COL_A='def'
... and so on for the remaining values of COL_A.
This way, you unload HDFS of the data that has already been handled and keep the space balanced.
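One caveat with the DELETE statements above: DELETE only works on ACID (transactional) tables. If OLDTABLE is a plain partitioned table, a rough equivalent of the cleanup step is to drop the partition that has already been copied, for example:
ALTER TABLE OLDTABLE DROP IF EXISTS PARTITION (COL_A='abc');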
If you follow step 2.b., you can write a script to automate this process by passing the partition names (derived from SHOW PARTITIONS) dynamically for each run. But try the first couple of partitions manually before going with automation, to make sure things go as expected.
Let me know if it helps!

Multiple Parquet files while writing to Hive Table (Incremental)

I have a Hive table that's partitioned:
CREATE EXTERNAL TABLE IF NOT EXISTS CUSTOMER_PART (
NAME string ,
AGE int ,
YEAR INT)
PARTITIONED BY (CUSTOMER_ID decimal(15,0))
STORED AS PARQUET LOCATION 'HDFS LOCATION'
The first LOAD is done from Oracle to Hive via PySpark using:
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID) SELECT NAME, AGE, YEAR, CUSTOMER_ID FROM CUSTOMER;
This works fine and creates the partitions dynamically during the run. However, loading data incrementally every day creates an individual file for each single record under the partition:
INSERT INTO TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID = 3) SELECT NAME, AGE, YEAR FROM CUSTOMER WHERE CUSTOMER_ID = 3; --Assume this gives me the latest record in the database
Is there a way to have the values appended to the existing Parquet file under the partition until it reaches the block size, without having smaller files created for each insert?
Rewriting the whole partition is one option, but I would prefer not to do this:
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID = 3) SELECT NAME, AGE, YEAR FROM CUSTOMER WHERE CUSTOMER_ID = 3;
The following properties are set for Hive:
set hive.execution.engine=tez; -- TEZ execution engine
set hive.merge.tezfiles=true; -- Notifying that merge step is required
set hive.merge.smallfiles.avgsize=128000000; --128MB
set hive.merge.size.per.task=128000000; -- 128MB
This still doesn't help with daily inserts. Any alternative approach that can be followed would be really helpful.
As per my knowledge, we can't keep a single file for the daily partition data, since the data will be stored in different part files for each day's insert into the partition.
Since you mention that you are importing the data from an Oracle DB, you can import the entire data set from Oracle each time and overwrite it into HDFS. This way you can maintain a single part file.
Also, HDFS is not recommended for small amounts of data.
I could think of the following approaches for this case:
Approach 1:
Recreate the Hive table, i.e. after loading the incremental data into the CUSTOMER_PART table:
Create a temp_CUSTOMER_PART table with the entire snapshot of the CUSTOMER_PART table data.
Overwrite the final CUSTOMER_PART table by selecting from the temp_CUSTOMER_PART table.
In this case, you end up with a final table without small files in it.
NOTE: you need to make sure no new data is being inserted into the CUSTOMER_PART table after the temp table has been created.
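A hedged sketch of Approach 1 (the temp_CUSTOMER_PART name follows the steps above; the dynamic-partition setting may already be enabled in your session):
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Snapshot the current state of the table (CTAS keeps CUSTOMER_ID as a regular column).
CREATE TABLE temp_CUSTOMER_PART STORED AS PARQUET AS
SELECT NAME, AGE, YEAR, CUSTOMER_ID FROM CUSTOMER_PART;

-- Rewrite the final table from the snapshot, compacting the small files per partition.
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID)
SELECT NAME, AGE, YEAR, CUSTOMER_ID FROM temp_CUSTOMER_PART;

DROP TABLE temp_CUSTOMER_PART;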
Approach 2:
Using the input_file_name() function:
Check how many distinct file names there are in each partition, then select only the partitions that have more than, say, 10 files.
Create a temporary table with these partitions and overwrite only those selected partitions in the final table.
NOTE: you need to make sure no new data is being inserted into the CUSTOMER_PART table after the temp table has been created, because we are going to overwrite the final table.
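A hedged sketch of the file-count check (input_file_name() is the Spark SQL function named here; in Hive itself the equivalent is the INPUT__FILE__NAME virtual column):
-- Partitions with more than 10 part files are candidates for rewriting.
SELECT CUSTOMER_ID, COUNT(DISTINCT fname) AS file_cnt
FROM (
  SELECT CUSTOMER_ID, input_file_name() AS fname FROM CUSTOMER_PART
) t
GROUP BY CUSTOMER_ID
HAVING COUNT(DISTINCT fname) > 10;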
Approach 3:
Hive (not Spark) allows overwriting a table while selecting from that same table, i.e.:
insert overwrite table default.t1 partition(partition_column)
select * from default.t1; -- overwrite and select from the same t1 table
If you follow this approach, a Hive job needs to be triggered once your Spark job finishes.
Hive will acquire a lock while running the overwrite/select on the same table, so any job that writes to the table will have to wait.
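If you want to inspect the lock Hive takes during such an overwrite, one way to check (run from the table's database) is:
SHOW LOCKS t1;
SHOW LOCKS t1 EXTENDED;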
In addition, the ORC format offers CONCATENATE, which merges small ORC files into a new, larger file:
alter table <db_name>.<orc_table_name> [partition (partition_column="val")] concatenate;

Multi Table Insert into Single table in Hive

I have a Hive table partitioned on column 'part'. The table has two partition values, part='good' and part='bad'.
I need to move a record from the 'bad' partition into the 'good' partition and overwrite the 'bad' partition to remove that moved record. To complicate this, I am looking for a way to do it in a single query, as exception handling would be difficult otherwise.
I tried to do it with a multi-table insert having two insert queries on the same table, as below:
from tbl_partition
insert into tbl_partition partition (part='good') select a,b,c where a='a' and part='bad' -- this is where a record is moved from bad to good
insert overwrite table tbl_partition partition (part='bad') select a,b,c where part='bad' and a not in ('a'); -- Overwrite the bad partition excluding already moved record
But the above query always does an insert into for both branches, rather than one insert into and the other insert overwrite!
I even tried with a common table expression, using the common table to insert into this table simultaneously, with no luck!
Is there any other way this can be achieved in a single query, or am I doing something wrong in the above step?
Please note that I am doing this on an HDP cluster with Hive 1.2.

Effective way to move data from one table into multiple tables

I have TableA that has millions of records and 40 columns.
I would like to move:
- columns 1-30 into Table B
- columns 31-40 into Table C
This multiple-insert question shows how I assume I should do it:
INSERT INTO TableB (col1, col2, ...)
SELECT c1, c2,...
FROM TableA...
I wanted to know if there is a different/quicker way I could move the data. Essentially, I don't want to wait for one table to finish its insert processing before the other insert statement starts to execute.
I'm afraid there is no way in the SQL standard to have what is often called a T junction at the end of an INSERT .. SELECT; that is the privilege of ETL tools. But ETL tools connect to the database twice, once for each leg of the T junction, and the resulting two INSERT INTO tab_x VALUES (?,?,?,?) statements run in parallel.
Which brings me to a possible solution that could make sense:
Create two scripts. One runs INSERT INTO table_b1 SELECT col1, col2 FROM table_a;. The other runs INSERT INTO table_b2 SELECT col3, col4 FROM table_a;. Then, since it's SQL Server, launch two isql sessions in parallel, each running its own script.
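A hedged sketch of those two scripts, using the TableA/TableB/TableC names from the question (column lists abbreviated):
-- script_b.sql
INSERT INTO TableB (col1, col2 /* ... col30 */)
SELECT col1, col2 /* ... col30 */ FROM TableA;

-- script_c.sql
INSERT INTO TableC (col31, col32 /* ... col40 */)
SELECT col31, col32 /* ... col40 */ FROM TableA;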