Multiple Parquet files while writing to Hive Table (Incremental) - hive

I have a Hive table that is partitioned:
CREATE EXTERNAL TABLE IF NOT EXISTS CUSTOMER_PART (
NAME string ,
AGE int ,
YEAR INT)
PARTITIONED BY (CUSTOMER_ID decimal(15,0))
STORED AS PARQUET LOCATION 'HDFS LOCATION'
The first load is done from Oracle to Hive via PySpark using:
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID) SELECT NAME, AGE, YEAR, CUSTOMER_ID FROM CUSTOMER;
This works fine and creates the partitions dynamically during the run. The problem is that loading data incrementally every day creates an individual file for each single record under the partition.
INSERT INTO TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID = 3) SELECT NAME, AGE, YEAR FROM CUSTOMER WHERE CUSTOMER_ID = 3; --Assume this gives me the latest record in the database
Is there a way to have the values appended to the existing Parquet file under the partition until it reaches the block size, without having smaller files created for each insert?
Rewriting the whole partition is one option, but I would prefer not to do this:
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID = 3) SELECT NAME, AGE, YEAR FROM CUSTOMER WHERE CUSTOMER_ID = 3;
The following properties are set for Hive:
set hive.execution.engine=tez; -- TEZ execution engine
set hive.merge.tezfiles=true; -- Notifying that merge step is required
set hive.merge.smallfiles.avgsize=128000000; --128MB
set hive.merge.size.per.task=128000000; -- 128MB
These settings still don't help with the daily inserts. Any alternate approach that can be followed would be really helpful.

As far as I know, we can't store a single file for the daily partition data, since the data will be stored in different part files for each day's insert into the partition.
Since you mention that you are importing the data from an Oracle DB, you can import the entire data set from Oracle each time and overwrite it into HDFS. This way you can maintain a single part file.
Also, HDFS is not recommended for small amounts of data.

I could think of the following approaches for this case:
Approach 1:
Recreate the Hive table, i.e. after loading incremental data into the CUSTOMER_PART table:
Create a temp_CUSTOMER_PART table with an entire snapshot of the CUSTOMER_PART table data.
Overwrite the final CUSTOMER_PART table by selecting from the temp_CUSTOMER_PART table, as sketched below.
In this case you end up with a final table without small files in it.
NOTE: you need to make sure no new data is being inserted into the CUSTOMER_PART table after the temp table has been created.
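A minimal HiveQL sketch of this snapshot-and-overwrite cycle, assuming dynamic partitioning is enabled (the table and column names are taken from the question; verify the overwrite before dropping the snapshot):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- snapshot the current state of the partitioned table
CREATE TABLE temp_CUSTOMER_PART STORED AS PARQUET AS
SELECT * FROM CUSTOMER_PART;

-- rewrite every partition from the snapshot; this compacts the small files
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID)
SELECT NAME, AGE, YEAR, CUSTOMER_ID FROM temp_CUSTOMER_PART;

-- clean up once the overwrite has been verified
DROP TABLE temp_CUSTOMER_PART;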
Approach 2:
Using the input_file_name() function:
Check how many distinct file names there are in each partition, then select only the partitions that have more than some threshold (10, etc.) of files, as sketched below.
Create a temporary table with these partitions and overwrite only the selected partitions in the final table.
NOTE: you need to make sure no new data is being inserted into the CUSTOMER_PART table after the temp table has been created, because we are going to overwrite the final table.
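A rough Spark SQL sketch of the file-count check from the step above (the threshold of 10 is just an example; plain Hive exposes the same information through the INPUT__FILE__NAME virtual column):
-- partitions whose number of distinct part files exceeds the threshold
SELECT CUSTOMER_ID, COUNT(DISTINCT fname) AS file_count
FROM (
  SELECT CUSTOMER_ID, input_file_name() AS fname
  FROM CUSTOMER_PART
) t
GROUP BY CUSTOMER_ID
HAVING COUNT(DISTINCT fname) > 10;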
Approach 3:
Hive (not Spark) allows overwriting a table while selecting from the same table, i.e.:
insert overwrite table default.t1 partition(partition_column)
select * from default.t1; -- overwrite and select from the same t1 table
If you follow this approach, a Hive job needs to be triggered once your Spark job finishes.
Hive will acquire a lock while running the overwrite/select on the same table, so any other job writing to the table will have to wait.
In addition: the ORC format offers CONCATENATE, which will merge small ORC files into a new, larger file.
alter table <db_name>.<orc_table_name> [partition (<partition_column>="val")] concatenate;


Copy parquet file content into an SQL temp table and include partition key as column

I have multiple parquet files in S3 that are partitioned by date, like so:
s3://mybucket/myfolder/date=2022-01-01/file.parquet
s3://mybucket/myfolder/date=2022-01-02/file.parquet
and so on.
All of the files follow the same schema, except for some that are missing columns, which is why I am using FILLRECORD (to fill the missing columns with NULL values in case a column is not present). Now I want to load the content of all these files into an SQL temp table in Redshift, like so:
DROP TABLE IF EXISTS table;
CREATE TEMP TABLE table
(
var1 bigint,
var2 bigint,
date timestamp
);
COPY table
FROM 's3://mybucket/myfolder/'
access_key_id 'id' secret_access_key 'key'
PARQUET FILLRECORD;
The problem is that date is not a column inside the parquet files (it only appears in the S3 path), which is why the date column in the resulting table is NULL. I am trying to find a way to get the partition date inserted into the temp table.
Is there any way to do this?
I believe there are only 2 approaches to this:
Perform N COPY commands, one per S3 partition value, and populate the date column with the partition key value as a literal. A simple script can issue the SQL to Redshift. The issue with this is that you are issuing many COPY commands, and if each partition in S3 has only one parquet file (or a few files), this will not take advantage of Redshift's parallelism.
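A hedged sketch of that per-partition route, using a staging table so the date can be added as a literal on insert (the staging table name and the specific date are illustrative; a script would repeat this per partition):
CREATE TEMP TABLE staging (var1 bigint, var2 bigint);

COPY staging
FROM 's3://mybucket/myfolder/date=2022-01-01/'
access_key_id 'id' secret_access_key 'key'
PARQUET FILLRECORD;

INSERT INTO table  -- the temp table from the question
SELECT var1, var2, TIMESTAMP '2022-01-01 00:00:00' FROM staging;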
Define the region of S3 with the partitioned parquet files as a Redshift partitioned external table and then INSERT INTO the local table with a SELECT from the external table. The external table knows about the partition key and can insert this information into the local table. The downside is that you need to define the external schema and table, and if this is a one-time process, you will want to tear these down afterwards.
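And a rough sketch of the external-table route with Redshift Spectrum (the external schema name, Glue database, and IAM role are assumptions for illustration):
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG DATABASE 'mydb'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE spectrum.myfolder (var1 bigint, var2 bigint)
PARTITIONED BY ("date" date)
STORED AS PARQUET
LOCATION 's3://mybucket/myfolder/';

-- each partition has to be registered (this can be scripted per date)
ALTER TABLE spectrum.myfolder
ADD IF NOT EXISTS PARTITION ("date"='2022-01-01')
LOCATION 's3://mybucket/myfolder/date=2022-01-01/';

-- the partition key now comes back as a normal column
INSERT INTO table
SELECT var1, var2, "date" FROM spectrum.myfolder;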
There are some other ways to attack this, but none that are worth the effort, or they will be very slow.

Is it possible to change partition metadata in HIVE?

This is an extension of a previous question I asked: How to compare two columns with different data type groups
We are exploring the idea of changing the metadata on the table as opposed to performing a CAST operation on the data in SELECT statements. Changing the metadata in the MySQL metastore is easy enough. But, is it possible to have that metadata change applied to partitions (they are daily)? Otherwise, we might be stuck with current and future data being of type BIGINT while the historical is STRING.
Question: Is it possible to change partition meta data in HIVE? If yes, how?
You can change partition column type using this statement:
alter table {table_name} partition column ({column_name} {column_type});
Also, you can re-create the table definition and change all column types using these steps:
Make your table external, so it can be dropped without dropping the data:
ALTER TABLE abc SET TBLPROPERTIES('EXTERNAL'='TRUE');
Drop the table (only the metadata will be removed).
Create the EXTERNAL table using the updated DDL with the types changed and with the same LOCATION.
Recover the partitions:
MSCK [REPAIR] TABLE tablename;
The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is:
ALTER TABLE tablename RECOVER PARTITIONS;
This will add the Hive partition metadata. See the manual here: RECOVER PARTITIONS
And finally, you can make your table MANAGED again if necessary:
ALTER TABLE tablename SET TBLPROPERTIES('EXTERNAL'='FALSE');
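Putting those steps together, here is a hedged end-to-end sketch for the table abc mentioned above (the column list, partition column, storage format, and LOCATION path are all illustrative; reuse your real DDL and the table's existing location):
ALTER TABLE abc SET TBLPROPERTIES('EXTERNAL'='TRUE');
DROP TABLE abc;                          -- removes metadata only; the files stay in HDFS
CREATE EXTERNAL TABLE abc (
  id BIGINT,                             -- the column whose type was changed
  name STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/abc';     -- the table's existing location
MSCK REPAIR TABLE abc;                   -- recover the partition metadata
ALTER TABLE abc SET TBLPROPERTIES('EXTERNAL'='FALSE');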
Note: all commands above should be run in HUE, not MySQL.
You cannot change the partition column in Hive; in fact, Hive does not support altering partition columns.
Refer to: altering partition column type in Hive
You can think of it this way:
- Hive stores the data by creating folders in HDFS named after the partition column values
- If you try to alter a Hive partition column, you are trying to change the whole directory structure and data of the Hive table, which is not possible
For example, if you have partitioned on year, this is how the directory structure looks:
tab1/clientdata/2009/file2
tab1/clientdata/2010/file3
If you want to change the partition column, you can perform the steps below.
Create another Hive table with the required change in the partition column:
CREATE TABLE new_table (A int, .....) PARTITIONED BY (B string);
Load the data from the previous table:
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO new_table PARTITION (B) SELECT A, B FROM Prev_table;

Hive - Huge 10TB table repartitioning (Adding new partition columns)

Techies,
Background -
We have an existing 10 TB Hive table which has been range partitioned on column A. The business case has changed and now requires adding partition column B in addition to column A.
Problem statement -
Since the data on HDFS is huge and needs to be restructured to inherit the new partition column B, it is difficult to copy the table to a backup and re-ingest it with a simple Impala INSERT OVERWRITE into the main table.
We want to explore whether there is an efficient way to handle adding partition columns to such a huge table.
Alright!
If I understand your situation correctly, you have a table backed by 10 TB of data in HDFS, partitioned on column A, and you want to add a partition on column B as well.
So, if column B is going to be the sub-partition, the HDFS directory would look like user/hive/warehouse/database/table/colA/colB, or /colB/colA otherwise (considering it as a managed table).
Restructuring the HDFS directories manually isn't a great idea, because it would become a nightmare to scan the data in all the files and organize it into the corresponding folders.
Below is one way of doing it:
1. Create a new table with the new structure, i.e., with partitions on col A and col B.
CREATE TABLE NEWTABLE ( COLUMNS ... ) PARTITIONED BY ( COL_A INT, COL_B INT )
2.a. Insert data from the old table to the new table (created in Step #1) like below,
INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE
But yes, this step is going to consume a lot of resources during execution if not handled properly, plus space in HDFS for storing the results as data for NEWTABLE, and of course time.
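For reference, a minimal HiveQL sketch of step 2.a with dynamic partitioning switched on (the non-partition column names are placeholders; the partition columns must come last in the SELECT):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE NEWTABLE PARTITION (COL_A, COL_B)
SELECT col_x, col_y, COL_A, COL_B   -- partition columns listed last
FROM OLDTABLE;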
OR
2.b. If you think that HDFS will not have enough space to hold all the data, or you are facing a resource crunch, I would suggest you do this INSERT in batches, removing the old data after each INSERT operation.
INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE
WHERE COL_A='abc'
DELETE FROM OLDTABLE
WHERE COL_A='abc'
INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE
WHERE COL_A='def'
DELETE FROM OLDTABLE
WHERE COL_A='def'
... and so on.
This way, you can free HDFS of the data that has already been handled and keep the space balanced.
If you follow step 2.b., you can write a script to automate this process by passing the partition names (derived from SHOW PARTITIONS) dynamically for each run. But try the first couple of batches manually before going with automation, to make sure things go as expected.
Let me know if it helps!

Need to change the partition column to another column and reload the data into new partitions

I am trying to change the already existing partition column to another column.
The current workflow I'm using:
Backup the existing data
Create a new table with new partition column
Reload the data into new partitions
My problem:
Since there is a huge amount of data in our existing partitioned tables, this approach will be costly.
Is there a way we can do an ALTER TABLE and change the partition column to another one?
You cannot avoid the one-time cost of scanning the table, as you can see from the error message generated by this CREATE OR REPLACE DDL command:
#standardSQL
CREATE OR REPLACE TABLE `project.dataset.table`
PARTITION BY DATE(ts)
AS
SELECT * FROM `project.dataset.table`
Cannot replace a table with a different partitioning spec. Instead, DROP the table, and then recreate it. New partitioning spec is interval(type:day,field:ts) and existing spec is none
What you can do to save cost is use a WHERE clause to limit the number of partitions you move from the existing table to the new table:
CREATE TABLE project.mydataset.newPartitionTable
PARTITION BY date
OPTIONS (
partition_expiration_days=365,
description="Table with a new partition"
) AS
SELECT * from `project.dataset.table` WHERE
_PARTITIONTIME >= '2019-01-23 00:00:00'
AND _PARTITIONTIME <= '2019-01-23 00:00:00'
You can consider, for example, not moving your long-term storage, which is data you haven't accessed in the last 90 days (see this link for more details).
If you want to keep your original table name, you can drop/create it with the new partition field after the copy, and use the copy option from the web UI, which is free of charge.

Will hive dynamic partitioning update all partitions?

I want to use Hive dynamic partitioning to overwrite a partitioned table "page_view":
INSERT OVERWRITE TABLE page_view PARTITION(date)
SELECT pvs.viewTime, pvs.date FROM page_view_stg pvs
My question is: if the table "page_view_stg" only has data for "date=2017-01-01", but the destination table "page_view" has a partition "date=2017-01-02", then after running this query, will the partition "date=2017-01-02" get dropped or not? If not, how should I handle this case using dynamic partitioning?
Thanks
A query with dynamic partitioning will overwrite only the partitions that exist in the source dataset. In your case, the partition "date=2017-01-02" will remain unchanged if the source table does not contain that date. If you want to drop it, the fastest method is to execute an ALTER TABLE ... DROP PARTITION statement, because this is a metadata-only operation: select the partitions from the target table which do not exist in the source and generate the drop statements using a shell script. Alternatively, insert into a new table, drop the old target, then rename.
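For the example above, dropping the stale partition is a one-liner (note that for a managed table this also removes the partition's data files):
ALTER TABLE page_view DROP IF EXISTS PARTITION (date='2017-01-02');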