Merge Existing Partitions in Hive

How can I merge existing partitions into a single partition?
For example, I have partitions on the year column: year=2011, year=2012, year=2013, year=2014.
My requirement is to merge the partitions from 2011 to 2013 into one,
so that I end up with only 2 partitions: 2013 and 2014.
Please help.
Regards,
Manoj

1) Create a new target table (called target_table below), partitioned by the merged year column.
2) Insert the data into the target table with dynamic partition loading:
insert overwrite table target_table partition (partition_year)
select col1, col2 ..., case when year between 2011 and 2013 then 2013
when year >= 2014 then 2014
end as partition_year from source_table;
3) Drop source_table.
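As a rough sketch, the three steps above could look like the following in HiveQL. The column names and types (col1 STRING, col2 INT, year INT) and the name target_table are assumptions for illustration only; adapt them to the real schema.

```sql
-- Assumed schema: source_table(col1 STRING, col2 INT), partitioned by year INT.
-- Dynamic partitioning must be enabled for PARTITION (partition_year) without a value.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Step 1: new table, partitioned by the merged year.
CREATE TABLE target_table (
  col1 STRING,
  col2 INT
)
PARTITIONED BY (partition_year INT);

-- Step 2: the dynamic partition column must be the last column in the SELECT.
INSERT OVERWRITE TABLE target_table PARTITION (partition_year)
SELECT
  col1,
  col2,
  CASE
    WHEN year BETWEEN 2011 AND 2013 THEN 2013
    WHEN year >= 2014 THEN 2014
  END AS partition_year
FROM source_table;

-- Check the result before running step 3 (DROP TABLE source_table).
SHOW PARTITIONS target_table;
```

Note that rows with a year below 2011 match neither branch of the CASE, so they get a NULL partition value and land in Hive's default partition. Add an ELSE branch if such rows can exist.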

If your partitioning column year is defined as STRING, then you can simply:
create a new partition, for instance year=History
move the data files directly from directories such as .../year=2011/ into the new directory .../year=History/
drop the partitions that are now empty
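The Hive side of this file-move approach might be sketched as below. It assumes the table is named source_table and that year is a STRING partition column; the file move itself happens outside Hive and is only described in a comment, because the real directory paths are not given.

```sql
-- Assumption: source_table is partitioned by year STRING.
-- Register the new, initially empty partition.
ALTER TABLE source_table ADD IF NOT EXISTS PARTITION (year = 'History');

-- Now move the data files outside Hive (for example with hadoop fs -mv)
-- from each old partition directory into the year=History directory.
-- File names such as 000000_0 tend to repeat across partitions, so the
-- files may need to be renamed during the move to avoid collisions.

-- Drop the partitions whose directories are now empty.
ALTER TABLE source_table DROP IF EXISTS
  PARTITION (year = '2011'),
  PARTITION (year = '2012'),
  PARTITION (year = '2013');

-- Confirm what is left.
SHOW PARTITIONS source_table;
```

On a managed table, dropping a partition also deletes its directory, so make sure the files have really been moved before running the DROP.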

Related

Hive- how do I "create table as select.." with partitions from original table?

I need to create a "work table" from our hive dlk. While I can use:
create table my_table as
select *
from dlk.big_table
just fine, I have a problem with carrying over the partitions (attributes day, month and year) from the original "big_table", or with creating new ones from these attributes.
Searching the web did not really help me answer this question: all "tutorials" or solutions deal either with create as select OR with creating partitions, never both.
Can anybody here please help?
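Creating a partitioned table with CREATE TABLE AS SELECT is not supported in Hive versions before 3.2.0. You can do it in two steps:
1) Create the table:
create table my_table like dlk.big_table;
This creates a table with the same schema, including the partition columns.
2) Load the data: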
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table my_table partition (day, month, year)
select * from dlk.big_table;

Hive - Create Table statement with 'select query' and 'partition by' commands

I want to create a partitioned table in Hive. I know how to create the table structure first with a "Create table ... Partitioned by" command and then insert the data into the table with an "Insert Into Table" command.
What I am trying to do is combine these two commands into a single query like the one below, but it throws errors.
CREATE TABLE test_extract AS
SELECT
*
FROM master_extract
PARTITION BY (year string
,month string)
;
Both Year and Month are two separate columns in the master_extract table.
Is there any way to achieve something like this?
No, this is not possible, because Create Table As Select (CTAS) has restrictions:
The target table cannot be a partitioned table.
The target table cannot be an external table.
The target table cannot be a list bucketing table.
You can create the table separately and then load it with INSERT OVERWRITE.
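A minimal sketch of that two-step workaround, using the table names from the question. The data columns col1 and col2 and all column types are assumptions, since the real schema of master_extract is not shown.

```sql
-- Step 1: create the partitioned table explicitly.
-- Partition columns are declared only in PARTITIONED BY, not in the column list.
CREATE TABLE test_extract (
  col1 STRING,  -- assumed data column
  col2 STRING   -- assumed data column
)
PARTITIONED BY (year STRING, month STRING);

-- Step 2: load it with a dynamic partition insert.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partition columns go last in the SELECT, in the same order as in PARTITIONED BY.
INSERT OVERWRITE TABLE test_extract PARTITION (year, month)
SELECT col1, col2, year, month
FROM master_extract;
```

With SELECT * the partition columns would only line up if year and month happen to be the last two columns of master_extract, which is why the sketch lists the columns explicitly.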
There has been some development since this question was originally asked and answered. As per the Hive documentation: Starting with Hive 3.2.0, CTAS statements can define a partitioning specification for the target table (HIVE-20241).
The related ticket, HIVE-20241, was resolved back in July 2018.
Therefore, if your Hive version is 3.2.0 or higher, you can simply do
CREATE TABLE test_extract PARTITIONED BY (year string, month string) AS
SELECT
col1,
col2,
year,
month
FROM master_extract

Dropping Hive partitions dynamically

I have a Hive table with daily partitions, something like below (which includes partitions for future dates as well)
20160901
20160902
........
........
........
20160931
20161001
20161002
I want to pass in one date, for example yesterday's date 20160922, and dynamically drop all partitions which are >= 20160922 (today is 20160923, but I want to drop starting from 20160922).
How can I drop all these partitions dynamically?
You cannot do this directly in Hive, as it does not support dynamic SQL.
A workaround is to use a shell script (or any other script) to generate a file containing drop partition statements like the ones below.
alter table partition_t drop if exists partition (y=20160922 );
alter table partition_t drop if exists partition (y=20160923 );
alter table partition_t drop if exists partition (y=20160924 );
...
Then run hive -v -f ./file.sh
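Before inserting data into the table, perform the steps below.
1) Go to the HDFS folder of that table and delete the folders matching your pattern inside
the table directory using shell commands: hadoop fs -rm -r <>
2) Run MSCK REPAIR TABLE to update the partition metadata. By default this only adds partitions that are missing from the metastore; to remove metastore entries for deleted folders, newer Hive versions offer MSCK REPAIR TABLE ... DROP PARTITIONS or SYNC PARTITIONS, and on older versions the partitions still have to be dropped with ALTER TABLE.
Together, these two steps remove the partitions that match the pattern.
Now insert your new data.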
You can drop partitions by giving a range filter. For reference, see this answer: https://stackoverflow.com/a/48422251/3132181
So your code could look like this:
Alter table mytable drop partition (datehour >= '20160922')
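To pass the cutoff date in from outside, as the question asks, one option is Hive variable substitution combined with the range filter. This is only a sketch: it assumes the table is partition_t with a STRING partition column y holding yyyyMMdd values (names borrowed from the earlier answer), and the script name drop_partitions.hql and variable name cutoff are made up.

```sql
-- Hypothetical invocation from the shell:
--   hive --hivevar cutoff=20160922 -f drop_partitions.hql

-- Partitions before the drop.
SHOW PARTITIONS partition_t;

-- Drop every partition at or after the cutoff date.
-- For a STRING column the comparison is lexicographic, which matches
-- date order only because the values are zero-padded yyyyMMdd.
ALTER TABLE partition_t DROP IF EXISTS PARTITION (y >= '${hivevar:cutoff}');

-- Partitions after the drop.
SHOW PARTITIONS partition_t;
```

Support for comparison operators in DROP PARTITION varies between Hive versions and partition column types, so it is worth trying this on a test table first.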

Add partitions on existing hive table

I'm processing a big Hive table (more than 500 billion records).
The processing is too slow and I would like to make it faster.
I think that by adding partitions, the process could be more efficient.
Can anybody tell me how I can do that?
Note that my table already exists.
My table :
create table T(
nom string,
prenom string,
...
date string)
Partitioning on date field.
Thx
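SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE T_partitioned PARTITION(date) select nom, prenom, ..., date from T;
Note:
T_partitioned stands for a new table created with the same columns and PARTITIONED BY on the date field; an existing unpartitioned table cannot be partitioned in place.
In the insert statement for a partitioned table, make sure that the partition columns come last in the select clause.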
You have to restructure the table. Here are the steps:
1) Make sure no other process is writing to the table.
2) Create a new external table using partitioning.
3) Insert into the new table by selecting from the old table.
4) Drop the new (external) table; only the table definition is dropped, the data stays in place.
5) Drop the old table.
6) Create the table with the original name, pointing to the location used in step 2.
7) Run the repair command to fix all the metadata.
Alternative to steps 4, 5, 6 and 7:
1) Create the table with the original name by running show create table on the new table and replacing the name with the original table name.
2) Run the LOAD DATA INPATH command to move the files under the partitions to the new partitions of the new table.
3) Drop the external table created earlier.
Both approaches achieve the restructuring with a single insert/MapReduce job.
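The first approach (the main sequence of steps) might be sketched as follows. The columns are cut down to the two named in the question, and the table name T_new and the location /data/T_partitioned are hypothetical.

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- New external, partitioned table at a hypothetical location.
-- `date` is backtick-quoted because DATE is a reserved word in newer Hive versions.
CREATE EXTERNAL TABLE T_new (
  nom    STRING,
  prenom STRING
)
PARTITIONED BY (`date` STRING)
LOCATION '/data/T_partitioned';

-- One insert job rewrites the data; the partition column goes last.
INSERT OVERWRITE TABLE T_new PARTITION (`date`)
SELECT nom, prenom, `date`
FROM T;

-- Dropping an external table removes only its metadata; the files stay.
DROP TABLE T_new;

-- Drop the old table (this deletes its data if T is a managed table).
DROP TABLE T;

-- Recreate T over the same location.
CREATE EXTERNAL TABLE T (
  nom    STRING,
  prenom STRING
)
PARTITIONED BY (`date` STRING)
LOCATION '/data/T_partitioned';

-- Register the partition directories in the metastore.
MSCK REPAIR TABLE T;
```

Hive limits how many dynamic partitions one job may create (hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode), so a table with many distinct dates may need those limits raised or the insert split into date ranges.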

How to Update/Drop a Hive Partition?

After adding a partition to an external table in Hive, how can I update/drop it?
You can update a Hive partition by, for example:
ALTER TABLE logs PARTITION(year = 2012, month = 12, day = 18)
SET LOCATION 'hdfs://user/darcy/logs/2012/12/18';
This command does not move the old data, nor does it delete the old data. It simply sets the partition to the new location.
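To see the effect, the partition's location can be inspected before and after the change. In the sketch below the host namenode:8020 is a placeholder, not a value from the answer; some Hive versions reject a SET LOCATION value that is not a complete URI with a scheme.

```sql
-- Show where the partition currently points (look for the Location: line).
DESCRIBE FORMATTED logs PARTITION (year = 2012, month = 12, day = 18);

-- Repoint the partition; namenode:8020 is a placeholder host and port.
ALTER TABLE logs PARTITION (year = 2012, month = 12, day = 18)
SET LOCATION 'hdfs://namenode:8020/user/darcy/logs/2012/12/18';

-- Confirm that the metadata now shows the new location.
DESCRIBE FORMATTED logs PARTITION (year = 2012, month = 12, day = 18);
```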
To drop a partition, you can do
ALTER TABLE logs DROP IF EXISTS PARTITION(year = 2012, month = 12, day = 18);
In addition, you can drop multiple partitions in one statement (Dropping multiple partitions in Impala/Hive).
Extract from above link:
hive> alter table t drop if exists partition (p=1),partition (p=2),partition(p=3);
Dropped the partition p=1
Dropped the partition p=2
Dropped the partition p=3
OK
EDIT 1:
Also, you can drop partitions in bulk using a comparison operator (>, <, <>), for example:
Alter table t
drop partition (PART_COL>1);
Alter table table_name drop partition (partition_column = 'value');
To update the data, you can either copy files into the folder where the external partition is located or use an
INSERT OVERWRITE TABLE tablename1 PARTITION (partcol1=val1, partcol2=val2...)...
statement.
You may also need to make the database containing the table the active one:
use [dbname]
Otherwise you may get this error (even if you specify the database, i.e. dbname.table):
FAILED Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter partition. Unable to alter partitions because table or database does not exist.
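A short sketch of that workaround, reusing the logs table from the earlier answer; dbname is a placeholder for the real database name.

```sql
-- Switch to the database that owns the table (dbname is a placeholder).
USE dbname;

-- Confirm which database is active.
SELECT current_database();

-- Run the partition DDL with the unqualified table name.
ALTER TABLE logs DROP IF EXISTS PARTITION (year = 2012, month = 12, day = 18);
```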