I have a Hive table with daily partitions, something like below (which includes future dates' partitions as well):
20160901
20160902
........
........
........
20160930
20161001
20161002
I want to pass in one date, for example yesterday's date 20160922, and dynamically drop all partitions that are >= 20160922 (today is 20160923, but I want to drop starting from 20160922).
How can I drop all these partitions dynamically?
You cannot do this in Hive directly, as it does not support dynamic SQL.
There is a workaround using a shell script (or any other script): create a file containing drop-partition statements like the ones below:
alter table partition_t drop if exists partition (y=20160922);
alter table partition_t drop if exists partition (y=20160921);
alter table partition_t drop if exists partition (y=20160920);
...
then run it with: hive -v -f ./file.hql
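For instance, a minimal shell sketch that generates such a file from the table's actual partition list (the table and column names partition_t and y match the statements above; the file name drop_parts.hql and the awk filter are illustrative):
CUTOFF=20160922
hive -S -e "show partitions partition_t" \
  | awk -F'=' -v c="$CUTOFF" '$2 >= c { print "alter table partition_t drop if exists partition (y=" $2 ");" }' \
  > drop_parts.hql
hive -v -f ./drop_parts.hql
This works because show partitions prints one y=yyyymmdd line per partition, and fixed-width yyyymmdd strings compare correctly as strings.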
Before inserting data into the table, perform the steps below.
1) Go to the HDFS folder of that table and delete all the directories inside the table directory using shell commands: hadoop fs -rm -r <>
2) Run MSCK REPAIR TABLE to update the metadata about partitions.
The above two steps will delete all the available partitions (or only those matching a pattern you pass to rm).
Now insert your new data.
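For example, a minimal sketch of those two steps (the warehouse path here is hypothetical; use your table's actual location):
hadoop fs -rm -r '/user/hive/warehouse/mydb.db/partition_t/*'
hive -e "MSCK REPAIR TABLE partition_t;"
Note that on Hive 3+ you may need MSCK REPAIR TABLE partition_t SYNC PARTITIONS so that the metastore also drops partitions whose directories were removed.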
You can drop partitions by giving a range filter. For reference, see this answer: https://stackoverflow.com/a/48422251/3132181
So your code could look like this:
Alter table mytable drop partition (datehour >= '20160922')
I have an external table and now I want to add partitions to it. I have 224 unique city ids, and I would like to just write alter table my_table add partition (cityid) location /path; but Hive complains, saying that I don't provide a value for the city id; it should be e.g. alter table my_table add partition (cityid=VALUE) location /path;. I don't want to run an alter table command for every value of city id, so how can I do it for all ids in one go?
This is what the Hive command line looks like:
hive> alter table pavel.browserdata add partition (cityid) location '/user/maria_dev/data/cityidPartition';
FAILED: ValidationFailureSemanticException table is not partitioned but partition spec exists: {cityid=null}
A partition at the physical level is a location (a separate location for each value, usually of the form key=value) containing data files. If you already have a partition directory structure with files, all you need to do is create the partitions in the Hive metastore: point your table to the root directory using ALTER TABLE ... SET LOCATION, then run the MSCK REPAIR TABLE command. The equivalent command on Amazon Elastic MapReduce (EMR)'s version of Hive is ALTER TABLE table_name RECOVER PARTITIONS. This will add the Hive partition metadata. See the manual here: RECOVER PARTITIONS
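For example, a minimal sketch using the table and path from the question (this assumes the directories under that location are laid out as cityid=<value>):
ALTER TABLE pavel.browserdata SET LOCATION '/user/maria_dev/data/cityidPartition';
MSCK REPAIR TABLE pavel.browserdata;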
If you only have a non-partitioned table with data in its location, then adding partitions will not work, because the data needs to be reloaded. You need to:
Create another partitioned table and use insert overwrite to load the partition data using a dynamic partition load:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table table2 partition(cityid)
select col1, ... colN,
cityid
from table1; -- partition columns should be last in the select
This is quite an efficient way to reorganize your data.
After this, you can delete the source table and rename the target table.
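For example, to finish the swap (using the table1/table2 names from the snippet above):
DROP TABLE table1;
ALTER TABLE table2 RENAME TO table1;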
I have an existing S3 folder structure like this,
s3://mydata/{country}/{date}/
{country} could be any of 30 different countries
{date} could be any date since 20150101
How can I read this in Hive, treating {country} as a partition and {date} as a sub-partition?
You can use the Hive DDL statement ALTER TABLE ... ADD PARTITION:
ALTER TABLE mydata
ADD PARTITION (country='south-africa', date='20191024')
LOCATION 's3://mydata/south-africa/20191024/';
You can script this with a shell script, passing each statement to Hive like: hive -e 'ALTER TABLE $TABLE ADD PARTITION $PARTITION_SPEC LOCATION $PARTITION_LOCATION'
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddPartitions
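A minimal sketch of such a script (the table name mydata, the fixed date, and the short country list are illustrative; the real script would loop over all 30 countries and the full date range):
TABLE=mydata
DT=20191024
for COUNTRY in south-africa kenya brazil; do
  hive -e "ALTER TABLE $TABLE ADD IF NOT EXISTS PARTITION (country='$COUNTRY', \`date\`='$DT') LOCATION 's3://mydata/$COUNTRY/$DT/';"
done
Note the backticks around date: it is a reserved keyword in recent Hive versions, so it may need to be quoted when used as a partition column.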
I use hive-0.10.0-cdh-4.7.0 in my environment.
I have a table named test, stored as a sequence file, with some partitions by date_dim like below:
game=Test/date_dim=2014-07-01
game=Test/date_dim=2014-07-11
game=Test/date_dim=2014-07-21
game=Test/date_dim=2014-07-31
I want to drop the partitions between 2014-07-11 and 2014-07-30 with this SQL command:
alter table test drop partition (date_dim>='2014-07-11',date_dim<='2014-07-30')
I expect these 2 partitions to be deleted:
game=Test/date_dim=2014-07-11
game=Test/date_dim=2014-07-21
But actually, these 3 partitions are deleted:
game=Test/date_dim=2014-07-01
game=Test/date_dim=2014-07-11
game=Test/date_dim=2014-07-21
It seems Hive's drop partition only uses the date_dim <= '2014-07-30' condition.
Is there any way to make Hive drop the partitions as I wish?
You should convert the string to a date type; for that purpose you can use the unix_timestamp function:
alter table test drop partition (
  unix_timestamp(date_dim,'yyyy-MM-dd') >= unix_timestamp('2014-07-11','yyyy-MM-dd'),
  unix_timestamp(date_dim,'yyyy-MM-dd') <= unix_timestamp('2014-07-30','yyyy-MM-dd')
)
I have created an external table in Hive at this location:
CREATE EXTERNAL TABLE tb
(
...
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/cloudera/data';
The data is present in the folder, but when I query the table, it returns nothing. The table is structured in a way that fits the data.
SELECT * FROM tb LIMIT 3;
Is there a kind of permission issue with Hive tables: do specific users have permissions to query some tables?
Do you know of any solutions or workarounds?
You have created your table as a partitioned table based on the column datehour, but you are putting your data directly into /user/cloudera/data. Hive will look for the data in /user/cloudera/data/datehour=(some int value). Since it is an external table, Hive will not update the metastore automatically; you need to run an alter statement to update it.
So here are the steps for external tables with partitions:
1.) In you external location /user/cloudera/data, create a directory datehour=0909201401
OR
Load data using: LOAD DATA [LOCAL] INPATH '/path/to/data/file' INTO TABLE partition(datehour=0909201401)
2.) After creating your table run a alter statement:
ALTER TABLE ADD PARTITION (datehour=0909201401)
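Putting the two steps together, a minimal shell sketch (the file name data.txt is hypothetical):
hadoop fs -mkdir -p /user/cloudera/data/datehour=0909201401
hadoop fs -put data.txt /user/cloudera/data/datehour=0909201401/
hive -e "ALTER TABLE tb ADD PARTITION (datehour=0909201401);"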
Hope it helps...!!!
When we create an EXTERNAL TABLE with a PARTITION, we have to ALTER the EXTERNAL TABLE with the data location for that given partition. However, it need not be the same path as the one we specify while creating the EXTERNAL TABLE.
hive> ALTER TABLE tb ADD PARTITION (datehour=0909201401)
    > LOCATION '/user/cloudera/data/somedatafor_datehour';
When we specify LOCATION '/user/cloudera/data' (though it's optional) while creating an EXTERNAL TABLE, we gain the ability to run repair operations on that table. So when we copy files into that directory through some process like ETL, we can sync the partitions up with the EXTERNAL TABLE instead of writing an ALTER TABLE statement to create each new partition.
If we already know the directory structure of the partition that Hive would create, we can simply place the data file in that location, e.g. '/user/cloudera/data/datehour=0909201401/data.txt', and run the statement shown below:
hive> MSCK REPAIR TABLE tb;
The above statement will sync the partitions up with the Hive metastore for the table "tb".
After adding a partition to an external table in Hive, how can I update/drop it?
You can update a Hive partition by, for example:
ALTER TABLE logs PARTITION(year = 2012, month = 12, day = 18)
SET LOCATION 'hdfs://user/darcy/logs/2012/12/18';
This command does not move the old data, nor does it delete the old data. It simply sets the partition to the new location.
To drop a partition, you can do
ALTER TABLE logs DROP IF EXISTS PARTITION(year = 2012, month = 12, day = 18);
In addition, you can drop multiple partitions with one statement (see Dropping multiple partitions in Impala/Hive).
Extract from the above link:
hive> alter table t drop if exists partition (p=1),partition (p=2),partition(p=3);
Dropped the partition p=1
Dropped the partition p=2
Dropped the partition p=3
OK
EDIT 1:
Also, you can drop partitions in bulk using a comparison operator (>, <, <>), for example:
Alter table t
drop partition (PART_COL>1);
Alter table table_name drop partition (partition_column=value);
You can either copy files into the folder where the external partition is located, or use an
INSERT OVERWRITE TABLE tablename1 PARTITION (partcol1=val1, partcol2=val2...)...
statement.
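For example, a minimal sketch against the tb table from the question (the source table staging_tb and its columns are hypothetical):
INSERT OVERWRITE TABLE tb PARTITION (datehour=0909201401)
SELECT col1, col2 FROM staging_tb;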
You may also need to make the database containing the table active:
use [dbname]
otherwise you may get an error (even if you specify the database, i.e. dbname.table):
FAILED Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter partition. Unable to alter partitions because table or database does not exist.
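For example (mydb is a hypothetical database name):
hive> use mydb;
hive> ALTER TABLE tb ADD PARTITION (datehour=0909201401);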