Modify partition column value - hive

Can I modify the value of a partitioned table, just by changing the name of the partition directory?
The table I now have has year and month as partitions. The values were stored as decimal so the partitions are "2016.0" instead of just "2016" and "3.0" instead of "3".
Can I just rename the directories and have the values in the partitions updated?

First rename the directories:
hadoop fs -mv /dev/year=2016.0 /dev/year=2016
hadoop fs -mv /dev/year=2016/month=4.0 /dev/year=2016/month=4
Let the hive metastore know about new location/partition:
ALTER TABLE logs PARTITION(year = 2014, month = 4)
SET LOCATION 'hdfs://dev/year=2016/month=4';

Related

Alter Hive partition column name NOT changed in HDFS

When alter the partition column name of the partition table(named partitioned_table), the corresponding directory in the HDFS does not change. However, the deletion and movement of partitions can be changed in the HDFS.And the the column name is changed using "show partitioin partitioned_table".
Hive version is 4.0.0-alpha-2.
Use the below statement to alter partiton column name.
ALTER TABLE table_name PARTITION
(partition_column = partition_col_value,
partition_column = partition_col_value)
RENAME TO PARTITION (partition_column = partition_col_value,
partition_column = partition_col_value);
Why and how to change the corresponding directory in HDFS when alter partition column name in Hive.
When you alter a partition, it only affects the Hive Metastore, and will never affect data in HDFS. For that, you need to explicitly insert data into the Hive table at that partition, or issue an hdfs mv command, then MSCK REPAIR Hive query to fix the metadata

How to create table over partitioned data

I have text file with snappy compression partitioned by field 'process_time' (result of Flume job). Example: hdfs://data/mytable/process_time=25-04-2019
This is my script for create table:
CREATE EXTERNAL TABLE mytable
(
...
)
PARTITIONED BY (process_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/mytable/'
TBLPROPERTIES("textfile.compress"="snappy");
The result of queries against this table are allways 0 (but I know that there are some data). Any help?
Thanks!
As you are creating external table on top of HDFS directory then to add the partitions to the hive table we need to run either of these commands.
if any partition added to HDFS directly(instead of using insert queries) then hive doesn't know about the newly added partitions, so we need to run either msck (or) add partitions to add newly added partitions to hive table.
To add all partitions to hive table:
hive> msck repair table <db_name>.<table_name>;
(or)
To manually add each partition to hive table:
hive> alter table <db_name>.<table_name> add partition(process_time="25-04-2019")
location '/data/mytable/process_time=25-04-2019';
For more details refer to this link.

Can i move data from one hive partition to another partition of the same table

My partition is based on year/month/date. Using SimpleDateFormat for week year created a wrong partition . The data for the date 2017-31-12 was moved to 2018-31-12 using YYYY in the date format.
SimpleDateFormat sdf = new SimpleDateFormat("YYYY-MM-dd");
So what I want is to move my data from partition 2018/12/31 to 2017/12/31 of the same table. I did not find any relevant documentation to do the same.
From what I understood, you would like to move the data from 2018-12-31 partition to 2017/12/31. Below is my explanation of how you can do it.
#From Hive/Beeline
ALTER TABLE TableName PARTITION (PartitionCol=2018-12-31) RENAME TO PARTITION (PartitionCol=2017-12-31);
FromSparkCode, You basically have to initiate the hiveContext and run the same HQL from it. You can refer one my answer here on how to initiate the hive Context.
#If you want to do on HDFS level, below is one of the approaches
#FromHive/beeline run the below HQL
ALTER TABLE TableName ADD IF NOT EXISTS PARTITION (PartitionCol=2017-12-31);
#Now from HDFS Just move the data in 2018 to 2017 partition
hdfs dfs -mv /your/table_hdfs/path/schema.db/tableName/PartitionCol=2018-12-31/* /your/table_hdfs/path/schema.db/tableName/PartitionCol=2017-12-31/
#removing the 2018 partition if you require
hdfs dfs -rm -r /your/table_hdfs/path/schema.db/tableName/PartitionCol=2018-12-31
#You can also drop from beeline/hive
alter table tableName drop if exists partition (PartitionCol=2018-12-31);
#At the end repair the table
msck repair table tableName
Why do i have to repair the table ??
There is a JIRA related to that https://issues.apache.org/jira/browse/SPARK-19187. Upgrade your spark version to 2.0.1 should fix the problem

Athena table with multiple locations

My data are distributed over multiple directories and multiple tab-separated files within those directories. The general structure looks like this:
s3://bucket_name/directory/{year}{month}/{iso_2}/{year}{month}{day}_table.bcp.gz
where {year} is the 4-digit year, {month} is the 2-digit month, {day} is the 2-digit day and {iso_2} is the ISO2 country code.
How do I set this up as a table in Athena?
Athena uses Hive DDL, so you just need to run a normal Hive create statement:
CREATE EXTERNAL TABLE table_name(
col_1 string,
...
col_n string)
PARTITIONED BY (
year_month string,
iso_2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://bucket_name/directory/';
Then register these directories as new partitions to the required table by running MSCK REPAIR TABLE table_name. If this fails for some reason (which it sometimes does in Athena) you'll need to run all the add partition statements for your existing directories:
ALTER TABLE table_name ADD PARTITION
(year_month=201601,iso=US) LOCATION 's3://bucket_name/directory/201601/US/';
ALTER TABLE table_name ADD PARTITION
(year_month=201602,iso=US) LOCATION 's3://bucket_name/directory/201602/US/';
ALTER TABLE table_name ADD PARTITION
(year_month=201601,iso=GB) LOCATION 's3://bucket_name/directory/201601/GB/';
etc.

S3 folder structure for hive partitioned tables without "=" in it

I have an existing S3 folder structure like this,
s3://mydata/{country}/{date}/
{country} could be any of 30 different countries
{date} could be any date since 20150101
How can I read this in Hive by treating {country} as partition and {date} as sub partition ?
You can use the Hive DDL statement ALTER TABLE ADD PARTITION
ALTER TABLE mydata
ADD PARTITION (country='south-africa', date='20191024')
LOCATION 's3://mydata/south-africa/20191024/';
You can script this using a shell script, and passing each statement to Hive like hive -e 'ALTER TABLE $TABLE ADD PARTITION $PARTITION_SPEC LOCATION $PARTITION_LOCATION'
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddPartitions