How to add a partition in Presto? - hive

In Hive I can do it with:
ALTER TABLE xxx ADD PARTITION (datehour='yy')
LOCATION 'zz';
How can I do it in Presto?

Currently, the Presto Hive connector does not provide a means of creating new partitions at arbitrary locations. If your partition location is under the table location, you can use the Presto Hive connector procedures:
system.create_empty_partition -- creates a new empty partition with specified values for the partition keys
system.sync_partition_metadata -- synchronizes the partition list in the Metastore with the partitions on storage
If you want to create/declare partitions somewhere other than under the table's location, please file an issue.
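For reference, invoking these procedures looks roughly like this (a sketch; the catalog name hive and the schema/table names web and page_views are made up here, and exact signatures may vary by Presto version, so check your version's documentation):

CALL hive.system.create_empty_partition(
    schema_name => 'web',
    table_name => 'page_views',
    partition_columns => ARRAY['datehour'],
    partition_values => ARRAY['2018-01-01-00']);

CALL hive.system.sync_partition_metadata(
    schema_name => 'web',
    table_name => 'page_views',
    mode => 'ADD'); -- mode is 'ADD', 'DROP' or 'FULL'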

Related

Hive metastore partitions, how do they work?

I have a couple of queries, please help me understand.
In Hive I see that for a couple of Hive tables the partition information in the cluster and in the metastore are different. What could be the reason?
I used "hive> show partitions " in Hive and "SELECT * FROM PARTITIONS WHERE TBL_ID=;" in the metastore.
For some Hive tables I see fewer partitions in the cluster, while the metastore shows more partitions. In this case, when running a query against the table with a partition filter in the WHERE clause, it gives an error that some partitions are missing.
Whereas for some other Hive tables, the metastore has fewer partitions than the cluster, and in that case the query does not give an error when filtering on a partition in the WHERE clause.
I suppose you are using Cloudera/Impala. The documentation says: If you believe an object exists but you cannot see it in the SHOW output, check with the system administrator if you need to be granted a new privilege for that object.
A table could span multiple different HDFS directories if it is partitioned. The directories could be widely scattered because a partition can reside in an arbitrary HDFS directory based on its LOCATION attribute.
See here: show partitions
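As an illustration of that last point (hypothetical table and path names): a partition can be registered at a directory outside the table's own folder, so browsing the table's HDFS directory will not show it even though the metastore lists it:

ALTER TABLE mytable ADD PARTITION (dt='2017-01-01')
LOCATION '/data/elsewhere/dt=2017-01-01';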

Redshift Spectrum: Automatically partition tables by date/folder

We currently generate a daily CSV export that we upload to an S3 bucket, into the following structure:
<report-name>
|-- reportDate-<date-stamp>
    |-- part0.csv.gz
    |-- part1.csv.gz
We want to be able to run reports partitioned by daily export.
According to this page, you can partition data in Redshift Spectrum by a key which is based on the source S3 folder where your Spectrum table sources its data. However, from the example, it looks like you need an ALTER statement for each partition:
alter table spectrum.sales_part
add partition(saledate='2008-01-01')
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2008-01/';
alter table spectrum.sales_part
add partition(saledate='2008-02-01')
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/';
Is there any way to set the table up so that data is automatically partitioned by the folder it comes from, or do we need a daily job to ALTER the table to add that day's partition?
Solution 1:
At most 20,000 partitions can be created per table. You can create a one-time script to add the partitions (at most 20k) for all the future S3 partition folders.
For example:
Even if the folder s3://bucket/tickit/spectrum/sales_partition/saledate=2017-12/ doesn't exist yet, you can add a partition for it:
alter table spectrum.sales_part
add partition(saledate='2017-12-01')
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2017-12/';
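A one-time script would simply repeat this pattern for a run of future months (hypothetical dates), up to the 20,000-partition limit:

alter table spectrum.sales_part
add partition(saledate='2018-01-01')
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2018-01/';
alter table spectrum.sales_part
add partition(saledate='2018-02-01')
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2018-02/';
-- ...and so on for each future month.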
Solution 2:
https://aws.amazon.com/blogs/big-data/data-lake-ingestion-automatically-partition-hive-external-tables-with-aws/
Another precise way to go about it:
Create a Lambda job that is triggered by the ObjectCreated notification from the S3 bucket, and run the SQL to add the partition:
ALTER TABLE tblname ADD IF NOT EXISTS PARTITION (partition clause) LOCATION 's3://mybucket/location';

How to update a Hive table's data after copying ORC files with HDFS into the folder of that table

After inserting ORC files into the folder of a table with an HDFS copy, how do I update that Hive table's data so those rows are visible when querying with Hive?
Best Regards.
If the table is not partitioned, then once the files are in HDFS in the folder that is specified in the LOCATION clause, the data should be available for querying.
If the table is partitioned, then you first need to run an ADD PARTITION statement.
As mentioned in the answer above by belostoky, if the table is not partitioned then you can directly query your table for the updated data.
But if your table is partitioned, you need to add the partitions to the Hive table first. You can use an ALTER TABLE statement to add partitions, as shown below:
ALTER TABLE table1
ADD PARTITION (dt='<date>')
LOCATION '<hdfs file path>';
Alternatively, to make the Hive metastore aware of every partition already present on HDFS in one step, you can run
MSCK REPAIR TABLE table1;
which adds the missing partitions to the metastore.
Once done, you can query your data.
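Putting it together for a partitioned table, the whole flow might look like this (a sketch with hypothetical file and path names):

hadoop fs -put part-00000.orc /warehouse/table1/dt=2018-01-01/
hive> MSCK REPAIR TABLE table1;
hive> SELECT COUNT(*) FROM table1 WHERE dt='2018-01-01';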

Understanding Partitioning in Hive

I am trying to learn Hive, and while referring to The Hadoop Definitive Guide I ran into some confusion.
As per the text, partitioning in Hive is done by creating sub-directories, one per value of the partitioning column. But since data loading in Hive simply means copying files, and no data validation checks are done during loading (only during querying), does Hive check the data for partitioning? Or how does it determine which file should go to which directory?
Or how does it determine which file should go to which directory?
It doesn't, you have to set the value of the destination partition in the LOAD DATA command. When you perform a LOAD operation into a partitioned table, you have to specify the specific partition (the directory) in which you are going to load the data by means of the PARTITION argument. According to the documentation:
The target being loaded to can be a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns.
For instance, in this example:
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
The two files will be stored in the invites/ds=2008-08-15 and invites/ds=2008-08-08 folders.
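For context, the invites table in the Hive Getting Started guide is declared with ds as its partition column, roughly:

CREATE TABLE invites (foo INT, bar STRING)
PARTITIONED BY (ds STRING);

which is why each LOAD statement above must name a ds value, and why the files land in ds=<value> sub-directories.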

hive 0.13 msck repair table only lists partitions not in metastore

I'm trying to use Hive (0.13)'s MSCK REPAIR TABLE command to recover partitions, and it only lists the partitions not added to the metastore instead of adding them to the metastore as well.
Here's the output of the command:
partitions not in metastore externalexample:CreatedAt=26 04%3A50%3A56 UTC 2014/profileLocation="Chicago"
Here's how I'm creating the external table:
CREATE EXTERNAL TABLE IF NOT EXISTS ExternalExample(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING,
profileLocation STRING,
favc BIGINT,retweet STRING,retcount BIGINT,followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
location '/user/hue/exttable/';
Am I missing something?
I had a similar issue where MSCK REPAIR TABLE listed the partitions that were not in the metastore but did not actually add them (and gave no error message).
I tried manually adding the partition with the ALTER TABLE ADD PARTITION command, and this gave me an error message, leading me to the root cause which was that the HDFS folder containing the 'missing' partition had been set up with incorrect permissions.
Once the permissions issue was resolved, then the MSCK REPAIR TABLE command worked correctly.
If you encounter this issue, it may be worthwhile to try adding it manually with the ALTER TABLE ADD PARTITION command. It may produce a useful error message that would help you determine the root cause of the problem.
Please make sure that the names of the partitions defined in your table definition match the names of the partitions on HDFS.
For example, in your table creation example, I see that you haven't defined any partitions at all.
I think you want to do something like this (note the use of PARTITIONED BY):
create external table ExternalExample(
tweetId BIGINT, username STRING,
txt STRING, favc BIGINT,
retweet STRING, retcount BIGINT, followerscount BIGINT)
PARTITIONED BY (CreatedAt STRING, profileLocation STRING)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
location '/user/hue/exttable/';
Then on hdfs you should have the following folder structure:
/user/hue/exttable/CreatedAt=<someString>/profileLocation=<someString>/your-data-file
The partition names for MSCK REPAIR TABLE ExternalTable should be in lowercase; only then will it add them to the Hive metastore. I faced a similar issue in Hive 1.2.1, where there was no support for ALTER TABLE ExternalTable RECOVER PARTITION. After spending some time debugging, I found that the partition names must be in lowercase, i.e. /some_external_path/mypartition=01 is valid and /some_external_path/myPartition=01 is invalid.
Change profileLocation to profilelocation or profile_location and test it; it should work.
My question is here: Not able to recover partitions through alter table in Hive 1.2
Hive stores a list of partitions for each table in its metastore. If, however, new partitions are directly added to HDFS (manually by hadoop fs -put command), the metastore will not be aware of these partitions.
You need to add each partition:
ALTER TABLE ExternalExample ADD PARTITION (CreatedAt='<value>', profileLocation='<value>');
for every partition, or in short you can run
MSCK REPAIR TABLE ExternalExample;
It will add any partitions that exist on HDFS but not in metastore to the metastore.
Ref https://issues.apache.org/jira/browse/HIVE-874
1) You need to specify partitions.
2) Partition names must be all lowercase letters. See this: https://singhanuvrat.com/hive-partition-column-name-camelcase-bad-idea-b89796d4e741#.16d7uqfot
You might not be running as the hive user:
sudo -u hive hive -e "set hive.msck.path.validation=ignore; msck repair table T1"
set hive.msck.path.validation=ignore; (this is for tables with a large number of partitions.)
You are just missing the PARTITIONED BY (CreatedAt STRING, profileLocation STRING) clause in your table definition.