We currently generate a daily CSV export that we upload to an S3 bucket, into the following structure:
<report-name>
|-- reportDate-<date-stamp>
    |-- part0.csv.gz
    |-- part1.csv.gz
We want to be able to run reports partitioned by daily export.
According to this page, you can partition data in Redshift Spectrum by a key which is based on the source S3 folder where your Spectrum table sources its data. However, from the example, it looks like you need an ALTER statement for each partition:
alter table spectrum.sales_part
add partition(saledate='2008-01-01')
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2008-01/';
alter table spectrum.sales_part
add partition(saledate='2008-02-01')
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/';
Is there any way to set the table up so that data is automatically partitioned by the folder it comes from, or do we need a daily job to ALTER the table to add that day's partition?
Solution 1:
At most 20,000 partitions can be created per table. You can create a one-time script that adds partitions (up to that 20k limit) for all future S3 partition folders.
For example:
Even if the folder s3://bucket/tickit/spectrum/sales_partition/saledate=2017-12/ doesn't exist yet, you can still add a partition for it:
alter table spectrum.sales_part
add partition(saledate='2017-12-01')
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2017-12/';
Solution 2:
https://aws.amazon.com/blogs/big-data/data-lake-ingestion-automatically-partition-hive-external-tables-with-aws/
A more precise way to go about it:
Create a Lambda function that is triggered by the ObjectCreated notification on the S3 bucket and runs the SQL to add the partition:
alter table tblname ADD IF NOT EXISTS PARTITION (partition clause) LOCATION 's3://mybucket/location'
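For instance, with the sales_part table from the Redshift Spectrum example above, the statement issued by the Lambda function might look like this (the bucket name and prefix are illustrative):
alter table spectrum.sales_part
add if not exists partition(saledate='2017-12-01')
location 's3://mybucket/tickit/spectrum/sales_partition/saledate=2017-12/';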
Related
I have an external table with a complex datatype, map(string,array(struct)), and I'm able to select and query this external table without any issue.
However, when I try to load this data into a managed table, it runs forever. Is there any best approach to load this data into a managed table in Hive?
CREATE EXTERNAL TABLE DB.TBL(
id string ,
list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>>
) LOCATION <path>
BTW, you can convert the table to managed (though this may not work on the Cloudera distribution due to the warehouse dir restriction):
use DB;
alter table TBL SET TBLPROPERTIES('EXTERNAL'='FALSE');
If you need to load into another managed table, you can simply copy the files into its location.
--Create managed table (or use existing one)
use db;
create table tbl_managed(
  id string,
  list map<string,array<struct<ID:string,col:boolean,col2:string,col3:string,col4:string>>>
);
--Check table location
use db;
desc formatted tbl_managed;
This will print the location along with other info; use it to copy the files.
Copy all files from the external table location into the managed table location. This is the most efficient approach, much faster than INSERT ... SELECT:
hadoop fs -cp external/location/path/* managed/location/path
After copying the files, the table will be selectable. You may want to analyze the table to compute statistics:
ANALYZE TABLE db_name.tablename COMPUTE STATISTICS [FOR COLUMNS]
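For example, for the tbl_managed table created above (the FOR COLUMNS variant additionally gathers column-level statistics):
use db;
ANALYZE TABLE tbl_managed COMPUTE STATISTICS;
ANALYZE TABLE tbl_managed COMPUTE STATISTICS FOR COLUMNS;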
Is it even possible to add a partition to an existing table in Athena that is currently not partitioned? If so, please also write the syntax for doing so in the answer.
For example:
ALTER TABLE table1 ADD PARTITION (ourDateStringCol = '2021-01-01')
The above command will give the following error:
FAILED: SemanticException table is not partitioned but partition spec exists
Note: I have done a web search, and variants exist for SQL Server, or for adding a partition to an already partitioned table. However, I personally could not find a case where one could successfully add a partition to an existing non-partitioned table.
This is extremely similar to:
SemanticException adding partiton Hive table
However, the answer given there requires re-creating the table.
I want to do so without re-creating the table.
Partitions in Athena are based on the folder structure in S3. Unlike a standard RDBMS that loads the data onto its disks or into memory, Athena is based on scanning data in S3. This is how you enjoy the scale and low cost of the service.
What this means is that you have to put your data in different folders in a meaningful structure, such as year=2019 and year=2020, and make sure that the data for each year is all, and only, in that folder.
The simple solution is to run a CREATE TABLE AS SELECT (CTAS) query that will copy the data and create a new table that can be optimized for your analytical queries. You can choose the table format (Parquet, for example), the compression (SNAPPY, for example), and also the partition schema (per year, for example).
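A minimal sketch of such a CTAS query in Athena, assuming the table1 and ourDateStringCol names from the question (the output location and the selected columns are illustrative):
CREATE TABLE table1_partitioned
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-bucket/table1_partitioned/',
  partitioned_by = ARRAY['year']
) AS
SELECT
  col1,
  col2,
  substr(ourDateStringCol, 1, 4) AS year  -- partition columns must come last in the SELECT
FROM table1;
Note that the partition column has to be the last column in the SELECT list, and the external_location must point to an empty prefix.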
I am new to Hive and have a basic question:
I am trying to create an external table on an HDFS directory at location
/projects/score/output/scores_2020-06-30.gzip
but it is not being treated as a partition.
Does the developer need to change the directory name to "scores=yyyy-mm-dd" in place of "scores_yyyy-mm-dd.gzip",
like "/projects/score/output/scores=2020-06-30",
so that it would be considered partitioned?
i.e. Is it mandatory to have '=' in the directory name for an external table to treat it as a partition?
Or can I change something in the table definition below at creation time? Trying as below:
CREATE EXTERNAL TABLE IF NOT EXISTS XYZ (
...
)
PARTITIONED BY (scores STRING)
LOCATION '/projects/score/output/';
Thanks in advance!
You can define a partition on top of any location, even outside the table directory, using ALTER TABLE ADD PARTITION. A partition in HDFS is a directory, usually inside the table location but not necessarily. If it is inside the table directory, then you can use MSCK REPAIR TABLE to attach existing subdirectories inside the table directory as partitions; it will scan the table location and add the partition metadata.
In your example the partition directory is missing; you have only the table directory with a file inside it. The filename does not matter in this case.
It is not absolutely mandatory to have the partition directory in the format key=value. Though MSCK REPAIR TABLE may not work in your Hive distribution in that case, you can still add partitions using the ALTER TABLE ADD PARTITION ... LOCATION command on top of any directory, as sketched below.
It may depend on the vendor. For example, on Qubole, ALTER TABLE RECOVER PARTITIONS (the EMR alternative to MSCK REPAIR TABLE) works fine with directories like '2020-06-30'.
By default, when inserting data using dynamic partitioning, Hive creates partition folders in the format key=value, but if you are creating partition directories with some other tool, 'value' alone as the partition folder name is okay. Just check whether MSCK REPAIR works in your case; if it does not, create key=value directories if you need MSCK REPAIR.
The name of the file(s) and the number of files inside a partition folder do not matter in this context.
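A minimal sketch, assuming the XYZ table from the question and a hypothetical directory /projects/score/output/2020-06-30/ holding that day's file:
-- Attach an arbitrarily named directory (no key=value naming needed) as a partition
ALTER TABLE XYZ ADD IF NOT EXISTS
PARTITION (scores='2020-06-30')
LOCATION '/projects/score/output/2020-06-30/';

-- Or, if the directories do follow the scores=yyyy-mm-dd convention under the table location,
-- discover them all at once
MSCK REPAIR TABLE XYZ;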
After copying ORC files into a table's folder with an HDFS copy, how do I update that Hive table so that the new data shows up when querying with Hive?
Best Regards.
If the table is not partitioned, then once the files are in HDFS in the folder specified in the LOCATION clause, the data should be available for querying.
If the table is partitioned, then you first need to run an ADD PARTITION statement.
As mentioned in the answer above by belostoky, if the table is not partitioned then you can directly query your table with the updated data.
But if your table is partitioned, you need to add the partitions to the Hive table first.
You can use an ALTER TABLE statement to add partitions, as shown below:
ALTER TABLE table1
ADD PARTITION (dt='<date>')
location '<hdfs file path>'
Once the partition directories are in place, the Hive metastore needs to be made aware of the changes, so you can also run
msck repair table table1
to register the partitions in the metastore.
Once that is done, you can query your data.
I have created a Hive table with dynamic partitioning on a column. Is there a way to load the data directly from files using the "LOAD DATA" statement? Or do we have to depend only on creating a non-partitioned intermediate table, loading the file data into it, and then inserting the data from this intermediate table into the partitioned table, as mentioned in Hive loading in partitioned table?
No, the LOAD DATA command ONLY copies the files to the destination directory. It doesn't read the records of the input file, so it CANNOT do partitioning based on record values.
If your input data is already split into multiple files based on partitions, you can copy the files directly to the table location in HDFS, under partition directories you create manually (or just point to their current location in the case of an EXTERNAL table), and use the following ALTER command to ADD the partition. This way you can skip the LOAD DATA statement altogether.
ALTER TABLE <table-name>
ADD PARTITION (<...>)
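For example, assuming the table is partitioned by a dt column and the files were copied into a hypothetical directory /warehouse/mydb.db/mytable/dt=2020-06-30/:
ALTER TABLE mytable
ADD IF NOT EXISTS PARTITION (dt='2020-06-30')
LOCATION '/warehouse/mydb.db/mytable/dt=2020-06-30/';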
There is no other way; if we need to insert directly, we'll need to specify the partitions manually.
For dynamic partitioning, we need a staging table and then insert from there.
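A minimal sketch of that staging pattern (the table and column names are illustrative; the two SET commands are the standard Hive dynamic-partitioning properties):
-- Hypothetical non-partitioned staging table matching the input file layout
CREATE TABLE stage_tbl (id string, val string, dt string);
LOAD DATA INPATH '/path/to/input/files' INTO TABLE stage_tbl;

-- Enable dynamic partitioning and insert into the partitioned table
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE partitioned_tbl PARTITION (dt)
SELECT id, val, dt FROM stage_tbl;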