What is the user experience when I exchange partitions in Apache Hive?
Is it atomic, or does it consist of multiple discrete steps such as:
partition rename
data copy
old partition drop
table repair
?
Related
Is it even possible to add a partition to an existing table in Athena that currently is without partitions? If so, please also write syntax for doing so in the answer.
For example:
ALTER TABLE table1 ADD PARTITION (ourDateStringCol = '2021-01-01')
The above command will give the following error:
FAILED: SemanticException table is not partitioned but partition spec exists
Note: I have done a web search, and variants exist for SQL Server, or for adding a partition to an already partitioned table. However, I could not find a case where someone successfully added a partition to an existing non-partitioned table.
This is extremely similar to:
SemanticException adding partiton Hive table
However, the answer given there requires re-creating the table.
I want to do so without re-creating the table.
Partitions in Athena are based on the folder structure in S3. Unlike a standard RDBMS, which loads data onto its own disks or into memory, Athena scans the data where it sits in S3. This is what gives the service its scale and low cost.
This means you have to keep your data in separate folders with a meaningful structure, such as year=2019 and year=2020, and make sure that each folder contains all of the data for its year and nothing else.
The simple solution is to run a CREATE TABLE AS SELECT (CTAS) query that copies the data into a new table optimized for your analytical queries. You can choose the table format (Parquet, for example), the compression (SNAPPY, for example), and the partitioning scheme (per year, for example).
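As a rough sketch of what such a CTAS query could look like, here is a small Python helper that only builds the statement text. The table, column, and bucket names are hypothetical, and the WITH properties (format, write_compression, external_location, partitioned_by) are the Athena CTAS properties as I understand them, so check them against the current Athena documentation before relying on this.

```python
def build_ctas(new_table, source_table, data_columns, partition_columns,
               external_location, file_format="PARQUET", compression="SNAPPY"):
    """Return the text of an Athena CTAS statement that rewrites
    source_table into a new, partitioned table."""
    # Athena expects the partition columns to come last in the SELECT list.
    select_list = ", ".join(list(data_columns) + list(partition_columns))
    partitioned_by = ", ".join(f"'{col}'" for col in partition_columns)
    return (
        f"CREATE TABLE {new_table}\n"
        f"WITH (\n"
        f"  format = '{file_format}',\n"
        f"  write_compression = '{compression}',\n"
        f"  external_location = '{external_location}',\n"
        f"  partitioned_by = ARRAY[{partitioned_by}]\n"
        f")\n"
        f"AS SELECT {select_list}\n"
        f"FROM {source_table}"
    )


# Hypothetical names, for illustration only.
print(build_ctas(
    new_table="table1_partitioned",
    source_table="table1",
    data_columns=["col_a", "col_b"],
    partition_columns=["year"],
    external_location="s3://example-bucket/table1_partitioned/",
))
```

Note that a single CTAS query in Athena can write only a limited number of partitions (100, as far as I know), so a table with more partitions than that may need the CTAS followed by INSERT INTO queries for the rest.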
I have a daily ingestion of data into HDFS. From that data I generate Hive tables partitioned by date and another column. One day is about 130 GB of data. After generating the data, I run msck repair. Each msck run now takes more than 2 hours. As I understand it, msck scans the whole table's data (we have about 200 days of data) and then updates the metadata. My question is: is there a way to make msck scan only the last day's data and then update the metadata, to speed up the whole process? By the way, there is no small-files issue; I already merge the small files before running msck.
When you create an external table or repair/recover partitions with this configuration:
set hive.stats.autogather=true;
Hive scans each file in the table location to get statistics and it can take too much time.
The solution is to switch it off before you create or alter the table or recover partitions:
set hive.stats.autogather=false;
See these related tickets: HIVE-18743, HIVE-19489, HIVE-17478
If you need statistics, you can gather statistics only for new partitions if necessary using
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
COMPUTE STATISTICS
See details here: ANALYZE TABLE
Also, if you know which partitions should be added, use ALTER TABLE ADD PARTITION. You can add many partitions in a single command.
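To illustrate adding many partitions in one command, here is a minimal Python sketch that builds a single ALTER TABLE ... ADD PARTITION statement from a list of partitions. The table name, column names, and paths are made up, and all partition values are quoted as strings, which is an assumption that may not fit every column type.

```python
def add_partitions_sql(table, partitions):
    """Build one ALTER TABLE ... ADD PARTITION statement for many partitions.

    partitions is a list of (spec, location) pairs, where spec maps
    partition column -> value and location may be None.
    """
    clauses = []
    for spec, location in partitions:
        spec_sql = ", ".join(f"{col}='{val}'" for col, val in spec.items())
        clause = f"PARTITION ({spec_sql})"
        if location:
            clause += f" LOCATION '{location}'"
        clauses.append(clause)
    # IF NOT EXISTS lets the statement be re-run without failing on
    # partitions that were already added.
    return f"ALTER TABLE {table} ADD IF NOT EXISTS\n  " + "\n  ".join(clauses)


# Hypothetical table and paths, for illustration only.
print(add_partitions_sql("events", [
    ({"dt": "2021-01-01", "region": "eu"}, "/data/events/dt=2021-01-01/region=eu"),
    ({"dt": "2021-01-01", "region": "us"}, "/data/events/dt=2021-01-01/region=us"),
]))
```

For a daily ingestion this means only the new day's partitions are registered, and no scan of the older directories is needed.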
Is there any number of partitions we would expect this command
MSCK REPAIR TABLE tablename;
to fail on?
I have a system that currently has over 27k partitions. When the schema changes for the Athena table, we drop the table, recreate it with, say, the new column(s) tacked onto the end, and then run
MSCK REPAIR TABLE tablename;
We had no luck with this command doing any work whatsoever after we let it run for 5 hours. Not a single partition was added. I'm wondering if anyone has information about a partition limit we may have hit but that we can't find documented anywhere.
MSCK REPAIR TABLE is an extremely inefficient command. I really wish the documentation didn't encourage people to use it.
What to do instead depends on a number of things that are unique to your situation.
In the general case I would recommend writing a script that performs S3 listings, constructs a list of partitions with their locations, and uses the Glue API BatchCreatePartition to add the partitions to your table.
When your S3 location contains lots of files, like it sounds yours does, I would either use S3 Inventory to avoid listing everything, or list objects with a delimiter of / so that I could list only the directory/partition structure part of the bucket and skip listing all files. 27K partitions can be listed fairly quickly if you avoid listing everything.
Glue's BatchCreatePartition is a bit annoying to use since you have to specify all columns, the serde, and everything else for each partition, but it's faster than running ALTER TABLE … ADD PARTITION … and waiting for query execution to finish, and ridiculously faster than MSCK REPAIR TABLE ….
When it comes to adding new partitions to an existing table you should also never use MSCK REPAIR TABLE, for mostly the same reasons. Almost always when you add new partitions to a table you know the location of the new partitions, and ALTER TABLE … ADD PARTITION … or Glue's BatchCreatePartition can be used directly with no scripting necessary.
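Here is a hedged sketch of how the per-partition input for BatchCreatePartition could be assembled in Python. It assumes you copy the table's own StorageDescriptor (columns, serde, formats) and only change the Location for each partition, and it splits the list into chunks of 100 because, as far as I know, that is the maximum number of partitions per call. The boto3 calls are left as comments so the block runs without AWS access; the database and table names in them are placeholders.

```python
import copy


def partition_inputs(table_storage_descriptor, partitions):
    """Build Glue PartitionInput dicts from (values, location) pairs,
    reusing the table's storage descriptor for every partition."""
    inputs = []
    for values, location in partitions:
        sd = copy.deepcopy(table_storage_descriptor)  # don't mutate the table's copy
        sd["Location"] = location
        inputs.append({"Values": list(values), "StorageDescriptor": sd})
    return inputs


def batches(items, size=100):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Usage sketch with boto3 (not executed here; names are placeholders):
#
#   import boto3
#   glue = boto3.client("glue")
#   table = glue.get_table(DatabaseName="my_db", Name="my_table")["Table"]
#   inputs = partition_inputs(table["StorageDescriptor"], partitions)
#   for batch in batches(inputs):
#       glue.batch_create_partition(
#           DatabaseName="my_db", TableName="my_table",
#           PartitionInputList=batch,
#       )
```

The response of each call reports per-partition errors (for example, partitions that already exist), so a real script would want to inspect it rather than assume success.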
If the process that adds new data is separate from the process that adds new partitions, I would recommend setting up S3 notifications to an SQS queue and periodically reading the messages, aggregating the locations of new files and constructing the list of new partitions from that.
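The aggregation step could look roughly like the following Python sketch. It takes the object keys of new files (for example, as extracted from the notification messages; the message parsing itself is not shown) and reduces them to the distinct partitions they belong to. It assumes a Hive-style key layout such as logs/year=2021/month=01/file.parquet, and the bucket and column names are hypothetical.

```python
def partitions_from_keys(bucket, keys, partition_columns):
    """Reduce object keys to the distinct partitions they fall under.

    Returns a sorted list of (values, location) pairs, where values is a
    tuple of partition values in the order of partition_columns.
    """
    found = {}
    for key in keys:
        segments = key.split("/")[:-1]  # directory part only, file name dropped
        # Column name for each "col=value" segment, None for other segments.
        names = [s.split("=", 1)[0] if "=" in s else None for s in segments]
        try:
            idx = [names.index(col) for col in partition_columns]
        except ValueError:
            continue  # key is not under a complete partition path
        values = tuple(segments[i].split("=", 1)[1] for i in idx)
        location = f"s3://{bucket}/" + "/".join(segments[:max(idx) + 1]) + "/"
        found[values] = location  # many files collapse into one partition
    return sorted(found.items())


# Hypothetical keys, for illustration only.
new_keys = [
    "logs/year=2021/month=01/a.parquet",
    "logs/year=2021/month=01/b.parquet",
    "logs/year=2021/month=02/c.parquet",
]
for values, location in partitions_from_keys("example-bucket", new_keys, ["year", "month"]):
    print(values, location)
```

The resulting (values, location) pairs are what you would then pass to ALTER TABLE … ADD PARTITION … or to the Glue API.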
As an example, consider that I have data on all the major sports events that have happened. The schema is given below:
EventName,Date,Month,Year,City
This data is physically structured in HDFS by year, date, month.
Now I want to create virtual partitions on it based on some other column value, e.g. City. The data will still be stored physically in HDFS in the year, date, month structure only, but my metadata keeps track of the virtual partitions.
Can hive metastore do it for me?
I don't think that is possible. Partitioning in Hive means creating a different directory for each partition. The metastore only contains the table's metadata; it does not control the actual data. Whenever you query a Hive table by a partition column, the query runs only against the directories of the matching partitions. With virtual partitioning that leaves the HDFS structure unchanged, the real data is not separated by that column, so the query has to run over the entire data set. So, technically, no optimisation happens at all.
Can anyone please suggest how to create a partitioned table in BigQuery?
Example: Suppose I have log data in Google Cloud Storage for the year 2016. I stored all the data in one bucket, partitioned by year, month, and date. I want to create a table partitioned by date.
Thanks in Advance
Documentation for partitioned tables is here:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables
In this case, you'd create a partitioned table and populate the partitions with the data. You can run a query job that reads from GCS (and filters data for the specific date) and writes to the corresponding partition of a table. For example, to load data for May 1st, 2016 -- you'd specify the destination_table as table$20160501.
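Since one job per day is needed, it can help to generate the destination names programmatically. This is a minimal Python sketch that only builds the table$YYYYMMDD partition decorators for a date range; the table name "logs" is hypothetical, and submitting the actual query or load jobs is not shown.

```python
from datetime import date, timedelta


def partition_decorator(table, day):
    """Return the destination name for one daily partition, e.g. table$20160501."""
    return f"{table}${day:%Y%m%d}"


def daily_partitions(table, start, end):
    """Return the decorated destination for every day from start to end, inclusive."""
    days = (end - start).days + 1
    return [partition_decorator(table, start + timedelta(days=n)) for n in range(days)]


# Hypothetical table name, for illustration only.
destinations = daily_partitions("logs", date(2016, 1, 1), date(2016, 12, 31))
print(destinations[0], destinations[-1], len(destinations))
```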
Currently, you'll have to run several query jobs to achieve this process. Please note that you'll be charged for each query job based on bytes processed.
Please see this post for some more details:
Migrating from non-partitioned to Partitioned tables
There are two options:
Option 1
You can load each daily file into its own table, named YourLogs_YYYYMMDD
See details on how to Load Data from Cloud Storage
After the tables are created, you can access them using either Table wildcard functions (Legacy SQL) or a Wildcard Table (Standard SQL). See also Querying Multiple Tables Using a Wildcard Table for more examples
Option 2
You can create a Date-Partitioned Table (just one table, YourLogs), but you will still need to load each daily file into its respective partition. See Creating and Updating Date-Partitioned Tables
After the table is loaded, you can easily Query Date-Partitioned Tables
Having partitions for an External Table is not allowed as of now. There is a Feature Request for it:
https://issuetracker.google.com/issues/62993684
(please vote for it if you're interested in it!)
Google says that they are considering it.