MSCK REPAIR on a big table takes a very long time - Hive

I have a daily ingestion of data into HDFS. From that data I generate Hive tables partitioned by date and another column. One day holds about 130 GB of data. After generating the data, I run MSCK REPAIR. Now every MSCK run takes more than 2 hours. My understanding is that MSCK scans the whole table's data (we have about 200 days of data) and then updates the metadata. My question is: is there a way to have MSCK scan only the last day's data and then update the metadata, to speed up the whole process? By the way, there is no small-files issue; I already merge the small files before running MSCK.

When you create an external table or repair/recover partitions with this configuration:
set hive.stats.autogather=true;
Hive scans each file in the table location to gather statistics, and this can take a very long time.
The solution is to switch it off before the CREATE/ALTER TABLE or recover-partitions command:
set hive.stats.autogather=false;
See these related tickets: HIVE-18743, HIVE-19489, HIVE-17478
If you need statistics, you can gather them only for the new partitions using
ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
COMPUTE STATISTICS
See details here: ANALYZE TABLE
Also, if you know which partitions should be added, use ALTER TABLE ADD PARTITION - you can add many partitions in a single command.
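Putting these pieces together, a minimal sketch of the recommended flow might look like the following (the table name mydb.events and the dt partition column are hypothetical; adjust for your own partition scheme):

set hive.stats.autogather=false;

-- Option 1: recover all missing partitions without the per-file statistics scan
MSCK REPAIR TABLE mydb.events;

-- Option 2: if you know which partitions are new, register them directly
-- (several partitions can be added in a single statement)
ALTER TABLE mydb.events ADD IF NOT EXISTS
  PARTITION (dt='2021-01-01')
  PARTITION (dt='2021-01-02');

-- Gather statistics only for the newly added partition if you need them
ANALYZE TABLE mydb.events PARTITION (dt='2021-01-02') COMPUTE STATISTICS;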

Related

Add new partition scheme to existing table in Athena with SQL code

Is it even possible to add a partition to an existing table in Athena that currently has no partitions? If so, please also include the syntax for doing so in your answer.
For example:
ALTER TABLE table1 ADD PARTITION (ourDateStringCol = '2021-01-01')
The above command will give the following error:
FAILED: SemanticException table is not partitioned but partition spec exists
Note: I have done a web search, and variants exist for SQL Server, or for adding a partition to an already partitioned table. However, I personally could not find a case where one could successfully add a partition to an existing non-partitioned table.
This is extremely similar to:
SemanticException adding partition Hive table
However, the answer given there requires re-creating the table.
I want to do so without re-creating the table.
Partitions in Athena are based on the folder structure in S3. Unlike a standard RDBMS, which loads data onto its own disks or into memory, Athena scans data directly in S3. This is how you get the scale and low cost of the service.
What this means is that you have to keep your data in separate folders with a meaningful structure, such as year=2019 and year=2020, and make sure that the data for each year is all, and only, in that folder.
The simple solution is to run a CREATE TABLE AS SELECT (CTAS) query that will copy the data and create a new table that can be optimized for your analytical queries. You can choose the table format (Parquet, for example), the compression (SNAPPY, for example), and also the partition schema (per year, for example).
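For example, a CTAS along these lines (table, column, and bucket names are hypothetical) copies the data into a new partitioned Parquet table:

CREATE TABLE table1_partitioned
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-bucket/table1_partitioned/',
  partitioned_by = ARRAY['year']
) AS
SELECT col1, col2, year
FROM table1;

Note that in an Athena CTAS the partition columns must be listed last in the SELECT.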

AWS Athena MSCK REPAIR TABLE tablename command

Is there any number of partitions we would expect this command
MSCK REPAIR TABLE tablename;
to fail on?
I have a system that currently has over 27k partitions. When the schema for the Athena table changes, we drop the table, recreate it with, say, the new column(s) tacked onto the end, and then run
MSCK REPAIR TABLE tablename;
We had no luck with this command doing any work whatsoever, even after we let it run for 5 hours. Not a single partition was added. Wondering if anyone has information about a partition limit we may have hit but that isn't documented anywhere.
MSCK REPAIR TABLE is an extremely inefficient command. I really wish the documentation didn't encourage people to use it.
What to do instead depends on a number of things that are unique to your situation.
In the general case I would recommend writing a script that performs S3 listings, constructs a list of partitions with their locations, and uses the Glue API BatchCreatePartition to add the partitions to your table.
When your S3 location contains lots of files, as it sounds like yours does, I would either use S3 Inventory to avoid listing everything, or list objects with a delimiter of / so that I could list only the directory/partition structure part of the bucket and skip listing all files. 27K partitions can be listed fairly quickly if you avoid listing everything.
Glue's BatchCreatePartition is a bit annoying to use since you have to specify all columns, the serde, and everything for each partition, but it's faster than running ALTER TABLE … ADD PARTITION … and waiting for query execution to finish, and it is ridiculously faster than MSCK REPAIR TABLE ….
When it comes to adding new partitions to an existing table you should also never use MSCK REPAIR TABLE, for mostly the same reasons. Almost always when you add new partitions to a table you know the location of the new partitions, and ALTER TABLE … ADD PARTITION … or Glue's BatchCreatePartition can be used directly with no scripting necessary.
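For example, a couple of known partitions can be added with explicit locations in a single Athena statement (the table name, partition values, and bucket here are hypothetical):

ALTER TABLE sales ADD IF NOT EXISTS
  PARTITION (dt = '2021-01-01') LOCATION 's3://my-bucket/sales/dt=2021-01-01/'
  PARTITION (dt = '2021-01-02') LOCATION 's3://my-bucket/sales/dt=2021-01-02/';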
If the process that adds new data is separate from the process that adds new partitions, I would recommend setting up S3 notifications to an SQS queue and periodically reading the messages, aggregating the locations of new files and constructing the list of new partitions from that.

Schedule deleting a BQ table

I am streaming data into BQ. Every day I run a scheduled job in Dataprep that takes 24 hours of data, modifies some of it, and creates a new table in the BQ dataset with those 24 hours of data.
The original table, though, stays unmodified and keeps gathering data.
What I would like to do is delete all rows in the original table after Dataprep makes the copy, so that a fresh 24 hours of streamed data can be gathered.
How can I automate this? I can't seem to find anything in Dataprep that drops the original table and creates a new one.
You can do this by setting up your table as a partitioned table, since you are ingesting data constantly.
One option is to delete a specific partition manually (the value after $ is the partition decorator, e.g. the date in YYYYMMDD format):
bq rm '[YOUR_DATASET].[YOUR_TABLE]$xxxxxxx'
And by setting a partition expiration time you can control when the data in each partition is deleted automatically:
bq update --time_partitioning_expiration [INTEGER] [YOUR_PROJECT_ID]:[YOUR_DATASET].[YOUR_TABLE]
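If you create the table yourself, you can also declare the partitioning and the expiration up front with DDL (dataset, table, and column names here are hypothetical):

CREATE TABLE mydataset.streaming_data (
  event_time TIMESTAMP,
  payload STRING
)
PARTITION BY DATE(event_time)
OPTIONS (partition_expiration_days = 1);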
You could use a Scheduled Query to clear out the table:
https://cloud.google.com/bigquery/docs/scheduling-queries
Scheduled queries support DML and DDL, so you could schedule a query that deletes all the rows from this table, or drops the table entirely, on a daily basis at a specific time.
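The scheduled query body can be as simple as the following (project, dataset, and table names are hypothetical; BigQuery's DELETE requires a WHERE clause, so WHERE TRUE is used to remove every row):

DELETE FROM `my-project.my_dataset.streaming_data`
WHERE TRUE;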

What happens to a user query while I exchange partitions in Hive?

What is the user experience when I exchange partitions in Apache Hive?
Is it atomic, or is it a discrete process consisting of multiple steps, like:
partition rename
data copy
old partition drop
table repair
?
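For reference, the operation being asked about is the Hive statement below (table names and partition value are hypothetical), which moves an existing partition's data and metadata from the source table into the destination table:

ALTER TABLE target_table EXCHANGE PARTITION (dt = '2021-01-01') WITH TABLE source_table;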

MSCK REPAIR for Hive external tables

I have a daily ingestion of data into HDFS.
From that data I generate Hive external tables partitioned by date.
My question is as follows: should I run MSCK REPAIR TABLE tablename after each data ingestion? In that case I would have to run the command every day.
Or is running it just once, at table creation, enough?
Thanks a lot for your answers
Best regards
You only need to run MSCK REPAIR TABLE when the partition structure of the external table changes, i.e. when new partitions appear outside of Hive. This command updates the metadata of the table.
A typical example:
You use a field dt, which represents a date, to partition the table.
Yesterday, you inserted some data with dt=2018-06-12, so you should run MSCK REPAIR TABLE to update the metadata and make Hive aware of the new partition dt=2018-06-12.
Today, you insert some data with dt=2018-06-13, so you should run MSCK REPAIR TABLE again to make Hive aware of the new partition dt=2018-06-13.
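So the daily routine after each ingestion could be either of the following (the table name is hypothetical; since you already know the new partition value, ALTER TABLE ADD PARTITION is a cheaper alternative to a full MSCK REPAIR):

-- Discover any partition directories that exist in HDFS but not yet in the metastore
MSCK REPAIR TABLE my_external_table;

-- Or register just the partition loaded today
ALTER TABLE my_external_table ADD IF NOT EXISTS PARTITION (dt='2018-06-13');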