AWS Athena: does `msck repair table` incur costs? - hive

I have ORC data in S3 that looks like this:
s3://bucket/orc/clientId=client-1/year=2017/month=3/day=16/hour=20/
s3://bucket/orc/clientId=client-2/year=2017/month=3/day=16/hour=21/
s3://bucket/orc/clientId=client-3/year=2017/month=3/day=16/hour=22/
Every hour I run an EMR job that converts raw JSON in S3 to ORC and writes it out with the path partition convention above for Athena ingestion. After the EMR job completes, I run msck repair table so Athena can pick up the new partitions.
I have 3 related questions:
Does running msck repair table in this scenario cost me money in AWS?
The AWS docs say msck repair table can time out. Is there a way I can add a step in Data Pipeline that keeps running this command until it completes successfully?
I would prefer to add the partitions manually to Athena, since I know the year, month, day, and hour I'm working on. However, I do not know the clientId, because there could be anywhere from 1 to X of them and I don't know which ones exist at the time the EMR job runs. Is there a best-practice way to solve this (using Hive or something else)? I could make an S3 API call to get a listing of s3://bucket/orc/ and write code to iterate over the list and add each partition manually (a sketch of that approach follows the ALTER TABLE example below). I'm hoping there is an easier way...
Note: when I say "add partitions manually" I mean doing something like this:
ALTER TABLE <athena table>
ADD PARTITION (clientId='client-1',year=2017,month=3,day=16,hour=20)
LOCATION 's3://bucket/orc/clientId=client-1/year=2017/month=3/day=16/hour=20/';
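For what it's worth, here is a minimal sketch of the "list the clientId prefixes and add each partition yourself" idea, assuming boto3 and using hypothetical names for the Athena database, table, and results location (the bucket and prefix come from the question):
import boto3
# All names below are placeholders: the bucket/prefix match the question,
# the Athena database, table, and results location are assumptions.
BUCKET = "bucket"
PREFIX = "orc/"
DATABASE = "my_db"
TABLE = "my_orc_table"
RESULTS = "s3://bucket/athena-query-results/"
s3 = boto3.client("s3")
athena = boto3.client("athena")
def discover_client_ids():
    # List only the top-level clientId=... "directories", not every object.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX, Delimiter="/"):
        for cp in page.get("CommonPrefixes", []):
            # "orc/clientId=client-1/" -> "client-1"
            yield cp["Prefix"].split("clientId=")[1].rstrip("/")
def add_partitions(year, month, day, hour):
    for client_id in discover_client_ids():
        location = (f"s3://{BUCKET}/{PREFIX}clientId={client_id}/"
                    f"year={year}/month={month}/day={day}/hour={hour}/")
        ddl = (f"ALTER TABLE {TABLE} ADD IF NOT EXISTS PARTITION "
               f"(clientId='{client_id}', year={year}, month={month}, "
               f"day={day}, hour={hour}) LOCATION '{location}'")
        athena.start_query_execution(
            QueryString=ddl,
            QueryExecutionContext={"Database": DATABASE},
            ResultConfiguration={"OutputLocation": RESULTS},
        )
add_partitions(year=2017, month=3, day=16, hour=20)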

AWS says there's no charge for DDL queries or for partition detection, but S3 GET charges do apply.
I do not yet know how to automate msck repair table to make sure it completes.
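One hedged way to automate it (not something I've battle-tested) is to submit the command through the Athena API and poll the query state, retrying if it fails or times out; the database, table, and output location below are placeholders:
import time
import boto3
athena = boto3.client("athena")
def run_msck_until_done(database="my_db", table="my_orc_table",
                        output="s3://bucket/athena-query-results/",
                        max_attempts=5):
    for attempt in range(max_attempts):
        qid = athena.start_query_execution(
            QueryString=f"MSCK REPAIR TABLE {table}",
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output},
        )["QueryExecutionId"]
        # Poll until the query leaves the QUEUED/RUNNING states.
        while True:
            state = athena.get_query_execution(QueryExecutionId=qid)[
                "QueryExecution"]["Status"]["State"]
            if state not in ("QUEUED", "RUNNING"):
                break
            time.sleep(10)
        if state == "SUCCEEDED":
            return
        # FAILED or CANCELLED (e.g. a timeout): try again.
    raise RuntimeError("MSCK REPAIR TABLE did not succeed after retries")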

Related

BigQuery Scheduled Query won't run

I've got data buckets set up in GCS and I'm using BigQuery to build a table from all the .csv files in that bucket. That works flawlessly. I made a simple deduplication query that, when run manually, selects only the distinct rows and creates a new table with "DeDupe" appended (code below). That also runs flawlessly.
CREATE OR REPLACE TABLE
`project-name-123456.dataset_2022.dataset 2022 DeDuped` AS
SELECT
DISTINCT *
FROM
`project-name-123456.dataset_2022.dataset 2022`
The issue I am having is with scheduling that query. Every time it tries to run I get the error "Error status: Not found: Dataset project-name-123456:dataset_2022 was not found in location US; JobID: project-name-123456:628d7766-0000-2d36-a82f-94eb2c0a664a"
The only thing I can figure is that the data location for my dataset is "us-central1" (since it has a free tier), and when I go to my scheduled query, whether I select the same data location or "Default", it always changes to "US Multiple".
Is there a way to fix this?
Or do I need to create my dataset in "US Multiple"?
I'm trying to cut down on costs as much as possible by keeping it in us-central1.
EDIT: It seems like I just needed to delete and recreate the scheduled query. I chatted with Google Support and they sorted it out. Sorry all!
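For anyone who needs to pin the location programmatically rather than through the console, a rough sketch using the BigQuery Data Transfer API (the service behind scheduled queries) might look like the following; the display name and schedule are assumptions, and the query is the one from the question:
from google.cloud import bigquery_datatransfer
client = bigquery_datatransfer.DataTransferServiceClient()
project_id = "project-name-123456"
query = """
CREATE OR REPLACE TABLE
  `project-name-123456.dataset_2022.dataset 2022 DeDuped` AS
SELECT DISTINCT *
FROM `project-name-123456.dataset_2022.dataset 2022`
"""
# Pin the transfer config to us-central1 so it matches the dataset's location.
parent = client.common_location_path(project_id, "us-central1")
transfer_config = bigquery_datatransfer.TransferConfig(
    display_name="dedupe dataset_2022",   # assumed name
    data_source_id="scheduled_query",
    params={"query": query},              # DDL query, so no destination table settings
    schedule="every 24 hours",            # assumed schedule
)
client.create_transfer_config(parent=parent, transfer_config=transfer_config)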

How to fix corrupted delta lake table on AWS S3

I ended up manually deleting some Delta Lake files (hosted on S3).
Now my Spark job is failing because the Delta transaction log points to files that no longer exist in the file system.
I came across this https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-fsck.html
but I am not sure how I should run this utility in my case.
You can do this by following the document you linked.
If you have a Hive table on top of your S3 data, I have done it as below:
%sql
FSCK REPAIR TABLE schema.testtable DRY RUN
Using DRY RUN will list the file entries that need to be removed from the transaction log. Run the command above first and verify the entries that actually need to be removed.
Once you have verified them, run the same command without DRY RUN and it should do what you need:
%sql
FSCK REPAIR TABLE schema.testtable
Now, if you have not created a Hive table and instead just have a path (a Delta table) containing the files, you can do it as below:
%sql
FSCK REPAIR TABLE delta.`dbfs:/mnt/S3bucket/tables/testtable` DRY RUN
I am doing this from Databricks and have mounted my S3 bucket path into Databricks.
You need to make sure you have the backtick (`) after delta. and before the actual path, otherwise it won't work.
Here too, to perform the actual repair operation, remove DRY RUN from the command above and it should do what you want.
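If you prefer a Python notebook cell over the %sql magic, the same commands can be issued through spark.sql; this assumes the Databricks-provided spark session and the mounted path from above:
# In a Databricks Python cell, `spark` is already defined; same table path as above.
preview = spark.sql("FSCK REPAIR TABLE delta.`dbfs:/mnt/S3bucket/tables/testtable` DRY RUN")
preview.show(truncate=False)  # review the entries that would be removed
# Once verified, run the actual repair:
spark.sql("FSCK REPAIR TABLE delta.`dbfs:/mnt/S3bucket/tables/testtable`")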

AWS Athena MSCK REPAIR TABLE tablename command

Is there any number of partitions we would expect this command
MSCK REPAIR TABLE tablename;
to fail on?
I have a system that currently has over 27k partitions. When the schema changes for the Athena table, we drop the table, recreate it with, say, the new column(s) tacked onto the end, and then run
MSCK REPAIR TABLE tablename;
We had no luck with this command doing any work whatsoever, even after we let it run for 5 hours; not a single partition was added. I'm wondering if anyone has information about a partition limit we may have hit, because we can't find one documented anywhere.
MSCK REPAIR TABLE is an extremely inefficient command. I really wish the documentation didn't encourage people to use it.
What to do instead depends on a number of things that are unique to your situation.
In the general case I would recommend writing a script that performs S3 listings, constructs a list of partitions with their locations, and uses the Glue API BatchCreatePartition to add the partitions to your table.
When your S3 location contains lots of files, as it sounds yours does, I would either use S3 Inventory to avoid listing everything, or list objects with a delimiter of / so that I could list only the directory/partition structure part of the bucket and skip listing all the files. 27K partitions can be listed fairly quickly if you avoid listing everything.
Glue's BatchCreatePartition is a bit annoying to use, since you have to specify all columns, the serde, and everything else for each partition, but it's faster than running ALTER TABLE … ADD PARTITION … and waiting for query execution to finish, and ridiculously faster than MSCK REPAIR TABLE ….
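As a rough illustration of that approach (not a drop-in solution): the database, table, bucket, and prefix below are placeholders, and a single Hive-style partition level is assumed.
import boto3
glue = boto3.client("glue")
s3 = boto3.client("s3")
DATABASE = "my_db"    # placeholder
TABLE = "my_table"    # placeholder
BUCKET = "my-bucket"  # placeholder
PREFIX = "table/"     # prefix directly above the partition directories
# Reuse the table's storage descriptor so every partition inherits the same
# columns/serde/format; only the location differs per partition.
table_sd = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]["StorageDescriptor"]
def partition_prefixes():
    # Delimiter="/" returns only the "directory" level, skipping the data files.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX, Delimiter="/"):
        for cp in page.get("CommonPrefixes", []):
            yield cp["Prefix"]  # e.g. "table/dt=2020-01-01/"
def partition_input(prefix):
    # Assumes Hive-style key=value names; the values become the partition values.
    values = [p.split("=", 1)[1] for p in prefix[len(PREFIX):].strip("/").split("/")]
    sd = dict(table_sd, Location=f"s3://{BUCKET}/{prefix}")
    return {"Values": values, "StorageDescriptor": sd}
batch = [partition_input(p) for p in partition_prefixes()]
# BatchCreatePartition accepts at most 100 partitions per call.
for i in range(0, len(batch), 100):
    glue.batch_create_partition(DatabaseName=DATABASE, TableName=TABLE,
                                PartitionInputList=batch[i:i + 100])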
When it comes to adding new partitions to an existing table, you should also never use MSCK REPAIR TABLE, for mostly the same reasons. Almost always when you add new partitions to a table you know the location of the new partitions, and ALTER TABLE … ADD PARTITION … or Glue's BatchCreatePartition can be used directly with no scripting necessary.
If the process that adds new data is separate from the process that adds new partitions, I would recommend setting up S3 notifications to an SQS queue and periodically reading the messages, aggregating the locations of new files and constructing the list of new partitions from that.
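A hedged sketch of that queue-draining step, assuming S3 publishes events directly to SQS and Hive-style key=value prefixes; the queue URL is a placeholder:
import json
import boto3
from urllib.parse import unquote_plus
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-data-events"  # placeholder
def new_partition_prefixes(max_batches=10):
    # Drain the queue and return the distinct partition prefixes of new objects.
    prefixes = set()
    for _ in range(max_batches):
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=5)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            body = json.loads(msg["Body"])  # assumes S3 notifies SQS directly, not via SNS
            for record in body.get("Records", []):
                key = unquote_plus(record["s3"]["object"]["key"])
                # "table/dt=2020-01-01/part-0000.parquet" -> "table/dt=2020-01-01/"
                prefixes.add(key.rsplit("/", 1)[0] + "/")
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
    return prefixes
# Feed these prefixes into a BatchCreatePartition call like the sketch above.
print(new_partition_prefixes())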

How to handle hive locking across hive and presto

I have a few Hive tables that are insert-overwritten from Spark and Hive. Those tables are also accessed by analysts in Presto. Naturally, we're running into windows of time where users hit an incomplete data set, because Presto ignores locks.
The options I can think of:
Fork the presto-hive connector to support hive S and X locks appropriately. This isn't too bad, but time consuming to do properly.
Swap the table location on the hive metastore once an insert overwrite is complete. This is OK, but a little messy because we like to store explicit locations at the database level and let the tables inherit location.
Stop doing insert-overwrite on these tables and instead just add a new partition for the things that have changed; once a new partition is written, alter the Hive table to see it. Then we can just have views on top of the data that properly reconcile the latest version of each row (see the sketch after this list).
Stop doing insert-overwrite on S3, which has a long window while data is copied from the Hive staging directory to the target table. If we move to HDFS for all insert-overwrite, we still have the issue, but only over the span of time it takes to do an hdfs mv, which is significantly faster (probably still bad: there's a window where we can get incomplete data).
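Here is a minimal sketch of option 3, assuming a synthetic load_ts partition column, a business key named id, and PySpark for the writes; whether Presto can expand the resulting Hive view depends on your connector setup:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# New batch of changed rows; the source path, schema, and table names are assumptions.
batch_id = "2023-01-01T00"
df = spark.read.parquet("s3://bucket/staging/new_batch/")
# Write the batch as a brand-new partition instead of insert-overwriting in place.
(df.withColumn("load_ts", F.lit(batch_id))
   .write.mode("append")
   .partitionBy("load_ts")
   .saveAsTable("db.events_versioned"))
# A view that exposes only the newest version of each row, keyed on `id`.
spark.sql("""
CREATE OR REPLACE VIEW db.events AS
SELECT * FROM (
  SELECT *, row_number() OVER (PARTITION BY id ORDER BY load_ts DESC) AS rn
  FROM db.events_versioned
) t
WHERE rn = 1
""")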
My question is: how do people generally handle this? It seems like a common scenario that would have an explicit solution, but I seem to be missing it. The question applies generally to any third-party tool that can query the Hive metastore and interact with HDFS/S3 directly while not respecting Hive locks.

Loading or pointing to multiple parquet paths for data analysis with hive or prestodb

I have a couple of Spark jobs that produce Parquet files in AWS S3. Every once in a while I need to run some ad-hoc queries over a given date range of this data. I don't want to do this in Spark, because I want our QA team, which has no knowledge of Spark, to be able to do it. What I'd like to do is spin up an AWS EMR cluster, load the Parquet files into HDFS, and run my queries against them. I have figured out how to create tables with Hive and point them at a single S3 path, but that limits my data to only one day, because each day of data has multiple files under a path like
s3://mybucket/table/date/(parquet files 1 ... n).
So problem one is to figure out how to load multiple days of data into Hive, i.e.
s3://mybucket/table_a/day_1/(parquet files 1 ... n).
s3://mybucket/table_a/day_2/(parquet files 1 ... n).
s3://mybucket/table_a/day_3/(parquet files 1 ... n).
...
s3://mybucket/table_b/day_1/(parquet files 1 ... n).
s3://mybucket/table_b/day_2/(parquet files 1 ... n).
s3://mybucket/table_b/day_3/(parquet files 1 ... n).
I know Hive supports partitions, but my S3 files are not set up that way.
I have also looked into Presto, which seems to be the favorite tool for this type of data analysis. The fact that it supports ANSI SQL makes it great for people who have SQL knowledge but know very little about Hadoop or Spark. I installed it on my cluster and it works great. But it looks like you can't really load data into your tables; you have to rely on Hive to do that part. Is this the right way to use Presto? I watched a Netflix presentation about their use of Presto with S3 in place of HDFS. If this works it's great, but I wonder how the data is moved into memory. At what point are the Parquet files moved from S3 to the cluster? Do I need a cluster that can load the entire data set into memory? How is this generally set up?
You can install Hive and create Hive tables over your data in S3, as described in this blog post: https://blog.mustardgrain.com/2010/09/30/using-hive-with-existing-files-on-s3/
Then install Presto on AWS and configure it to connect to the Hive catalog you set up previously. You can then query your data on S3 with Presto using SQL.
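To address the multiple-days problem specifically, you can declare a partition column and point each partition at its existing, non-Hive-style prefix. A hedged sketch using PyHive against HiveServer2 on the EMR master node (the columns, host, and credentials are hypothetical; the paths come from the question):
from pyhive import hive  # assumes HiveServer2 is reachable on the EMR master node
conn = hive.connect(host="emr-master-node", port=10000, username="hadoop")
cur = conn.cursor()
# Hypothetical columns; the partition column `day` is declared even though the
# S3 layout is not Hive-style (there is no day=... in the path).
cur.execute("""
CREATE EXTERNAL TABLE IF NOT EXISTS table_a (
  id STRING,
  value DOUBLE
)
PARTITIONED BY (day STRING)
STORED AS PARQUET
LOCATION 's3://mybucket/table_a/'
""")
# Point each partition explicitly at its existing, non-Hive-style prefix.
for day in ["day_1", "day_2", "day_3"]:
    cur.execute(
        f"ALTER TABLE table_a ADD IF NOT EXISTS PARTITION (day='{day}') "
        f"LOCATION 's3://mybucket/table_a/{day}/'"
    )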
Rather than trying to load multiple files, you could instead use the API to concatenate the days you want into a single object, which you can then load through the means you already mention.
AWS has a blog post highlighting how to do this exact thing purely through the API (without downloading + re-uploading the data):
https://ruby.awsblog.com/post/Tx2JE2CXGQGQ6A4/Efficient-Amazon-S3-Object-Concatenation-Using-the-AWS-SDK-for-Ruby
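A rough boto3 equivalent of that Ruby approach uses a multipart upload with copy parts, so nothing is downloaded or re-uploaded. The keys below are placeholders, every part except the last must be at least 5 MB, and note that byte-concatenating Parquet files does not produce a valid Parquet file, so this fits row-oriented formats (CSV, JSON lines) better than columnar ones:
import boto3
s3 = boto3.client("s3")
BUCKET = "mybucket"  # placeholder bucket and keys
SOURCES = ["table_a/day_1/part-0000.csv", "table_a/day_2/part-0000.csv"]
TARGET = "table_a/combined/days_1_2.csv"
# Server-side concatenation via multipart "copy parts": nothing is downloaded.
upload = s3.create_multipart_upload(Bucket=BUCKET, Key=TARGET)
parts = []
for i, key in enumerate(SOURCES, start=1):
    resp = s3.upload_part_copy(
        Bucket=BUCKET, Key=TARGET, UploadId=upload["UploadId"],
        PartNumber=i, CopySource={"Bucket": BUCKET, "Key": key},
    )
    parts.append({"PartNumber": i, "ETag": resp["CopyPartResult"]["ETag"]})
s3.complete_multipart_upload(
    Bucket=BUCKET, Key=TARGET, UploadId=upload["UploadId"],
    MultipartUpload={"Parts": parts},
)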