Deleting a file from S3 does not delete it from Athena?

When I add a file to S3 and run a query against Athena, Athena returns the expected result with the data from that file.
But if I then delete that same file from S3 and run the same query, Athena still returns the same data even though the file is no longer in S3.
Is this the expected behaviour? I thought Athena calls out to S3 on every query, but I'm now starting to think there is some sort of caching going on?
Does anyone have any ideas? I can't find any information online about this.
Thanks for the help in advance!

Athena (Hive)/Glue loads partitions periodically rather than on every query. If you want the latest results you need to run
MSCK REPAIR TABLE table_name;
to refresh Athena's partition metadata.
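For illustration, a minimal sequence might look like the following (my_table is a placeholder table name, not from the question):
MSCK REPAIR TABLE my_table;
-- list the partitions Athena currently knows about, to verify they match what is in S3
SHOW PARTITIONS my_table;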

Thanks for the help guys.
I actually was looking at the wrong files in S3 and the files I thought were removed were still present. Once I deleted them from S3, the query against Athena returned the expected results immediately.
Thanks!

Related

Looking for a safe way to delete a parquet file from Delta Lake

as the title says, I'm looking for a safe way to delete parquet files on a Delta Lake, safe because I don't want to corrupt my Delta Lake.
Here is what happened: a few days ago some partitions had to be backfilled, but I forgot to delete the contents of one of them prior to this process. Deleting and refilling the whole partition is the last resort I want to try right now.
I know from the last-modification date which files I need to delete, but if I delete them from S3 I will corrupt the partition/table and create an inconsistency with the manifest.
What would be the "uncorrupting" way to delete those files?
I'm currently working with Spark.
Thanks!
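For context, the Delta-native way to remove data without touching parquet files directly is a logical DELETE followed by VACUUM, which keeps the transaction log consistent. A minimal Spark SQL sketch, assuming the table is registered as my_delta_table (a placeholder name) and assuming the unwanted rows can be expressed as a predicate rather than only by file modification date:
-- logically remove the bad rows; Delta tombstones the affected parquet files in the log
DELETE FROM my_delta_table
WHERE year = 2017 AND month = 3 AND day = 16;  -- hypothetical predicate for the backfilled partition
-- physically remove files no longer referenced, once they pass the retention threshold
VACUUM my_delta_table;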

What is the easiest way to upload data from S3 to Redshift?

I'm looking for a simple way to load data from S3 to Redshift.
I've tried AWS Glue and Firehose, without success.
EDIT:
It's not the best way to do it, but for now AWS Glue is working. I'll revisit the COPY command to try to get better results!
Thanks guys!
The simplest solution is to use the COPY command, e.g.
create table my_table(...);
copy my_table
from 's3://my_bucket/my_prefix/data.txt'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
region 'us-west-2';
By default, the data file should contain pipe-separated plain text columns. There are plenty more options: JSON, Parquet, using a manifest file to load from multiple files, etc.
UNLOAD is the reverse command (dumping a table to S3).
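For example, hedged sketches using the same placeholder bucket, prefix, and role as above: loading JSON instead of the default pipe-delimited text, then unloading a table back to S3.
copy my_table
from 's3://my_bucket/my_prefix/data.json'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>'
region 'us-west-2'
format as json 'auto';

unload ('select * from my_table')
to 's3://my_bucket/my_prefix/export_'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>';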

How to view what is being copied in SQL

I have JSON data in an Amazon Web Services S3 bucket. I am trying to copy it into a database (AWS Redshift).
I am using the following command:
COPY mytable FROM 's3://bucket/somedata'
iam_role 'arn:aws:iam::12345678:role/MyRole';
I suspect the bucket's data is being copied along with some additional metadata, and that this metadata is causing my COPY command to fail.
Can you tell me, is it possible to print the copied data somehow?
Thanks in advance!
If your COPY command fails, you should check the stl_load_errors system table. It has a raw_line column which shows the raw data that caused the failure. There are also other columns which will give you more details about the error.
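For example, a quick way to inspect the most recent load failures (these are standard stl_load_errors columns):
select starttime, filename, line_number, colname, err_reason, raw_line
from stl_load_errors
order by starttime desc
limit 10;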

AWS Athena fails when there are empty files in S3

I have a data pipeline that copies data partitioned by date. Sometimes there is no data for a day, and the data pipeline creates a 0-byte CSV file. When I run an Athena query for that date it fails instead of returning 0 results. The error I get is
HIVE_CURSOR_ERROR: Unexpected end of input stream
How can I avoid this? I understand one way is to never create files with empty data, but I could never figure out how to do that in the data pipeline. Is there anything I can tweak in Athena so that it does not fail this way?
Try running the below command after your data has been copied by data pipeline.
MSCK REPAIR TABLE table_name
This should recover/update the partitions in the Athena catalog.
It can be the last step in your data pipeline. Before you actually make it part of your pipeline, try executing it in the Athena Query console and verify if it resolves the issue.

AWS Athena: does `msck repair table` incur costs?

I have ORC data in S3 that looks like this:
s3://bucket/orc/clientId=client-1/year=2017/month=3/day=16/hour=20/
s3://bucket/orc/clientId=client-2/year=2017/month=3/day=16/hour=21/
s3://bucket/orc/clientId=client-3/year=2017/month=3/day=16/hour=22/
Every hour I run an EMR job that converts raw JSON in S3 to ORC and writes it out with the partition path convention above for Athena ingestion. After the EMR job completes, I run msck repair table so Athena can pick up the new partitions.
I have 3 related questions:
Does running msck repair table in this scenario cost me money in AWS?
AWS Docs say msck repair table can timeout. Is there a way I can make a step in data pipeline to continue running this command until it completes successfully?
I would prefer to add the partitions manually to Athena (since I know the year, month, day, and hour I'm working on). However, I do not know the clientId, because there could be 1-X of them and I don't know which ones exist at the time the EMR job runs. Is there a best-practice way to solve this (using Hive or something else)? I could make an S3 API call to get a list of s3://bucket/org/ and write code to iterate over the list and add the partitions manually, but I'm hoping there is an easier way...
Note: when I say "add partitions manually" I mean doing something like this:
ALTER TABLE <athena table>
ADD PARTITION (clientId='client-1',year=2017,month=3,day=16,hour=20)
location 's3://bucket/orc/clientId=client-1/year=2017/month=3/day=16/hour=20/';
AWS says there is no charge for DDL queries or for partition detection; however, S3 GET charges do apply.
I do not yet know how to automate msck repair table to make sure it completes.
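On the manual-partition side, Athena's DDL does accept several partitions in one statement, so once the clientIds present for a given hour are known (for example by listing the S3 prefix after the EMR job), something like the sketch below could replace msck repair table. The paths reuse the example layout from the question; the exact values are illustrative.
ALTER TABLE <athena table> ADD IF NOT EXISTS
PARTITION (clientId='client-1', year=2017, month=3, day=16, hour=20)
location 's3://bucket/orc/clientId=client-1/year=2017/month=3/day=16/hour=20/'
PARTITION (clientId='client-2', year=2017, month=3, day=16, hour=21)
location 's3://bucket/orc/clientId=client-2/year=2017/month=3/day=16/hour=21/';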