AWS Athena giving error when trying to query files in S3 that have already been catalogued in the Glue data catalog - amazon-s3

I'm trying to build a data lake using S3 for files that are in .csv.gz format, and then further cleanse/process the data within the AWS environment itself.
I first used AWS Glue to create a data catalog (the crawler was able to identify all tables).
The tables from the catalog are also available in AWS Athena, but when I try to run a SELECT * on a table it gives me the following error.
Error opening Hive split s3://BUCKET_NAME/HEADER FOLDER/FILENAME.csv.gz (offset=0, length=44354) using org.apache.hadoop.mapred.TextInputFormat: Permission denied on S3 path: 3://BUCKET_NAME/HEADER FOLDER/FILENAME.csv.gz.
Could it be that the file is in .csv.gz format and that is why it cannot be accessed as is, or do I need to give the user or role specific access to these files?

You need to fix your permissions. The error says the principal (user/role) that ran the query does not have permission to read the object in S3. Granting that principal s3:GetObject on the objects (and s3:ListBucket on the bucket) typically resolves it; the .csv.gz format itself is not the problem, since Athena can read gzip-compressed CSV files.

Related

Query S3 Bucket With Amazon Athena and modify values

I have an S3 bucket with 500 csv files that are identical except for the number values in each file.
How do I write a query that grabs dividendsPaid, makes it positive for each file, and sends the results back to S3?
Amazon Athena is a query engine that can perform queries on objects stored in Amazon S3. It cannot modify files in an S3 bucket. If you want to modify those input files in-place, then you'll need to find another way to do it.
However, it is possible for Amazon Athena to create a new table with the output files stored in a different location. You could use the existing files as input and then store new files as output.
The basic steps are:
Create a table definition (DDL) for the existing data (I would recommend using an AWS Glue crawler to do this for you)
Use CREATE TABLE AS to select data from the table and write it to a different location in S3. The command can include an SQL SELECT statement to modify the data (changing the negatives), as in the sketch below.
See: Creating a table from query results (CTAS) - Amazon Athena
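For example, a minimal CTAS sketch (the table names, the output location and the ticker column are hypothetical; only dividendsPaid comes from the question):
CREATE TABLE dividends_clean
WITH (
  format = 'PARQUET',                                            -- write the cleaned copy as Parquet
  external_location = 's3://my-output-bucket/dividends-clean/'   -- hypothetical output location
) AS
SELECT ticker,                               -- example pass-through column
       abs(dividendsPaid) AS dividendsPaid   -- turn the negative values positive
FROM dividends_raw;                          -- hypothetical crawler-created table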

When I run a Snowflake stage query I get an AWS error

I've created an S3-linked stage on Snowflake called csv_stage with my AWS credentials, and the creation was successful.
Now I'm trying to query the stage like below:
select t.$1, t.$2 from @sandbox_ra.public.csv_stage/my_file.csv t
However the error I'm getting is
Failure using stage area. Cause: [The AWS Access Key Id you provided is not valid.]
Any idea why? Do I have to pass something in the query itself?
Thanks for your help!
Ultimately let's say my s3 location has 3 different csv files. I would like to load each one of them individually to different snowflake tables. What's the best way to go about doing this?
Regarding the last part of your question: you can load multiple files with one COPY INTO command by using the file names or a certain regex pattern. But as you have 3 different files for 3 different tables, you also have to use three different COPY INTO commands.
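For example, a sketch with hypothetical table and file names (FILES picks individual files from the stage; PATTERN would take a regex instead):
COPY INTO sandbox_ra.public.customers FROM @sandbox_ra.public.csv_stage FILES = ('customers.csv') FILE_FORMAT = (TYPE = 'CSV');
COPY INTO sandbox_ra.public.orders    FROM @sandbox_ra.public.csv_stage FILES = ('orders.csv')    FILE_FORMAT = (TYPE = 'CSV');
COPY INTO sandbox_ra.public.payments  FROM @sandbox_ra.public.csv_stage FILES = ('payments.csv')  FILE_FORMAT = (TYPE = 'CSV');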
Regarding querying your stage you can find some more hints in these questions:
Missing List-permissions on AWS - Snowflake - Failure using stage area. Cause: [The AWS Access Key Id you provided is not valid.] and
https://community.snowflake.com/s/question/0D50Z00008EKjkpSAD/failure-using-stage-area-cause-access-denied-status-code-403-error-code-accessdeniedhow-to-resolve-this-error
https://aws.amazon.com/de/premiumsupport/knowledge-center/access-key-does-not-exist/
I found out the AWS credentials I provided were not right. After fixing that, the query worked.
This approach works to import data into a Snowflake table from a public S3 bucket:
COPY INTO SNOW_SCHEMA.table_name FROM 's3://test-public/new/solution/file.csv'
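If the file has a header row or a non-default delimiter, a file format can be added; a sketch of the same command with hypothetical format options:
COPY INTO SNOW_SCHEMA.table_name FROM 's3://test-public/new/solution/file.csv'
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1 FIELD_DELIMITER = ',');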

How can I search Json files stored in my s3 bucket from my local computer?

I have thousands of json files stored in my s3 bucket and need to perform a grep search for the string "name".
I have configured the AWS CLI OK, as I can print out all the files in the bucket (via the ls command).
I have tried the 2 following commands:
1)
aws s3 ls s3://training | grep 'name'
This resulted in nothing
2)
aws s3 cp s3://training/*json - | grep 'name'
This gave the error:
download failed: s3://training/*json to - An error occurred (404)
when calling the HeadObject operation: Not Found
I know the string "name" exists 100% as it is a field name that appears multiple times in each JSON file.
Any ideas what I'm doing wrong?
Your first example fails because you are listing the objects, rather than printing out the content of the objects.
Your second example fails because you cannot use wildcards with S3 requests.
One way to do this would be to sync the files locally, then grep the local files, then delete the local files (or just leave them in place to optimize future syncs). You can use aws s3 sync to do this.
Another option would be to use Amazon Athena to query the JSON content directly with SQL, for example:
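A minimal sketch, assuming the objects hold one JSON document per line with a top-level "name" field (the table name is hypothetical; the bucket is the one from the question):
CREATE EXTERNAL TABLE training_json (
  name string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://training/';

SELECT name FROM training_json WHERE name IS NOT NULL LIMIT 10;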
Another option would be to create a search index when documents are uploaded to S3. You could trigger a Lambda function which reads the object content and indexes that into another S3 object, or a DynamoDB table, or even Elasticsearch if this is a significant system.

How to view what is being copied in SQL

I have JSON data in an Amazon Web Services (AWS) S3 bucket. I am trying to copy it into a database (AWS Redshift).
I am using the following command:
COPY mytable FROM 's3://bucket/somedata'
iam_role 'arn:aws:iam::12345678:role/MyRole';
I am thinking the bucket's data is being copied with some additional metadata, and that the metadata is causing my COPY command to fail.
Can you tell me, is it possible to print the copied data somehow?
Thanks in advance!
If your COPY command fails, you should check the stl_load_errors system table. It has a raw_line column which shows the raw data that caused the failure. There are also other columns which will provide you with more details about the error.
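For example, a query along these lines lists the most recent load errors, including the raw line that failed:
SELECT query, filename, line_number, colname, err_reason, raw_line
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;
Also note that, since the input here is JSON, the COPY command needs FORMAT AS JSON 'auto' (or a jsonpaths file); without it Redshift treats the input as delimited text, which is itself a common cause of load errors.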

Exporting data from a Google BigQuery table to Google Cloud Storage

When exporting data from a Google BigQuery table to Google Cloud Storage in Python, I get the error:
Access Denied: BigQuery BigQuery: Permission denied while writing data.
I checked the JSON key file and it links to the owner of the storage. What can I do?
There are several reasons for this type of error:
1. Check that you give the exact path to the GOOGLE_APPLICATION_CREDENTIALS key file.
2. Check that you have write permission in your project.
3. Check that you have given a correct schema and values if you are writing a table; many times this type of error occurs due to an incorrect schema value.
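Once the credentials and bucket permissions are sorted out, an alternative to the Python client is to run the export directly as a BigQuery SQL statement with EXPORT DATA; a sketch with hypothetical project, dataset, table and bucket names:
EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/exports/my_table-*.csv',  -- a wildcard is required in the URI
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT * FROM `my-project.my_dataset.my_table`;
The service account still needs write access (e.g. storage.objects.create) on the destination bucket, which is the same permission the Python export requires.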