Permission denied on S3 path - What are the minimum policies required on the data source to get athena-express to work? - amazon-s3

I'm attempting to use the Athena-express node module to query Athena.
Per the Athena-express docs:
"This IAM role/user must have AmazonAthenaFullAccess and AmazonS3FullAccess policies attached.
Note: As an alternative to granting AmazonS3FullAccess you could granularize and limit write access to a specific bucket. Just specify this bucket name during athena-express initialization."
Providing AmazonS3FullAccess to this microservice is a non-starter. What is the minimum set of privileges I can grant to the microservice and still get around the "Permission denied on S3 path: s3://..." errors I've been getting?
Currently, I've got the following:
Output location: (I don't think the problem is here)
s3:AbortMultipartUpload, s3:CreateMultipartUpload, s3:DeleteObject, s3:Get*, s3:List*, s3:PutObject, s3:PutObjectTagging
on "arn:aws:s3:::[my-bucket-name]/tmp/athena" and "arn:aws:s3:::[my-bucket-name]/tmp/athena/*"
Data source location:
s3:GetBucketLocation
on "arn:aws:s3:::*"
s3:ListBucket
on "arn:aws:s3:::[my-bucket-name]"
s3:Get* and s3:List*
on "arn:aws:s3:::[my-bucket-name]/production/[path]/[path]" and "arn:aws:s3:::[my-bucket-name]/production/[path]/[path]/*"
The error message I get with the above is:
"Permission denied on S3 path: s3://[my-bucket-name]/production/[path]/[path]/v1/dt=2022-05-26/.hoodie_partition_metadata"
Any suggestions? Thanks!

It turned out that the bucket storing the data I needed to query was encrypted, which meant that the missing permission was kms:Decrypt.
Athena outputs the results of each query to an S3 location (which athena-express then retrieves). That output location was in the same encrypted bucket, so I also ended up giving my cron job kms:Encrypt and kms:GenerateDataKey.
I ended up using CloudTrail to figure out which permissions were causing my queries to fail.
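For reference, here is a minimal sketch of attaching that extra KMS statement as an inline policy with boto3, on top of the S3 statements already listed in the question. The role name, policy name, and key ARN are placeholders I'm assuming; substitute the role your microservice runs as and the KMS key that encrypts the bucket.

import json
import boto3

iam = boto3.client("iam")

# Hypothetical key ARN; use the key that encrypts the data and output bucket.
kms_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey"],
            "Resource": "arn:aws:kms:us-east-1:123456789012:key/your-key-id",
        }
    ],
}

# Attach the statement as an inline policy on the microservice's role.
iam.put_role_policy(
    RoleName="my-query-role",
    PolicyName="athena-express-kms",
    PolicyDocument=json.dumps(kms_policy),
)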

Related

Content of directory on path 'https://xxxxxxx.dfs.core.windows.net/dataverse-xxxx-org5a2/account/Snapshot/2018-08_1656570292/*.csv' cannot be listed

When I try to query our Serverless SQL pool in Azure Synapse Analytics I get the following error:
"Content of directory on path 'https://xxxxxx.dfs.core.windows.net/dataverse-xxxxxx-org5a2bcccf/account/Snapshot/2018-08_1656570292/*.csv' cannot be listed.".
I have checked out the following link for clues as to what could be the cause:
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/resources-self-help-sql-on-demand?tabs=x80070002
It is suggested that the error is due to permissions.
However, I believe I have the correct permissions.
I get this error whether I try to execute the query in SSMS or Synapse Workspace.
The error in SSMS is as follows:
Warning: Unable to resolve path https://xxxxx.dfs.core.windows.net/dataverse-xxxxx-org5a2bcccf/account/Snapshot/2018-10_1657304551/*.csv. Error number 13807, Level 16, State 1, Message "Content of directory on path 'https://xxxxxx.dfs.core.windows.net/dataverse-xxxxx-org5a2bcccf/account/Snapshot/2018-10_1657304551/*.csv' cannot be listed.".
Can someone let me know how to resolve this?
The query that I'm attempting to execute can be located here:
https://github.com/slavatrofimov/Synapse-Link-for-Dataverse-data-enrichment-in-Serverless-SQL-Pools/blob/main/SQL/Enrich%20Synapse%20Link%20for%20Dataverse%20Entities%20with%20Human-Readable%20Labels.sql
Is there a definitive way to determine if the problem is due to lack of permissions?
Update to the question:
I have just realised that the issue is accessing the lake at https://xxxxxx.dfs.core.windows.net/dataverse-xxxxxx-org5a2bcccf/
Therefore, please take a look at my permissions on the lake and let me know whether they are sufficient.
This issue occurs when the user trying to query the external table does not have the relevant permissions or if there is a firewall enabled on your storage network.
Looking at the permissions you have provided, I see that Storage Blob Data Reader and Storage Blob Data Contributor have been granted.
Ref doc: Control storage account access for serverless SQL pool in Azure Synapse Analytics
If your storage account is firewall protected, you will have to follow the steps described in this document to overcome the issue: Access storage that is protected with the firewall
Here are a couple of relevant articles which might help you configure your storage firewall to overcome this issue:
Storage configuration for external table is not accessible while query on Serverless
Synapse Studio error while trying to read data from Storage Account using SQL On Demand
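As a quick way to narrow down whether it is a permissions problem at all (independent of the serverless SQL pool), you can try listing the same directory with the storage SDK using the identity you expect the pool to pass through. A minimal sketch, assuming the azure-identity and azure-storage-file-datalake packages and placeholder account/container names:

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account URL and container name -- substitute your own.
account_url = "https://xxxxxx.dfs.core.windows.net"
container = "dataverse-xxxxxx-org5a2bcccf"

service = DataLakeServiceClient(account_url=account_url, credential=DefaultAzureCredential())
filesystem = service.get_file_system_client(container)

# If this raises an authorization error, the identity lacks list/read rights
# (or the storage firewall is blocking you); if it succeeds, the directory is listable.
for path in filesystem.get_paths(path="account/Snapshot/2018-08_1656570292"):
    print(path.name)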

Airflow Permission denied while getting Drive credentials

I am trying to run a BigQuery query on Airflow with MWAA.
This query uses a table that is based on a Google Sheet. When I run it, I have the following error:
google.api_core.exceptions.Forbidden: 403 Access Denied: BigQuery BigQuery: Permission denied while getting Drive credentials.
I already have a working Google cloud connection on Airflow with an admin service account.
Also:
This service account has access to the google sheet
I added https://www.googleapis.com/auth/drive in the scopes of the Airflow connection
I re-generated a JSON file
Am I doing something wrong? Any idea what I can do to fix this problem?
Thanks a lot
I fixed my issue by creating a NEW Airflow connection: a new Google Cloud connection with the exact same values as the default google_cloud_default connection. Now it works perfectly.
Hope it can help!

Access Denied: Permission denied while getting Drive credentials

Since today our Airflow service is not able to access queries in BigQuery. All jobs fail with the following message:
[2021-03-12 10:17:28,079] {taskinstance.py:1150} ERROR - Reason: 403 GET https://bigquery.googleapis.com/bigquery/v2/projects/waipu-app-prod/queries/e62030d7-36eb-4420-b482-b5327f4f6c7e?maxResults=0&timeoutMs=900&location=EU: Access Denied: BigQuery BigQuery: Permission denied while getting Drive credentials.
We haven't changed anything in recent days. Therefore we are quite puzzled what the reason might be. Is there a temporary bug? Or might we have to check any settings?
Thanks & Best regards
Albrecht
I solved this by:
Giving the Airflow service account email access to the Google Sheet that the BigQuery table is derived from
Adding https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/drive to the scopes in the Airflow connection
Regenerating the service account JSON keyfile and pasting it into the Keyfile JSON field in the Airflow connection
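If you prefer to register that connection in code rather than through the UI, here is a minimal sketch, assuming Airflow with the Google provider installed. Note that the extra__google_cloud_platform__* keys are the field names used by older provider versions and may differ (e.g. plain scope/keyfile_dict) in newer releases.

import json
from airflow.models import Connection
from airflow.settings import Session

conn = Connection(
    conn_id="google_cloud_default",
    conn_type="google_cloud_platform",
    extra=json.dumps({
        # Scopes from the answer above: cloud-platform, bigquery and drive.
        "extra__google_cloud_platform__scope": ",".join([
            "https://www.googleapis.com/auth/cloud-platform",
            "https://www.googleapis.com/auth/bigquery",
            "https://www.googleapis.com/auth/drive",
        ]),
        # Paste the regenerated service account JSON key here (placeholder).
        "extra__google_cloud_platform__keyfile_dict": "<contents of the regenerated JSON key>",
    }),
)

# Persist the connection in the Airflow metadata database.
session = Session()
session.add(conn)
session.commit()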

ADLS to Azure Storage Sync Using AzCopy

Looking for some help to resolve the errors I'm facing. Let me explain the scenario: I'm trying to sync one of the ADLS Gen2 containers to Azure Blob Storage. I have AzCopy 10.4.3 and I'm using azcopy sync to do this, with the command below
azcopy sync 'https://ADLSGen2.blob.core.windows.net/testsamplefiles/SAMPLE' 'https://AzureBlobStorage.blob.core.windows.net/testsamplefiles/SAMPLE' --recursive
When I run this command I'm getting below error
REQUEST/RESPONSE (Try=1/71.0063ms, OpTime=110.9373ms) -- RESPONSE SUCCESSFULLY RECEIVED
PUT https://AzureBlobStorage.blob.core.windows.net/testsamplefiles/SAMPLE/SampleFile.parquet?blockid=ZDQ0ODlkYzItN2N2QzOWJm&comp=block&timeout=901
X-Ms-Request-Id: [378ca837-d01e-0031-4f48-34cfc2000000]
ERR: [P#0-T#0] COPYFAILED: https://ADLSGen2.blob.core.windows.net/testsamplefiles/SAMPLE/SampleFile.parquet: 404 : 404 The specified resource does not exist.. When Staging block from URL. X-Ms-Request-Id: [378ca837-d01e-0031-4f48-34cfc2000000]
Dst: https://AzureBlobStorage.blob.core.windows.net/testsamplefiles/SAMPLE/SampleFile.parquet
REQUEST/RESPONSE (Try=1/22.9854ms, OpTime=22.9854ms) -- RESPONSE SUCCESSFULLY RECEIVED
GET https://AzureBlobStorage.blob.core.windows.net/testsamplefiles/SAMPLE/SampleFile.parquet?blocklisttype=all&comp=blocklist&timeout=31
X-Ms-Request-Id: [378ca84e-d01e-0031-6148-34cfc2000000]
So far I have checked and ensured the following:
I logged into correct tenant while logging into AzCopy
Storage Blob Data Contributor role was granted to my AD credentials
Not sure what else I'm missing, as the file exists in the source and I'm getting the same error. I tried with SAS but received a different error. I cannot proceed with SAS due to the vendor policy, so I need to ensure this works with OAuth. Any input is really appreciated.
For the 404 error, you may check whether there is any typo in the command and whether the path /testsamplefiles/SAMPLE exists on both the source and destination accounts. Also, please note this tip:
Use single quotes in all command shells except for the Windows Command
Shell (cmd.exe). If you're using a Windows Command Shell (cmd.exe),
enclose path arguments with double quotes ("") instead of single
quotes ('').
From the azcopy sync supported scenarios:
Azure Blob <-> Azure Blob (Source must include a SAS or is publicly
accessible; either SAS or OAuth authentication can be used for
destination)
We must include a SAS token in the source, but I tried the command below with AD authentication,
azcopy sync "https://[account].blob.core.windows.net/[container]/[path/to/blob]?[SAS]" "https://[account].blob.core.windows.net/[container]/[path/to/blob]"
but got the same 400 error as in the GitHub issue.
Thus, in this case, after my validation, you could use the following command to sync one of the ADLS Gen2 containers to Azure Blob Storage without executing azcopy login. If you have logged in, you can run azcopy logout.
azcopy sync "https://nancydl.blob.core.windows.net/container1/sample?sv=xxx" "https://nancytestdiag244.blob.core.windows.net/container1/sample?sv=xxx" --recursive --s2s-preserve-access-tier=false

Get-AzureRmDataLakeStoreChildItem access issue

I am trying to run this PowerShell cmdlet:
Get-AzureRmDataLakeStoreChildItem -AccountName "xxxx" -Path "xxxxxx"
It fails with an access error. It does not really make sense because I have complete access to the ADLS account; I can browse it in the Azure portal. It does not even work with an AzureRunAsConnection from an automation account. But it works perfectly for my colleague. What am I doing wrong?
Error :
Operation: LISTSTATUS failed with HttpStatus:Forbidden
RemoteException: AccessControlException LISTSTATUS failed with error
0x83090aa2 (Forbidden. ACL verification failed. Either the resource
does not exist or the user is not authorized to perform the requested
operation.).
[1f6e5d40-9be1-4682-84be-d538dfca0d19][2019-01-24T21:12:27.0252648-08:00]
JavaClassName: org.apache.hadoop.security.AccessControlException.
Last encountered exception thrown after 1 tries. [Forbidden (
AccessControlException LISTSTATUS failed with error 0x83090aa2
(Forbidden. ACL verification failed. Either the resource does not
exist or the user is not authorized to perform the requested
operation.).
I don't see any firewall restrictions:
I resolved the problem by providing read and execute access to all parent folders in the path. Since ADLS uses the POSIX standard, it does not inherit permissions from parent folders. So, even though the SPN (generated by the automation account) I was using had read/execute access to the specific folder I was interested in, it did not have access to the other folders in that path.
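For reference, a rough sketch of granting that read/execute access with the azure-datalake-store Python SDK (Gen1), assuming its modify_acl_entries helper, a service principal with a placeholder object ID, and an illustrative folder path; it must be run as an identity that is allowed to change ACLs on the account.

from azure.datalake.store import core, lib

# Placeholder tenant/client values for an identity allowed to change ACLs.
token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<client-id>",
                 client_secret="<client-secret>")
adls = core.AzureDLFileSystem(token, store_name="xxxx")

# Object ID of the SPN generated by the automation account (placeholder).
spn_object_id = "<spn-object-id>"

# ADLS Gen1 ACLs are not inherited, so grant r-x on every folder along the
# path, not just the target folder. The folder names here are illustrative.
parts = ["folder1", "folder2", "target"]
current = ""
for part in parts:
    current = f"{current}/{part}"
    adls.modify_acl_entries(current, acl_spec=f"user:{spn_object_id}:r-x")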