Authentication Failure when Accessing Azure Blob Storage through Connection String - azure-storage

We get an authentication failure when we try to create an Azure blob client from a connection string, using the Python v12 SDK with Azure Blob Storage v12.5.0 and Azure Core 1.8.2.
I used
azure-storage-blob == 12.5.0
azure-core == 1.8.2
I tried to access my blob storage account using a connection string with the Python v12 SDK and received the error above. The environment I'm running in is a Python venv inside a Nix shell.
The code that calls the blob upload is as follows:
blob_service_client = BlobServiceClient(account_url=<>, credential=<>)
blob_client = blob_service_client.get_blob_client(container=container_name, blob=file)
I printed out blob_client and it looks normal, but the next line, upload_blob, raises the error.
with open(os.path.join(root, file), "rb") as data:
    blob_client.upload_blob(data)
The error message is as follows
File "<local_address>/.venv/lib/python3.8/site-packages/azure/storage/blob/_upload_helpers.py", in upload_block_blob
return client.upload(
File "<local_address>/.venv/lib/python3.8/site-packages/azure/storage/blob/_generated/operations/_block_blob_operations.py", in upload
raise models.StorageErrorException(response, self._deserialize)
azure.storage.blob._generated.models._models_py3.StorageErrorException: Operation returned an invalid status 'Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.'
So I printed out the HTTP PUT request to Azure Blob Storage and got a [403] response.

The following code works fine for me with the same versions as yours:
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(conn_str="your connection string from Access Keys",
                                         container_name="your-container", blob_name="SampleSource.txt")
with open("./SampleSource.txt", "rb") as data:
    blob.upload_blob(data)
Please check your connection string, and check your PC's clock: a skewed system time also produces this signature error.
There is a similar issue about the error: AzureStorage Blob Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature
UPDATE:
I tried this code and got the same error:
from azure.storage.blob import BlobServiceClient
from azure.identity import DefaultAzureCredential

token_credential = DefaultAzureCredential()
blob_service_client = BlobServiceClient(account_url="https://pamelastorage123.blob.core.windows.net/", credential=token_credential)
blob_client = blob_service_client.get_blob_client(container="pamelac", blob="New Text Document.txt")
with open("D:/demo/python/New Text Document.txt", "rb") as data:
    blob_client.upload_blob(data)
Then I used AzureCliCredential() instead of DefaultAzureCredential(), authenticated via the Azure CLI with az login, and it worked.
If you use EnvironmentCredential, you need to set the corresponding environment variables. In any case, I recommend using a specific credential type instead of DefaultAzureCredential.
For more details about Azure Identity, see here.
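For reference, here is a minimal sketch of the AzureCliCredential variant that worked; the account URL, container and file names are placeholders, not values from the question:
from azure.identity import AzureCliCredential
from azure.storage.blob import BlobServiceClient

# Assumes you have already run `az login` with an account that has a data
# role (e.g. Storage Blob Data Contributor) on the storage account.
credential = AzureCliCredential()
blob_service_client = BlobServiceClient(
    account_url="https://<your-account>.blob.core.windows.net/",
    credential=credential)
blob_client = blob_service_client.get_blob_client(container="<container>", blob="sample.txt")
with open("./sample.txt", "rb") as data:
    blob_client.upload_blob(data, overwrite=True)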

Related

URI Format for Creating an Airflow S3 Connection via Environment Variables

I've read the documentation for creating an Airflow Connection via an environment variable and am using Airflow v1.10.6 with Python 3.5 on Debian 9.
The linked documentation above shows an example S3 connection of s3://accesskey:secretkey@S3. From that, I defined the following environment variable:
AIRFLOW_CONN_AWS_S3=s3://{MY_ACCESS_KEY}:{MY_SECRET_ACCESS_KEY}@S3
and the following function:
def download_file_from_S3_with_hook(key, bucket_name):
    """Get file contents from S3"""
    hook = airflow.hooks.S3_hook.S3Hook('aws_s3')
    obj = hook.get_key(key, bucket_name)
    contents = obj.get()['Body'].read().decode('utf-8')
    return contents
However, when I invoke that function I get the following error:
Using connection to: id: aws_s3.
Host: {MY_ACCESS_KEY},
Port: None,
Schema: {MY_SECRET_ACCESS_KEY},
Login: None,
Password: None,
extra: {}
ERROR - Unable to locate credentials
It appears that when I format the URI according to Airflow's documentation, it's setting the access key as the host and the secret access key as the schema.
It's clearly reading the environment variable, since it has the correct conn_id. It also has the correct values for my access key and secret; it's just parsing them into the wrong fields.
When I set the connection in the UI, the function works if I set Login to my access key and Password to my token. So how am I formatting my environment variable URI wrong?
Found the issue: s3://accesskey:secretkey@S3 is the correct format. The problem was that my aws_secret_access_key had a special character in it and had to be URL-encoded. That fixed everything.
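For illustration, a small sketch of that URL-encoding step (the key values below are made up):
from urllib.parse import quote

# Hypothetical credentials; quote() percent-encodes characters such as
# '/', '+' or '@' that would otherwise confuse the URI parser.
access_key = "AKIAEXAMPLEKEY"
secret_key = "abc/def+ghi"

uri = "s3://{}:{}@S3".format(quote(access_key, safe=''), quote(secret_key, safe=''))
print(uri)  # export this value as AIRFLOW_CONN_AWS_S3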

I am working on a Databricks notebook running with ADLS Gen2 using a service principal id but receive the following error after mounting my drive

I am working on a Databricks notebook running with ADLS Gen2 using a service principal id but receive the following error after mounting my drive.
StatusCode=403
StatusDescription=This request is not authorized to perform this operation using this permission.
configs = {"dfs.adls.oauth2.access.token.provider.type":
"ClientCredential",
"dfs.adls.oauth2.client.id": "78jkj56-2ght-2345-3453-b497jhgj7587",
"dfs.adls.oauth2.credential": dbutils.secrets.get(scope =
"DBRScope", key = "AKVsecret"),
"dfs.adls.oauth2.refresh.url":
"https://login.microsoftonline.com/bdef8a20-aaac-4f80-b3a0-
d9a32f99fd33/oauth2/token"}
dbutils.fs.mount(source =
"adl://<accountname>.azuredatalakestore.net/tempfile",mount_point =
"/mnt/tempfile",extra_configs = configs)
%fs ls mnt/tempfile
The URI for your lake is a Gen1 URI, not Gen2. Either way, your service principal does not have permission to access the lake. As a test, make it a resource owner, then remove it and work out which permissions are missing.
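For comparison, a rough sketch of the equivalent Gen2 (abfss://) mount using the OAuth configs; the container, account, application id and tenant id below are placeholders, not values from the question:
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<application-id>",
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="DBRScope", key="AKVsecret"),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"}

# The service principal still needs data-plane access on the filesystem
# (e.g. Storage Blob Data Contributor) for the mount to be usable.
dbutils.fs.mount(source="abfss://<container>@<accountname>.dfs.core.windows.net/tempfile",
                 mount_point="/mnt/tempfile",
                 extra_configs=configs)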

Pyspark not using TemporaryAWSCredentialsProvider

I'm trying to read files from S3 with Pyspark using temporary session credentials, but keep getting the error:
Received error response: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: null, AWS Request ID: XXXXXXXX, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: XXXXXXX
I think the issue might be that the S3A connection needs to use org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider in order to pull in the session token in addition to the standard access key and secret key, but even with setting the fs.s3a.aws.credentials.provider configuration variable, it is still attempting to authenticate with BasicAWSCredentialsProvider. Looking at the logs I see:
DEBUG AWSCredentialsProviderChain:105 - Loading credentials from BasicAWSCredentialsProvider
I've followed the directions here to add the necessary configuration values, but they do not seem to make any difference. Here is the code I'm using to set it up:
import os
import sys
import pyspark
from pyspark.sql import SQLContext
from pyspark.context import SparkContext
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.11.83,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
sc = SparkContext()
sc.setLogLevel("DEBUG")
sc._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", os.environ.get("AWS_ACCESS_KEY_ID"))
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", os.environ.get("AWS_SECRET_ACCESS_KEY"))
sc._jsc.hadoopConfiguration().set("fs.s3a.session.token", os.environ.get("AWS_SESSION_TOKEN"))
sql_context = SQLContext(sc)
Why is TemporaryAWSCredentialsProvider not being used?
Which Hadoop version are you using?
S3A STS support was added in Hadoop 2.8.0, and this was the exact error message I got on Hadoop 2.7.
Wafle is right, it's 2.8+ only.
But you might be able to get away with setting the AWS_ environment variables and having the session secrets picked up that way: AWS environment variable support has long been in there, and I think it will pick up AWS_SESSION_TOKEN.
See AWS docs
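A minimal sketch of that environment-variable route, assuming you stay on Hadoop 2.7 and export the temporary credentials before launching Spark (the bucket and key are placeholders):
import os
from pyspark.context import SparkContext

# AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN are expected
# to be exported in the shell that starts this script, instead of being set
# on hadoopConfiguration().
os.environ['PYSPARK_SUBMIT_ARGS'] = ('--packages com.amazonaws:aws-java-sdk-pom:1.11.83,'
                                     'org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell')
sc = SparkContext()
rdd = sc.textFile("s3a://<bucket>/<key>")
print(rdd.count())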

Enable Cloud Vision API to access a file on Cloud Storage

I have already seen that there are some similar questions, but none of them actually provides a full answer.
Since I cannot comment in that thread, I am opening a new one.
How do I address Brandon's comment below?
"... In order to use the Cloud Vision API with a non-public GCS object, you'll need to send OAuth authentication information along with your request for a user or service account which has permission to read the GCS object."
I have the JSON file the system gave me, as described here, when I created the service account.
I am trying to run the API from a Python script.
It is not clear how to use it.
I'd recommend using the Vision API Client Library for Python to perform the call. You can install it on your machine (ideally in a virtualenv) by running the following command:
pip install --upgrade google-cloud-vision
Next, you'll need to set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the file path of the JSON file that contains your service account key. For example, on a Linux machine you'd do it like this:
export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/service-account-file.json"
Finally, you'll just have to call the Vision API client's method you desire (for example here the label_detection method) like so:
from google.cloud import vision
from google.cloud.vision import types

def detect_labels():
    """Detects labels in the file located in Google Cloud Storage."""
    client = vision.ImageAnnotatorClient()
    image = types.Image()
    image.source.image_uri = "gs://bucket_name/path_to_image_object"
    response = client.label_detection(image=image)
    labels = response.label_annotations
    print('Labels:')
    for label in labels:
        print(label.description)
By initializing the client with no parameters, the library will automatically look for the GOOGLE_APPLICATION_CREDENTIALS environment variable you've previously set and run on behalf of this service account. If you granted it permission to access the file, it'll run successfully.
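If you prefer not to rely on the environment variable, the client can also be handed the key explicitly; a small sketch (the key path is a placeholder):
from google.cloud import vision
from google.oauth2 import service_account

# Load the service-account key directly instead of going through
# GOOGLE_APPLICATION_CREDENTIALS.
credentials = service_account.Credentials.from_service_account_file(
    "/home/user/Downloads/service-account-file.json")
client = vision.ImageAnnotatorClient(credentials=credentials)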

Configuring Google cloud bucket as Airflow Log folder

We just started using Apache Airflow in our project for our data pipelines. While exploring the features, we came to know about configuring a remote folder as the log destination in Airflow. For that we:
Created a Google Cloud bucket.
From the Airflow UI, created a new GS connection.
I am not able to understand all the fields. I just created a sample GS bucket under my project from the Google console and gave that project ID to this connection, leaving the key file path and scopes blank.
Then I edited the airflow.cfg file as follows:
remote_base_log_folder = gs://my_test_bucket/
remote_log_conn_id = test_gs
After these changes I restarted the web server and scheduler, but my DAGs are still not writing logs to the GS bucket. I am able to see the logs being created in base_log_folder, but nothing is created in my bucket.
Is there any extra configuration needed from my side to get it working?
Note: Using Airflow 1.8. (I faced the same issue with Amazon S3 as well.)
Updated on 20/09/2017
Tried the GS method (screenshot attached).
Still I am not getting logs in the bucket.
Thanks
Anoop R
I advise you to use a DAG to connect Airflow to GCP instead of the UI.
First, create a service account on GCP and download the JSON key.
Then execute this DAG (you can modify the scope of your access):
from airflow import DAG, settings
from airflow.models import Connection
from airflow.operators.python_operator import PythonOperator
from datetime import datetime
import json

def add_gcp_connection(ds, **kwargs):
    """Add an Airflow connection for GCP"""
    new_conn = Connection(
        conn_id='gcp_connection_id',
        conn_type='google_cloud_platform',
    )
    scopes = [
        "https://www.googleapis.com/auth/pubsub",
        "https://www.googleapis.com/auth/datastore",
        "https://www.googleapis.com/auth/bigquery",
        "https://www.googleapis.com/auth/devstorage.read_write",
        "https://www.googleapis.com/auth/logging.write",
        "https://www.googleapis.com/auth/cloud-platform",
    ]
    conn_extra = {
        "extra__google_cloud_platform__scope": ",".join(scopes),
        "extra__google_cloud_platform__project": "<name_of_your_project>",
        "extra__google_cloud_platform__key_path": '<path_to_your_json_key>'
    }
    conn_extra_json = json.dumps(conn_extra)
    new_conn.set_extra(conn_extra_json)

    session = settings.Session()
    if not (session.query(Connection).filter(Connection.conn_id == new_conn.conn_id).first()):
        session.add(new_conn)
        session.commit()
    else:
        msg = '\n\tA connection with `conn_id`={conn_id} already exists\n'
        msg = msg.format(conn_id=new_conn.conn_id)
        print(msg)

dag = DAG('add_gcp_connection', start_date=datetime(2016, 1, 1), schedule_interval='@once')

# Task to add a connection
AddGCPCreds = PythonOperator(
    dag=dag,
    task_id='add_gcp_connection_python',
    python_callable=add_gcp_connection,
    provide_context=True)
Thanks to Yu Ishikawa for this code.
Yes, you need to provide additional information for both the S3 and the GCP connection.
S3
Configuration is passed via the extra field as JSON. You can provide only a profile
{"profile": "xxx"}
or credentials
{"profile": "xxx", "aws_access_key_id": "xxx", "aws_secret_access_key": "xxx"}
or path to config file
{"profile": "xxx", "s3_config_file": "xxx", "s3_config_format": "xxx"}
In case of the first option, boto will try to detect your credentials.
Source code - airflow/hooks/S3_hook.py:107
GCP
You can either provide key_path and scope (see Service account credentials) or credentials will be extracted from your environment in this order:
Environment variable GOOGLE_APPLICATION_CREDENTIALS pointing to a file with stored credentials information.
Stored "well known" file associated with gcloud command line tool.
Google App Engine (production and testing)
Google Compute Engine production environment.
Source code - airflow/contrib/hooks/gcp_api_base_hook.py:68
The reason for logs not being written to your bucket could be related to the service account rather than the config on Airflow itself. Make sure it has access to the mentioned bucket. I had the same problems in the past.
Try adding more generous permissions to the service account, e.g. even project-wide Editor, and then narrow them down. You could also try using the GCS client with that key and see if you can write to the bucket, as sketched below.
For me personally this scope works fine for writing logs: "https://www.googleapis.com/auth/cloud-platform"
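As a quick sanity check of the suggestion above, a sketch that writes a test object to the log bucket with the same key (paths and names are placeholders):
from google.cloud import storage

# Use the same service-account key that is configured in the Airflow connection.
client = storage.Client.from_service_account_json("<path_to_your_json_key>")
bucket = client.get_bucket("my_test_bucket")
blob = bucket.blob("airflow-log-write-test.txt")
blob.upload_from_string("test")
print("write to bucket OK")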