Connection error from AWS Fargate to GCP BigQuery when using Workload Identity - google-bigquery

I used Workload Identity from AWS EC2 to GCP BigQuery using the role assigned to the EC2 instance, and it worked fine.
However, when I use Workload Identity from AWS Fargate to GCP BigQuery using the Fargate task role, it does not work.
How should I set up Workload Identity in this case?
I used the libraries below.
implementation(platform("com.google.cloud:libraries-bom:20.9.0"))
implementation("com.google.cloud:google-cloud-bigquery")
The stack trace contains the messages below:
com.google.cloud.bigquery.BigQueryException: Failed to retrieve AWS IAM role.
at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:115) ~[google-cloud-bigquery-1.137.1.jar!/:1.137.1]
…
at java.base/java.lang.Thread.run(Unknown Source) ~[na:na]
Caused by: java.io.IOException: Failed to retrieve AWS IAM role.
at com.google.auth.oauth2.AwsCredentials.retrieveResource(AwsCredentials.java:217) ~[google-auth-library-oauth2-http-0.26.0.jar!/:na]
…
at com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getDataset(HttpBigQueryRpc.java:126) ~[google-cloud-bigquery-1.137.1.jar!/:1.137.1]
... 113 common frames omitted
Caused by: java.net.ConnectException: Invalid argument (connect failed)
at java.base/java.net.PlainSocketImpl.socketConnect(Native Method) ~[na:na]
at com.google.auth.oauth2.AwsCredentials.retrieveResource(AwsCredentials.java:214) ~[google-auth-library-oauth2-http-0.26.0.jar!/:na]
... 132 common frames omitted

I faced a similar issue with Google Cloud Storage (GCS).
As Peter mentioned, retrieving the credentials in an AWS Fargate task is not the same as when the code runs on an EC2 instance, so the Google SDK fails to compose the correct AWS credentials to exchange with Google Workload Identity Federation.
I came up with a workaround that avoids the trouble of editing core files in "../google/auth/aws.py" by doing two things:
Get session credentials with boto3
import boto3
task_credentials = boto3.Session().get_credentials().get_frozen_credentials()
Set the relevant environment variables
import os
from google.auth.aws import environment_vars

os.environ[environment_vars.AWS_ACCESS_KEY_ID] = task_credentials.access_key
os.environ[environment_vars.AWS_SECRET_ACCESS_KEY] = task_credentials.secret_key
os.environ[environment_vars.AWS_SESSION_TOKEN] = task_credentials.token
Explanation:
I am using Python 3.9 with boto3 and google-cloud==2.4.0; however, it should work for other versions of the Google SDK as long as the following code is in the "_get_security_credentials" method of the "Credentials" class in the "google.auth.aws" package:
# Check environment variables for permanent credentials first.
# https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html
env_aws_access_key_id = os.environ.get(environment_vars.AWS_ACCESS_KEY_ID)
env_aws_secret_access_key = os.environ.get(
    environment_vars.AWS_SECRET_ACCESS_KEY
)
# This is normally not available for permanent credentials.
env_aws_session_token = os.environ.get(environment_vars.AWS_SESSION_TOKEN)

if env_aws_access_key_id and env_aws_secret_access_key:
    return {
        "access_key_id": env_aws_access_key_id,
        "secret_access_key": env_aws_secret_access_key,
        "security_token": env_aws_session_token,
    }
Caveat:
When running code inside an ECS task, the credentials being used are temporary (ECS assumes the task's role), so you can't generate temporary credentials via AWS STS the way that is usually recommended.
Why is this a problem? Since the task runs with temporary credentials, they are subject to expiration and refresh. To handle that, you can set up a background function that repeats the operation every 5 minutes or so (I haven't run into a case where the temporary credentials actually expired).
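A minimal sketch of such a refresh loop, reusing the same boto3/google-auth pieces as the snippet above (the 5-minute interval and the use of a daemon timer are just one way to do it, not part of the original workaround):
import os
import threading

import boto3
from google.auth.aws import environment_vars

def refresh_aws_env_credentials(interval_seconds=300):
    """Re-export the task's temporary credentials so the token exchange keeps working."""
    creds = boto3.Session().get_credentials().get_frozen_credentials()
    os.environ[environment_vars.AWS_ACCESS_KEY_ID] = creds.access_key
    os.environ[environment_vars.AWS_SECRET_ACCESS_KEY] = creds.secret_key
    os.environ[environment_vars.AWS_SESSION_TOKEN] = creds.token
    # Schedule the next refresh on a daemon timer so it never blocks shutdown.
    timer = threading.Timer(interval_seconds, refresh_aws_env_credentials, args=(interval_seconds,))
    timer.daemon = True
    timer.start()

refresh_aws_env_credentials()  # call once at startup, before creating the GCS/BigQuery client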

I had the same issue, but with Python code; either way, the situation should be the same.
You're getting this because retrieving the AWS IAM role on AWS Fargate is different from AWS EC2: on EC2 you can get the credentials from the instance metadata, as shown here:
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/s3access
While on AWS Fargate:
curl 169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI
So to get around that, the following needs to be done:
Change the GCP Workload Identity Federation credential file content [wif_cred_file] as follows:
wif_cred_file["credential_source"]["url"]=f"http://169.254.170.2{AWS_CONTAINER_CREDENTIALS_RELATIVE_URI}"
In the "python3.8/site-packages/google/auth/aws.py" file in the library [Try to find the similar file in Java], I've updated this code as the following:
Comment this line:
# role_name = self._get_metadata_role_name(request)
Remove role_name from the _get_metadata_security_credentials function arguments.
Alternatively, you can make the step 1 change inside the aws.py file instead; either way should be fine.
And that should be it.
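For reference, a rough Python sketch of step 1, assuming the credential file lives at the hypothetical path below and that the task has AWS_CONTAINER_CREDENTIALS_RELATIVE_URI set (ECS/Fargate sets it automatically):
import json
import os

wif_path = "/path/to/wif_credentials.json"  # hypothetical location of the WIF credential file
relative_uri = os.environ["AWS_CONTAINER_CREDENTIALS_RELATIVE_URI"]

with open(wif_path) as f:
    wif_cred_file = json.load(f)

# Point the credential source at the Fargate task credentials endpoint instead of EC2 instance metadata.
wif_cred_file["credential_source"]["url"] = f"http://169.254.170.2{relative_uri}"

with open(wif_path, "w") as f:
    json.dump(wif_cred_file, f)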

Related

AWS S3 Connection in druid

I have set up a clustered Druid with the configuration as mentioned in the Druid documentation
https://druid.apache.org/docs/latest/tutorials/cluster.html
I am using AWS S3 for deep storage. Following is the snippet of my common configuration file
druid.extensions.loadList=["druid-datasketches", "mysql-metadata-storage", "druid-s3-extensions", "druid-orc-extensions", "druid-lookups-cached-global"]
# For S3:
druid.storage.type=s3
druid.storage.bucket=bucket-name
druid.storage.baseKey=druid/segments
#druid.storage.disableAcl=true
druid.storage.sse.type=s3
#druid.s3.accessKey=...
#druid.s3.secretKey=...
# For S3:
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=bucket-name
druid.indexer.logs.s3Prefix=druid/stage/indexing-logs
While running any ingestion task, I am getting an Access Denied error:
Java.io.IOException: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: ; S3 Extended Request ID: ), S3 Extended Request ID:
at org.apache.druid.storage.s3.S3DataSegmentPusher.push(S3DataSegmentPusher.java:103) ~[?:?]
at org.apache.druid.segment.realtime.appenderator.AppenderatorImpl.lambda$mergeAndPush$4(AppenderatorImpl.java:791) ~[druid-server-0.19.0.jar:0.19.0]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:87) ~[druid-core-0.19.0.jar:0.19.0]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:115) ~[druid-core-0.19.0.jar:0.19.0]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:105) ~[druid-core-0.19.0.jar:0.19.0]
I am using S3 for two purposes:
Reading data from S3 and ingesting it. This connection is working fine and data is being read from the S3 location.
Deep storage. This is where I am getting the error.
I am using the profile information authentication method to provide S3 credentials, so I have already configured the AWS CLI with the appropriate credentials. Also, the S3 data is encrypted with AES256, so I have added druid.storage.sse.type=s3 to the config file.
Can someone help me out here, as I am not able to debug the issue?
You asked how to approach debugging this. Normally I would:
SSH onto the EC2 instance and run aws sts get-caller-identity. This will tell you what principal your requests are sent from. Then, I would confirm that principal has the S3 access that is expected.
I would confirm that I can write to the bucket in your configuration.
druid.storage.type=s3
druid.storage.bucket=<bucket-name>
druid.storage.baseKey=druid/segments
I would try some of the other auth methods, such as exporting the keys into the environment (mentioned in the third option), since that is a simple test. Then I would run step 1 again to confirm that my principal reflects those keys, and then I would try running your code again.
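For the first two steps, a quick sketch with boto3 (assuming it picks up the same profile/credentials Druid is using, and reusing the bucket and baseKey placeholders from your config):
import boto3

# Step 1: which principal are the requests actually sent from?
print(boto3.client("sts").get_caller_identity())

# Step 2: can that principal write under the configured baseKey?
s3 = boto3.client("s3")
s3.put_object(Bucket="bucket-name", Key="druid/segments/_write_test", Body=b"test")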

How to programmatically set up Airflow 1.10 logging with localstack s3 endpoint?

In an attempt to set up Airflow logging to localstack S3 buckets for local and Kubernetes dev environments, I am following the Airflow documentation for logging to S3. To give a little context, localstack is a local AWS cloud stack with AWS services, including S3, running locally.
I added the following environment variables to my Airflow containers, similar to this other Stack Overflow post, in an attempt to log to my local S3 buckets. This is what I added to docker-compose.yaml for all Airflow containers:
- AIRFLOW__CORE__REMOTE_LOGGING=True
- AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://local-airflow-logs
- AIRFLOW__CORE__REMOTE_LOG_CONN_ID=MyS3Conn
- AIRFLOW__CORE__ENCRYPT_S3_LOGS=False
I've also added my localstack s3 creds to airflow.cfg
[MyS3Conn]
aws_access_key_id = foo
aws_secret_access_key = bar
aws_default_region = us-east-1
host = http://localstack:4572 # s3 port. not sure if this is right place for it
Additionally, I've installed apache-airflow[hooks], and apache-airflow[s3], though it's not clear which one is really needed based on the documentation.
I've followed the steps in a previous Stack Overflow post in an attempt to verify that the S3Hook can write to my localstack S3 instance:
from airflow.hooks import S3Hook
s3 = S3Hook(aws_conn_id='MyS3Conn')
s3.load_string('test','test',bucket_name='local-airflow-logs')
But I get botocore.exceptions.NoCredentialsError: Unable to locate credentials.
After adding credentials in the Airflow console under /admin/connection/edit as depicted:
this is the new exception that is returned: botocore.exceptions.ClientError: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records. Other people have encountered this same issue, and it may have been related to networking.
Regardless, a programmatic setup is needed, not a manual one.
I was able to access the bucket using a standalone Python script (entering AWS credentials explicitly with boto), but it needs to work as part of airflow.
Is there a proper way to set up host / port / credentials for S3Hook by adding MyS3Conn to airflow.cfg?
Based on the airflow s3 hooks source code, it seems a custom s3 URL may not yet be supported by airflow. However, based on the airflow aws_hook source code (parent) it seems it should be possible to set the endpoint_url including port, and it should be read from airflow.cfg.
I am able to inspect and write to my S3 bucket in localstack using boto alone. Also, curl http://localstack:4572/local-mochi-airflow-logs returns the contents of the bucket from the Airflow container, while aws --endpoint-url=http://localhost:4572 s3 ls returns Could not connect to the endpoint URL: "http://localhost:4572/".
What other steps might be needed to log to localstack S3 buckets from Airflow running in Docker with an automated setup, and is this even supported yet?
I think you're supposed to use localhost, not localstack, for the endpoint, e.g. host = http://localhost:4572.
In Airflow 1.10 you can override the endpoint on a per-connection basis, but unfortunately it only supports one endpoint at a time, so you'd be changing it for all AWS hooks using the connection. To override it, edit the relevant connection and in the "Extra" field put:
{"host": "http://localhost:4572"}
I believe this should fix it.
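If you need to set that Extra field programmatically rather than through the UI, a rough sketch for Airflow 1.10 (assuming the MyS3Conn connection already exists; use whichever hostname actually resolves from where Airflow runs) would be:
from airflow import settings
from airflow.models import Connection

session = settings.Session()
conn = session.query(Connection).filter(Connection.conn_id == "MyS3Conn").one()
# "localhost" works from the host machine; "localstack" from inside the docker-compose network.
conn.set_extra('{"host": "http://localhost:4572"}')
session.add(conn)
session.commit()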
I managed to make this work by referring to this guide. Basically, you need to create a connection using the Connection class and pass the credentials you need; in my case I needed AWS_SESSION_TOKEN, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and REGION_NAME to make this work. Use this function as a python_callable in a PythonOperator, which should be the first part of the DAG (see the sketch after the function).
import os
import json
from airflow.models.connection import Connection
from airflow.exceptions import AirflowFailException


def _create_connection(**context):
    """
    Sets the connection information about the environment using the Connection
    class instead of doing it manually in the Airflow UI
    """
    AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
    AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
    AWS_SESSION_TOKEN = os.getenv("AWS_SESSION_TOKEN")
    REGION_NAME = os.getenv("REGION_NAME")
    credentials = [
        AWS_SESSION_TOKEN,
        AWS_ACCESS_KEY_ID,
        AWS_SECRET_ACCESS_KEY,
        REGION_NAME,
    ]
    if not credentials or any(not credential for credential in credentials):
        raise AirflowFailException("Environment variables were not passed")

    extras = json.dumps(
        dict(
            aws_session_token=AWS_SESSION_TOKEN,
            aws_access_key_id=AWS_ACCESS_KEY_ID,
            aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
            region_name=REGION_NAME,
        ),
    )
    try:
        Connection(
            conn_id="s3_con",
            conn_type="S3",
            extra=extras,
        )
    except Exception as e:
        raise AirflowFailException(
            f"Error creating connection to Airflow :{e!r}",
        )
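For illustration, wiring that callable into the start of a DAG could look roughly like this (the DAG id, task id, and start date are placeholders, not from the original answer):
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

with DAG("create_s3_connection", start_date=datetime(2021, 1, 1), schedule_interval="@once") as dag:
    # Run the connection setup as the first task of the DAG.
    create_connection = PythonOperator(
        task_id="create_s3_connection_task",
        python_callable=_create_connection,
        provide_context=True,
    )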

I have an error "java.io.FileNotFoundException: No such file or directory" while trying to create a dynamic frame using a notebook in AWS Glue

I'm setting up a new Jupyter Notebook in AWS Glue as a dev endpoint in order to test out some code for running an ETL script. So far I created a basic ETL script using AWS Glue but, for some reason, when trying to run the code on the Jupyter Notebook, I keep getting a FileNotFoundException.
I'm using a table (in the data catalog) that was created by an AWS crawler to fetch the information associated with an S3 bucket, and I'm able to actually get the filenames inside the bucket, but when I try to read a file using the dynamic frame, a FileNotFoundException is thrown.
Has anyone ever had this issue before?
This is running in an N. Virginia AWS account. I've already set up the permissions, granted IAM roles to the AWS Glue service, set up the VPC endpoints, and tried running the job directly in AWS Glue, to no avail.
This is the basic code:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "xxx-database", table_name = "mytable_item", transformation_ctx = "datasource0")
datasource0.printSchema()
datasource0.show()
Alternatively:
datasource0 = glueContext.create_dynamic_frame.from_options('s3', connection_options={"paths":["s3://my-bucket/92387979/My-Table-Item/2016-09-11T16:30:00.000Z/"]}, format="json", transformation_ctx="")
datasource0.printSchema()
datasource0.show()
I would expect to receive the dynamic frame content, but instead this error is thrown:
An error occurred while calling o343.schema.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 23, ip-172-31-87-88.ec2.internal, executor 6): java.io.FileNotFoundException: No such file or directory 's3://my-bucket/92387979/My-Table-Item/2016-09-11T16:30:00.000Z/92387979-My-Table-Item-2016-09-11T16:30:00.000Z.json.gz'
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:826)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1206)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:166)
at com.amazonaws.services.glue.hadoop.TapeHadoopRecordReader.initialize(TapeHadoopRecordReader.scala:99)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:182)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:179)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
Thanks in advance for any help given.
Well, as Chris D'Englere and Harsh Bafna pointed out, it was indeed a permissions issue. As it turns out, I forgot to add specific S3 permissions for the objects (GetObject) inside the bucket, and not only for the bucket itself.
Thanks for the help!
In all likelihood, the issue is with S3 permissions.
Go to IAM and add the s3:GetObject permission to the policy attached to the role that Glue is using. Make sure to specify the specific S3 bucket in the resource associated with the policy.
I had the same issue, tried what I described above, and now it's working.
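For reference, a sketch of attaching such a statement with boto3; the role name and policy name are illustrative placeholders, and the bucket name is the placeholder from the question, so adjust all three to your setup:
import json
import boto3

# Allow reading objects inside the bucket (not just actions on the bucket itself).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-bucket/*",
        }
    ],
}

boto3.client("iam").put_role_policy(
    RoleName="AWSGlueServiceRole-MyJob",  # the role your Glue job/dev endpoint uses (placeholder)
    PolicyName="glue-s3-get-object",      # illustrative inline policy name
    PolicyDocument=json.dumps(policy),
)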

Configuring Google cloud bucket as Airflow Log folder

We just started using Apache Airflow in our project for our data pipelines. While exploring its features, we came across the option of configuring a remote folder as the log destination in Airflow. For that we:
Created a Google Cloud Storage bucket.
Created a new GS connection from the Airflow UI.
I am not able to understand all the fields. I just created a sample GS bucket under my project from the Google console and gave that project ID to this connection, leaving the key file path and scopes blank.
Then I edited the airflow.cfg file as follows:
remote_base_log_folder = gs://my_test_bucket/
remote_log_conn_id = test_gs
After these changes I restarted the web server and scheduler, but my DAGs are still not writing logs to the GS bucket. I can see the logs being created in base_log_folder, but nothing is created in my bucket.
Is there any extra configuration needed on my side to get it working?
Note: Using Airflow 1.8. (I faced the same issue with Amazon S3 as well.)
Updated on 20/09/2017
Tried the GS method, attaching a screenshot.
I am still not getting logs in the bucket.
Thanks
Anoop R
I advise you to use a DAG to connect Airflow to GCP instead of the UI.
First, create a service account on GCP and download the JSON key.
Then execute this DAG (you can modify the scope of your access):
import json

from airflow import DAG, settings
from airflow.models import Connection
from airflow.operators.python_operator import PythonOperator
from datetime import datetime


def add_gcp_connection(ds, **kwargs):
    """Add an Airflow connection for GCP"""
    new_conn = Connection(
        conn_id='gcp_connection_id',
        conn_type='google_cloud_platform',
    )
    scopes = [
        "https://www.googleapis.com/auth/pubsub",
        "https://www.googleapis.com/auth/datastore",
        "https://www.googleapis.com/auth/bigquery",
        "https://www.googleapis.com/auth/devstorage.read_write",
        "https://www.googleapis.com/auth/logging.write",
        "https://www.googleapis.com/auth/cloud-platform",
    ]
    conn_extra = {
        "extra__google_cloud_platform__scope": ",".join(scopes),
        "extra__google_cloud_platform__project": "<name_of_your_project>",
        "extra__google_cloud_platform__key_path": '<path_to_your_json_key>'
    }
    conn_extra_json = json.dumps(conn_extra)
    new_conn.set_extra(conn_extra_json)

    session = settings.Session()
    if not (session.query(Connection).filter(Connection.conn_id == new_conn.conn_id).first()):
        session.add(new_conn)
        session.commit()
    else:
        msg = '\n\tA connection with `conn_id`={conn_id} already exists\n'
        msg = msg.format(conn_id=new_conn.conn_id)
        print(msg)


dag = DAG('add_gcp_connection', start_date=datetime(2016, 1, 1), schedule_interval='@once')

# Task to add a connection
AddGCPCreds = PythonOperator(
    dag=dag,
    task_id='add_gcp_connection_python',
    python_callable=add_gcp_connection,
    provide_context=True)
Thanks to Yu Ishikawa for this code.
Yes, you need to provide additional information for both the S3 and GCP connections.
S3
Configuration is passed via the extra field as JSON. You can provide only a profile
{"profile": "xxx"}
or credentials
{"profile": "xxx", "aws_access_key_id": "xxx", "aws_secret_access_key": "xxx"}
or path to config file
{"profile": "xxx", "s3_config_file": "xxx", "s3_config_format": "xxx"}
In case of the first option, boto will try to detect your credentials.
Source code - airflow/hooks/S3_hook.py:107
GCP
You can either provide key_path and scope (see Service account credentials) or credentials will be extracted from your environment in this order:
Environment variable GOOGLE_APPLICATION_CREDENTIALS pointing to a file with stored credentials information.
Stored "well known" file associated with gcloud command line tool.
Google App Engine (production and testing)
Google Compute Engine production environment.
Source code - airflow/contrib/hooks/gcp_api_base_hook.py:68
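As a quick check of the first option above, you can verify which credentials and project the Google client libraries will pick up from your environment (a sketch; the key path is a placeholder):
import os
import google.auth

# Assumption: path to your downloaded service account key; drop this line to exercise the other fallbacks.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/key.json"

credentials, project = google.auth.default()
print(project, type(credentials))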
The reason for the logs not being written to your bucket could be related to the service account rather than to the Airflow config itself. Make sure it has access to the bucket in question. I had the same problems in the past.
Try adding more generous permissions to the service account, e.g. even project-wide Editor, and then narrowing it down. You could also try using the GCS client with that key and see if you can write to the bucket.
For me personally this scope works fine for writing logs: "https://www.googleapis.com/auth/cloud-platform"
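A minimal sketch of that write test, assuming google-cloud-storage is installed, the key path is a placeholder, and the bucket name is the one from the question:
from google.cloud import storage

# Authenticate explicitly with the downloaded service account key.
client = storage.Client.from_service_account_json("/path/to/key.json")
bucket = client.bucket("my_test_bucket")
bucket.blob("logs/_write_test.txt").upload_from_string("test")
print("write succeeded")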

Upload artifacts to s3 from Jenkins

I am using Jenkins, and on post-build I want to push artifacts to S3.
But I am getting the following error:
Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId; Request ID: E9EF9BE1E1D0C011), S3 Extended Request ID: wsyJXgV9If7Yk/GbgI486HrQ5RFZbvnQt/haOBJq3nZ6aLFbWEvKmnHE9ly+05eOab2qTPOQjZU=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1275)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:873)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:576)
at com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:362)
at com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:328)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:307)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3659)
at com.amazonaws.services.s3.AmazonS3Client.initiateMultipartUpload(AmazonS3Client.java:2651)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.initiateMultipartUpload(UploadCallable.java:350)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.uploadInParts(UploadCallable.java:178)
at com.amazonaws.services.s3.transfer.internal.UploadCallable.call(UploadCallable.java:121)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:139)
at com.amazonaws.services.s3.transfer.internal.UploadMonitor.call(UploadMonitor.java:47)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I tried with the latest Java 1.8 and the latest Java 1.7, but I keep getting this error again and again. I tried the S3 publisher plugin 0.8 and also 0.10.1.
Project Config :
Plugin Config :
You're getting a 403 (forbidden) error, which indicates that you're either missing valid credentials for the bucket, or that the bucket's security settings, such as server-side encryption (SSE), are not being respected.
First, update to the latest version of the S3 publisher plugin - it's added support for SSE, and if your bucket needs it enabled, you can check the box for "Server side encryption" in your pipeline configuration.
Second, you'll need to modify the S3 profile in the Jenkins "Configure System" form. In your question, the highlighted field for your access key is empty, and that must be provided, along with the secret key component.
Once you've entered the configuration correctly and verified that bucket requirements are satisfied, you should be in the clear for pushing your objects to S3.
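If you want to rule out the key pair itself before touching Jenkins, a quick check outside Jenkins (a sketch; the key values are placeholders):
import boto3

sts = boto3.client(
    "sts",
    aws_access_key_id="AKIA...",        # the access key you plan to enter in the Jenkins profile
    aws_secret_access_key="<secret>",   # and its matching secret key
)
print(sts.get_caller_identity())  # fails with an authentication error if the key pair is not valid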
I had the same issue while trying to push artifacts to an S3 bucket from Jenkins. Later I found out that it threw errors because I was providing the wrong bucket in the Jenkins config.