Flink streaming job won't connect to Localstack S3

I'm using a product called Localstack to mock Amazon S3 locally, which is serving as a streaming file sink for a Flink job.
In the run logs, I can see that Flink disregards Localstack and attempts to contact Amazon S3.
Received error response: org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Service Unavailable
Retrying Request: HEAD https://s3.amazonaws.com /testBucket/
In my flink-conf.yaml, I've specified the following configuration properties:
s3.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
s3.buffer.dir: ./tmp
s3.endpoint: localhost:4566
s3.path.style.access: true
s3.access-key: ***
s3.secret-key: ***
Why might Flink disregard the s3.endpoint?

Your config is almost right; you just need to add the http scheme to the endpoint when using Localstack:
s3.endpoint: http://localhost:4566
and, if it still fails, try setting dummy credentials as environment variables:
AWS_ACCESS_KEY_ID=foo
AWS_SECRET_ACCESS_KEY=bar
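If it still doesn't connect, it helps to separate Localstack problems from Flink problems by hitting the same endpoint with boto3 directly. A minimal sketch, assuming the dummy credentials above and the testBucket seen in your logs:

import boto3

# Talk to Localstack directly with the same endpoint and dummy credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:4566",
    aws_access_key_id="foo",
    aws_secret_access_key="bar",
    region_name="us-east-1",
)

# Both calls should succeed if Localstack and the bucket are reachable.
s3.head_bucket(Bucket="testBucket")
print(s3.list_objects_v2(Bucket="testBucket").get("KeyCount", 0))

If this check passes but Flink still calls s3.amazonaws.com, the endpoint setting is probably not being read from the flink-conf.yaml your cluster actually loads.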

Related

AWS S3 Connection in druid

I have set up a clustered Druid deployment with the configuration described in the Druid documentation:
https://druid.apache.org/docs/latest/tutorials/cluster.html
I am using AWS S3 for deep storage. Below is a snippet of my common configuration file:
druid.extensions.loadList=["druid-datasketches", "mysql-metadata-storage", "druid-s3-extensions", "druid-orc-extensions", "druid-lookups-cached-global"]
# For S3:
druid.storage.type=s3
druid.storage.bucket=bucket-name
druid.storage.baseKey=druid/segments
#druid.storage.disableAcl=true
druid.storage.sse.type=s3
#druid.s3.accessKey=...
#druid.s3.secretKey=...
# For S3:
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=bucket-name
druid.indexer.logs.s3Prefix=druid/stage/indexing-logs
While running any ingestion task, I am getting an Access Denied error:
java.io.IOException: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: ; S3 Extended Request ID: ), S3 Extended Request ID:
at org.apache.druid.storage.s3.S3DataSegmentPusher.push(S3DataSegmentPusher.java:103) ~[?:?]
at org.apache.druid.segment.realtime.appenderator.AppenderatorImpl.lambda$mergeAndPush$4(AppenderatorImpl.java:791) ~[druid-server-0.19.0.jar:0.19.0]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:87) ~[druid-core-0.19.0.jar:0.19.0]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:115) ~[druid-core-0.19.0.jar:0.19.0]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:105) ~[druid-core-0.19.0.jar:0.19.0]
I am using S3 for two purposes:
1. To read data from S3 and ingest it. This connection is working fine and data is being read from the S3 location.
2. For deep storage. This is where I am getting the error.
I am using the profile-based authentication method to provide S3 credentials, so I have already configured the AWS CLI with the appropriate credentials. Also, the S3 data is encrypted with AES256, so I have added druid.storage.sse.type=s3 to the config file.
Can someone help me out here? I am not able to debug the issue.
You asked how to approach debugging this. Normally I would:
1. SSH onto the EC2 instance and run aws sts get-caller-identity. This will tell you what principal your requests are sent as. Then I would confirm that principal has the S3 access that is expected.
2. I would confirm that I can write to the bucket in your configuration (a minimal write check is sketched after these steps):
druid.storage.type=s3
druid.storage.bucket=<bucket-name>
druid.storage.baseKey=druid/segments
3. I would try one of the other auth methods, such as exporting the keys into the environment (mentioned in the third option), since that is a simple test. Then I would run step 1 again to confirm the principal reflects those keys, and then I would try running your ingestion task again.
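A minimal sketch of that write check with boto3, run from the same EC2 instance; the bucket name is a placeholder, the key mirrors druid.storage.baseKey, and ServerSideEncryption matches druid.storage.sse.type=s3 (SSE-S3 / AES256):

import boto3

session = boto3.Session()  # uses the same profile/instance role as the AWS CLI

# Step 1: which principal are requests sent as?
print(session.client("sts").get_caller_identity()["Arn"])

# Step 2: can that principal write under the deep-storage prefix?
s3 = session.client("s3")
s3.put_object(
    Bucket="bucket-name",                  # placeholder: your deep-storage bucket
    Key="druid/segments/_write_test",      # same prefix as druid.storage.baseKey
    Body=b"test",
    ServerSideEncryption="AES256",         # matches druid.storage.sse.type=s3
)
print("write OK")

If the put_object call fails with the same 403, the problem is the principal's S3 permissions (or the bucket policy) rather than Druid itself.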

How to programmatically set up Airflow 1.10 logging with localstack s3 endpoint?

In an attempt to set up Airflow logging to Localstack S3 buckets for local and Kubernetes dev environments, I am following the Airflow documentation for logging to S3. To give a little context, Localstack is a local AWS cloud stack with AWS services, including S3, running locally.
I added the following environment variables to my Airflow containers, similar to another Stack Overflow post, in an attempt to log to my local S3 buckets. This is what I added to docker-compose.yaml for all Airflow containers:
- AIRFLOW__CORE__REMOTE_LOGGING=True
- AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://local-airflow-logs
- AIRFLOW__CORE__REMOTE_LOG_CONN_ID=MyS3Conn
- AIRFLOW__CORE__ENCRYPT_S3_LOGS=False
I've also added my localstack s3 creds to airflow.cfg
[MyS3Conn]
aws_access_key_id = foo
aws_secret_access_key = bar
aws_default_region = us-east-1
host = http://localstack:4572 # s3 port. not sure if this is right place for it
Additionally, I've installed apache-airflow[hooks] and apache-airflow[s3], though it's not clear which one is really needed based on the documentation.
I've followed the steps in a previous Stack Overflow post in an attempt to verify that the S3Hook can write to my Localstack S3 instance:
from airflow.hooks import S3Hook
s3 = S3Hook(aws_conn_id='MyS3Conn')
s3.load_string('test','test',bucket_name='local-airflow-logs')
But I get botocore.exceptions.NoCredentialsError: Unable to locate credentials.
After adding the credentials in the Airflow console under /admin/connection/edit, the new exception is botocore.exceptions.ClientError: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records. Other people have encountered this same issue, and it may have been related to networking.
Regardless, a programmatic setup is needed, not a manual one.
I was able to access the bucket using a standalone Python script (entering AWS credentials explicitly with boto), but it needs to work as part of Airflow.
Is there a proper way to set up host / port / credentials for S3Hook by adding MyS3Conn to airflow.cfg?
Based on the Airflow S3 hook source code, it seems a custom S3 URL may not yet be supported by Airflow. However, based on the Airflow aws_hook source code (its parent), it seems it should be possible to set the endpoint_url, including the port, and that it should be read from airflow.cfg.
I am able to inspect and write to my s3 bucket in localstack using boto alone. Also, curl http://localstack:4572/local-mochi-airflow-logs returns the contents of the bucket from the airflow container. And aws --endpoint-url=http://localhost:4572 s3 ls returns Could not connect to the endpoint URL: "http://localhost:4572/".
What other steps might be needed to log to Localstack S3 buckets from Airflow running in Docker with an automated setup, and is this even supported yet?
I think you're supposed to use localhost, not localstack, for the endpoint, e.g. host = http://localhost:4572.
In Airflow 1.10 you can override the endpoint on a per-connection basis, but unfortunately it only supports one endpoint at a time, so you'd be changing it for all AWS hooks using that connection. To override it, edit the relevant connection and in the "Extra" field put:
{"host": "http://localhost:4572"}
I believe this will fix it.
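Since a programmatic setup is wanted, here is a minimal sketch of creating that connection (including the Extra host) from code instead of the UI. It assumes the conn_id MyS3Conn and the dummy Localstack credentials from the question; adjust the host to http://localstack:4572 if Airflow runs in the same docker-compose network as Localstack:

import json

from airflow import settings
from airflow.models import Connection

def upsert_localstack_conn():
    session = settings.Session()
    # Replace any existing MyS3Conn so reruns stay idempotent.
    session.query(Connection).filter(Connection.conn_id == "MyS3Conn").delete()
    session.add(
        Connection(
            conn_id="MyS3Conn",
            conn_type="aws",
            login="foo",       # dummy Localstack access key
            password="bar",    # dummy Localstack secret key
            extra=json.dumps({"host": "http://localhost:4572"}),
        )
    )
    session.commit()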
I managed to make this work by referring to this guide. Basically, you need to create a connection using the Connection class and pass the credentials that you need; in my case I needed AWS_SESSION_TOKEN, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and REGION_NAME to make this work. Use this function as the python_callable in a PythonOperator, which should be the first part of the DAG.
import os
import json

from airflow import settings
from airflow.exceptions import AirflowFailException
from airflow.models.connection import Connection


def _create_connection(**context):
    """
    Sets the connection information about the environment using the Connection
    class instead of doing it manually in the Airflow UI
    """
    AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
    AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
    AWS_SESSION_TOKEN = os.getenv("AWS_SESSION_TOKEN")
    REGION_NAME = os.getenv("REGION_NAME")
    credentials = [
        AWS_SESSION_TOKEN,
        AWS_ACCESS_KEY_ID,
        AWS_SECRET_ACCESS_KEY,
        REGION_NAME,
    ]
    if not credentials or any(not credential for credential in credentials):
        raise AirflowFailException("Environment variables were not passed")
    extras = json.dumps(
        dict(
            aws_session_token=AWS_SESSION_TOKEN,
            aws_access_key_id=AWS_ACCESS_KEY_ID,
            aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
            region_name=REGION_NAME,
        ),
    )
    try:
        conn = Connection(
            conn_id="s3_con",
            conn_type="S3",
            extra=extras,
        )
        # The connection only becomes visible to hooks once it is persisted
        # in the metadata database; creating the object alone is not enough.
        session = settings.Session()
        session.add(conn)
        session.commit()
    except Exception as e:
        raise AirflowFailException(
            f"Error creating connection to Airflow :{e!r}",
        )
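For completeness, a minimal sketch of wiring this in as the first task of a DAG on Airflow 1.10 (the DAG id and schedule are hypothetical, and _create_connection is assumed to live in the same file):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

with DAG(
    dag_id="s3_connection_bootstrap",   # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_connection = PythonOperator(
        task_id="create_connection",
        python_callable=_create_connection,
        provide_context=True,  # Airflow 1.10 needs this so **context is passed
    )
    # Tasks that use the "s3_con" connection should be set downstream of this.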

Redshift Spectrum / The bucket you are attempting to access must be addressed using the specified endpoint

I created a parquet file in S3 and an external table pointing to it in Redshift / Spectrum. Both my S3 bucket and Redshift cluster are in us-west-2. I specified the option region when creating the schema.
Queries run smoothly in Athena.
Yet when I run from Redshift client, I get this error:
Amazon Invalid operation: S3 Query Exception (Fetch)
Details:
error: S3 Query Exception (Fetch)
code: 15001
context: Task failed due to an internal error.
HTTP response error code: 301 Message: PermanentRedirect The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.
x-amz-request-id: XXXX
query: XXXXX
location: dory_util.cpp:689
process: query0_40 [pid=XXX]
-----------------------------------------------;
AWS has acknowledged the issue and released a patch overnight.
Please make sure that your Redshift cluster is running at least version 1.0.14016 in us-east-2 or us-west-2, and 1.0.1407 in us-east-1. To apply the patch immediately, move your cluster's maintenance window closer to the current day and time so it picks the patch up at your convenience.
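If you prefer to do that from code rather than the console, a minimal boto3 sketch (the cluster identifier and window here are placeholders) that checks the current version and moves the maintenance window:

import boto3

redshift = boto3.client("redshift", region_name="us-west-2")

# Check which version the cluster is currently running.
cluster = redshift.describe_clusters(ClusterIdentifier="my-cluster")["Clusters"][0]
print(cluster["ClusterVersion"])

# Move the maintenance window closer to now (UTC, format ddd:hh24:mi-ddd:hh24:mi)
# so the patch is picked up at the next maintenance.
redshift.modify_cluster(
    ClusterIdentifier="my-cluster",
    PreferredMaintenanceWindow="wed:07:30-wed:08:00",
)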

Amazon S3 File Read Timeout. Trying to download a file using JAVA

I'm new to Amazon S3. I get the following error when trying to access a file from Amazon S3 using a simple Java method.
2016-08-23 09:46:48 INFO request:450 - Received successful response:200, AWS Request ID: F5EA01DB74D0D0F5
Caught an AmazonClientException, which means the client encountered an
internal error while trying to communicate with S3, such as not being
able to access the network.
Error Message: Unable to store object contents to disk: Read timed out
The exact same lines of code worked yesterday: I was able to download 100% of a 5 GB file in 12 minutes. Today I'm in a better-connected environment, but only 2% or 3% of the file is downloaded and then the program fails.
Code that I'm using to download:
s3Client.getObject(new GetObjectRequest("mybucket", file.getKey()), localFile);
You need to set the connection timeout and the socket timeout in your client configuration.
There is a reference article on this; here is an excerpt from it:
Several HTTP transport options can be configured through the com.amazonaws.ClientConfiguration object. Default values will suffice for the majority of users, but users who want more control can configure:
Socket timeout
Connection timeout
Maximum retry attempts for retry-able errors
Maximum open HTTP connections
Here is an example of how to do it, from a related question:
Downloading files >3Gb from S3 fails with "SocketTimeoutException: Read timed out"
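As an illustration, a minimal sketch with the AWS SDK for Java v1; the timeout, retry, and connection-pool values are placeholders to tune for your network, and the bucket/key/file names stand in for the ones in the question:

import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import java.io.File;

public class S3DownloadWithTimeouts {
    public static void main(String[] args) {
        // Placeholder values; tune them for your network conditions.
        ClientConfiguration config = new ClientConfiguration()
                .withConnectionTimeout(10_000)  // connection timeout, ms
                .withSocketTimeout(120_000)     // socket (read) timeout, ms
                .withMaxErrorRetry(5)           // retries for retryable errors
                .withMaxConnections(50);        // max open HTTP connections

        AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                .withClientConfiguration(config)
                .build();

        // Same call as in the question, now using the tuned client.
        s3Client.getObject(new GetObjectRequest("mybucket", "path/to/key"),
                new File("localFile"));
    }
}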

Elasticsearch Writes Into S3 Bucket for Metadata But Doesn't Write Into S3 Bucket For Scheduled Snapshots?

Following the documentation at http://www.elasticsearch.org/tutorials/2011/08/22/elasticsearch-on-ec2.html, I have included the respective API access key ID with its secret access key, and Elasticsearch is able to write into the S3 bucket as follows:
[2012-08-02 04:21:38,793][DEBUG][gateway.s3] [Schultz, Herman] writing to gateway org.elasticsearch.gateway.shared.SharedStorageGateway$2#4e64f6fe ...
[2012-08-02 04:21:39,337][DEBUG][gateway.s3] [Schultz, Herman] wrote to gateway org.elasticsearch.gateway.shared.SharedStorageGateway$2#4e64f6fe, took 543ms
However, when it comes to writing snapshots into the S3 bucket, the following error appears:
[2012-08-02 04:25:37,303][WARN ][index.gateway.s3] [Schultz, Herman] [plumbline_2012.08.02][3] failed to read commit point [commit-i] java.io.IOException: Failed to get [commit-i]
Caused by: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: E084E2ED1E68E710, AWS Error Code: InvalidAccessKeyId, AWS Error Message: The AWS Access Key Id you provided does not exist in our records.
[2012-08-02 04:36:06,696][WARN ][index.gateway] [Schultz, Herman] [plumbline_2012.08.02][0] failed to snapshot (scheduled) org.elasticsearch.index.gateway.IndexShardGatewaySnapshotFailedException: [plumbline_2012.08.02][0] Failed to perform snapshot (index files)
Is there a reason why this is happening, given that the access keys I have provided are able to write metadata but not to create snapshots?