Copy files from S3 to EMR local using Lambda

I need to move files from S3 to EMR's local directory /home/hadoop programmatically using Lambda.
S3DistCp copies over to HDFS. I then log into EMR and run an HDFS CopyToLocal command on the command line to get the files to /home/hadoop.
Is there a programmatic way using boto3 in Lambda to copy from S3 to EMR's local directory?

I wrote a test Lambda function to submit a job step to EMR that copies files from S3 to EMR's local dir. This worked.
import boto3

emrclient = boto3.client('emr', region_name='us-west-2')

def lambda_handler(event, context):
    # List every cluster that can still accept steps
    EMRS = emrclient.list_clusters(ClusterStates=['STARTING', 'RUNNING', 'WAITING'])
    clusters = EMRS["Clusters"]
    print(clusters)
    for cluster in clusters:
        ID = cluster["Id"]
        # Submit a step that copies the S3 prefix to the master node's local filesystem
        response = emrclient.add_job_flow_steps(
            JobFlowId=ID,
            Steps=[
                {
                    'Name': 'AWS S3 Copy',
                    'ActionOnFailure': 'CONTINUE',
                    'HadoopJarStep': {
                        'Jar': 'command-runner.jar',
                        'Args': ["aws", "s3", "cp", "s3://XXX/", "/home/hadoop/copy/", "--recursive"],
                    }
                }
            ],
        )
If there are better ways to do the copy, please do let me know.

That would need a way for the AWS Lambda function to remotely trigger the CopyToLocal command on the cluster.
The Lambda function could call add-steps to request the cluster to run a script that does this action.
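A minimal sketch of that approach, assuming the cluster ID is already known; the helper name and the HDFS/local paths are placeholders, and it reuses the command-runner.jar pattern from the question above:

import boto3

emrclient = boto3.client('emr', region_name='us-west-2')

def copy_to_local(cluster_id, hdfs_path, local_path):
    # Ask the cluster to run "hdfs dfs -copyToLocal" as an EMR step, pulling the
    # files that S3DistCp already placed in HDFS down to the master node's local dir.
    return emrclient.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[
            {
                'Name': 'HDFS CopyToLocal',
                'ActionOnFailure': 'CONTINUE',
                'HadoopJarStep': {
                    'Jar': 'command-runner.jar',
                    'Args': ['hdfs', 'dfs', '-copyToLocal', hdfs_path, local_path],
                },
            }
        ],
    )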

Related

How to update aws_lambda_function Terraform resource when ZIP package is changed on S3?

The Zip package is uploaded to S3, but not by Terraform.
The Lambda is provisioned by the Terraform aws_lambda_function resource. When I change the Zip package on S3 and run terraform apply, Terraform says nothing has changed.
There is a source_code_hash field in the aws_lambda_function resource that can be set to a hash of the package content. But whatever value I provide for this hash, it is not updated in the Terraform state.
How do I tell Terraform to update the Lambda when the Zip package in S3 is updated?
After numerous experiments to verify how Terraform handles the hash, I found the following:
source_code_hash of the aws_lambda_function resource is stored in the Terraform state at the moment the Lambda is provisioned.
source_code_hash is updated only if you provide a new value for it in the aws_lambda_function resource and this new value corresponds to the hash of the actual Zip package in S3.
So Terraform checks the actual hash of the package on S3 only at that moment; it doesn't check it when you run terraform apply.
So to make it work we have the following options:
Download the Zip package from S3, calculate its hash and pass it to the source_code_hash field of the aws_lambda_function resource, OR
Upload the Zip package to S3 with Terraform using the aws_s3_bucket_object resource. Set the source_hash field in that resource to save it in the Terraform state. This value can then be used by the aws_lambda_function resource for updates.
Unfortunately this behaviour is not documented and I spent a lot of time discovering it. Moreover, since it isn't documented, it could change at any moment. :-(
So how did I solve this problem?
I generate a base64-encoded SHA256 hash of the Lambda Zip file and store it as metadata on the Zip object itself. Then I read this metadata in Terraform and pass it to source_code_hash.
Details:
Generate the hash with the openssl dgst -binary -sha256 lambda_package.zip | openssl base64 command.
Store the hash as metadata while uploading the package with the aws s3 cp lambda_package.zip s3://my-best-bucket/lambda_package.zip --metadata hash=[HASH_VALUE] command.
Pass the hash to source_code_hash in Terraform:
data "aws_s3_bucket_object" "package" {
  bucket = "my-best-bucket"
  key    = "lambda_package.zip"
}

resource "aws_lambda_function" "main" {
  ...
  source_code_hash = data.aws_s3_bucket_object.package.metadata.Hash
  ...
}
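The first two steps (hashing the package and uploading it with the hash metadata) can also be scripted with boto3 instead of openssl and the AWS CLI; a minimal sketch, reusing the bucket and key names from the example above with a hypothetical helper name:

import base64
import hashlib

import boto3

def upload_with_hash(path="lambda_package.zip", bucket="my-best-bucket", key="lambda_package.zip"):
    # Base64-encoded SHA256 of the package, the same value openssl produces above
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).digest()
    hash_value = base64.b64encode(digest).decode()

    # Store the hash as object metadata so Terraform can read it through the
    # aws_s3_bucket_object data source
    s3 = boto3.client("s3")
    s3.upload_file(path, bucket, key, ExtraArgs={"Metadata": {"hash": hash_value}})
    return hash_value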
Another way of handling this, if you can't hash the S3 object, is to use its version, e.g.:
data "aws_s3_bucket_object" "application_zip" {
  bucket = var.apps_bucket
  key    = var.app_key
}

resource "aws_lambda_function" "lambda" {
  function_name     = var.function_name
  s3_bucket         = var.apps_bucket
  s3_key            = var.app_key
  handler           = var.handler
  runtime           = var.runtime
  memory_size       = var.memory_size
  timeout           = var.timeout
  role              = aws_iam_role.lambda.arn
  s3_object_version = data.aws_s3_bucket_object.application_zip.version_id
}
This means the Lambda is redeployed whenever the object version changes in S3.

How to integrate CEPH with Amazon-S3?

I'm trying to adapt the open-source project mmfashion to Amazon SageMaker, which requires the ceph module as a backend. Unfortunately pip install ceph doesn't work. The only work-around was to build the ceph source code manually by running the following in my container:
!git clone git://github.com/ceph/ceph
!git submodule update --init --recursive
This does allow me to import ceph successfully, but it throws the following error when it comes to fetching data from Amazon S3:
AttributeError: module 'ceph' has no attribute 'S3Client'
Has someone integrated Ceph with an Amazon S3 bucket, or does anyone have suggestions along the same lines on how to tackle this?
You can use the Ceph S3 API to connect to AWS buckets. Here is a simple Python example script that connects to any S3 API:
import boto
import boto.s3.connection

access_key = 'put your access key here!'
secret_key = 'put your secret key here!'

conn = boto.connect_s3(
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
    host='objects.dreamhost.com',
    # is_secure=False,  # uncomment if you are not using SSL
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)
Then you will be able to list the buckets:
for bucket in conn.get_all_buckets():
    print("{name}\t{created}".format(
        name=bucket.name,
        created=bucket.creation_date,
    ))
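If you are on boto3 rather than the legacy boto package, the same connection can be made by pointing endpoint_url at the Ceph RGW (or any other S3-compatible) endpoint; a minimal sketch using the same example host and placeholder keys:

import boto3

s3 = boto3.client(
    's3',
    aws_access_key_id='put your access key here!',
    aws_secret_access_key='put your secret key here!',
    endpoint_url='https://objects.dreamhost.com',  # any S3-compatible endpoint
)

# List the buckets, equivalent to the boto example above
for bucket in s3.list_buckets()['Buckets']:
    print(bucket['Name'], bucket['CreationDate'])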

creating boto3 s3 client on Airflow with an s3 connection and s3 hook

I am trying to move my python code to Airflow. I have the following code snippet:
import boto3

s3_client = boto3.client(
    's3',
    region_name="us-west-2",
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
)
I am trying to recreate this s3_client using Airflow's S3 hook and an S3 connection, but I can't find a way to do it in any documentation without specifying the aws_access_key_id and the aws_secret_access_key directly in code.
Any help would be appreciated.
You need to define an AWS connection in Admin -> Connections or with the CLI (see the docs).
Once the connection is defined, you can use it in S3Hook.
Your connection object can be set as:
Conn Id: <your_choice_of_conn_id_name>
Conn Type: Amazon Web Services
Login: <aws_access_key>
Password: <aws_secret_key>
Extra: {"region_name": "us-west-2"}
In Airflow the hooks wrap a Python package, so if your code uses the hook there shouldn't be a reason to import boto3 directly.
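A minimal sketch of using the hook once the connection above exists (the conn id and bucket name are hypothetical, and the import path shown is for Airflow 2.x; on 1.10 it lives in airflow.hooks.S3_hook):

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

hook = S3Hook(aws_conn_id="my_aws_conn")  # the Conn Id you defined above
hook.load_string("hello", key="test.txt", bucket_name="my-bucket")

# If you really need the underlying boto3 client, the hook exposes it:
s3_client = hook.get_conn()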

Insufficient log-delivery permissions when using AWS-cdk and aws lambda

I am trying to create a centralized logging bucket, and then to send the logs of all my other S3 buckets to it using Lambda and the AWS CDK. The centralized logging bucket has been created, but there is an error when using Lambda to configure the other buckets to write to it. Here is my code:
import boto3

s3 = boto3.resource('s3')

def handler(event, context):
    setBucketPolicy(target_bucket='s3baselinestack-targetloggingbucketbab31bd5-b6y2hkvqz0of')

def setBucketPolicy(target_bucket):
    # Enable access logging on every bucket that doesn't already have it
    for bucket in s3.buckets.all():
        bucket_logging = s3.BucketLogging(bucket.name)
        if not bucket_logging.logging_enabled:
            response = bucket_logging.put(
                BucketLoggingStatus={
                    'LoggingEnabled': {
                        'TargetBucket': target_bucket,
                        'TargetPrefix': f'{bucket.name}/'
                    }
                },
            )
            print(response)
Here is my error:
START RequestId: 320e83c0-ba5e-4d54-a78c-a462d6e0cb87 Version: $LATEST
An error occurred (InvalidTargetBucketForLogging) when calling the PutBucketLogging operation: You must give the log-delivery group WRITE and READ_ACP permissions to the target bucket: ClientError
Traceback (most recent call last):
Note: Everything works except this log-delivery permission; when I enable it through the AWS console it works fine, but I need to do it programmatically! Thank you in advance.
According to the documentation for S3 logging, you must grant the Log Delivery group WRITE and READ_ACP permissions on the target bucket for logs, and this is done using the S3 ACLs.
https://docs.aws.amazon.com/AmazonS3/latest/dev/enable-logging-programming.html#grant-log-delivery-permissions-general
When creating a new bucket with CDK, this is set using the accessControl property. The default value is BucketAccessControl.PRIVATE.
new s3.Bucket(this, 'bucket', {
  accessControl: s3.BucketAccessControl.LOG_DELIVERY_WRITE
})
Since CloudFormation has no way to add ACLs to existing buckets, CDK also has no such method. For an existing bucket, grant Log Delivery via the web console, the API, or the CLI with aws s3api put-bucket-acl.
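For the Lambda in the question, a minimal boto3 sketch of the same change, using the canned log-delivery-write ACL (the target bucket name is a placeholder):

import boto3

s3_client = boto3.client('s3')

# Grants the S3 Log Delivery group WRITE and READ_ACP on the target bucket;
# 'log-delivery-write' is the canned-ACL equivalent of LOG_DELIVERY_WRITE in CDK.
s3_client.put_bucket_acl(
    Bucket='my-central-logging-bucket',  # placeholder name
    ACL='log-delivery-write',
)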
Other services, such as CloudFront, don't use ACLs anymore and use IAM policies which can be added using bucket.addToResourcePolicy().
https://docs.aws.amazon.com/cdk/api/latest/docs/#aws-cdk_aws-s3.IBucket.html#add-wbr-to-wbr-resource-wbr-policypermission

How to programmatically set up Airflow 1.10 logging with localstack s3 endpoint?

In an attempt to set up airflow logging to localstack s3 buckets, for local and kubernetes dev environments, I am following the airflow documentation for logging to s3. To give a little context, localstack is a local AWS cloud stack with AWS services, including s3, running locally.
I added the following environment variables to my airflow containers similar to this other stack overflow post in attempt to log to my local s3 buckets. This is what I added to docker-compose.yaml for all airflow containers:
- AIRFLOW__CORE__REMOTE_LOGGING=True
- AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://local-airflow-logs
- AIRFLOW__CORE__REMOTE_LOG_CONN_ID=MyS3Conn
- AIRFLOW__CORE__ENCRYPT_S3_LOGS=False
I've also added my localstack s3 creds to airflow.cfg
[MyS3Conn]
aws_access_key_id = foo
aws_secret_access_key = bar
aws_default_region = us-east-1
host = http://localstack:4572 # s3 port. not sure if this is right place for it
Additionally, I've installed apache-airflow[hooks], and apache-airflow[s3], though it's not clear which one is really needed based on the documentation.
I've followed the steps in a previous stack overflow post in an attempt to verify whether the S3Hook can write to my localstack s3 instance:
from airflow.hooks import S3Hook
s3 = S3Hook(aws_conn_id='MyS3Conn')
s3.load_string('test','test',bucket_name='local-airflow-logs')
But I get botocore.exceptions.NoCredentialsError: Unable to locate credentials.
After adding credentials in the Airflow console under /admin/connection/edit, a new exception is returned: botocore.exceptions.ClientError: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records. Other people have encountered this same issue and it may have been related to networking.
Regardless, a programmatic setup is needed, not a manual one.
I was able to access the bucket using a standalone Python script (entering AWS credentials explicitly with boto), but it needs to work as part of airflow.
Is there a proper way to set up host / port / credentials for S3Hook by adding MyS3Conn to airflow.cfg?
Based on the airflow s3 hooks source code, it seems a custom s3 URL may not yet be supported by airflow. However, based on the airflow aws_hook source code (parent) it seems it should be possible to set the endpoint_url including port, and it should be read from airflow.cfg.
I am able to inspect and write to my s3 bucket in localstack using boto alone. Also, curl http://localstack:4572/local-mochi-airflow-logs returns the contents of the bucket from the airflow container. And aws --endpoint-url=http://localhost:4572 s3 ls returns Could not connect to the endpoint URL: "http://localhost:4572/".
What other steps might be needed to log to localstack s3 buckets from airflow running in docker, with automated setup and is this even supported yet?
I think you're supposed to use localhost not localstack for the endpoint, e.g. host = http://localhost:4572.
In Airflow 1.10 you can override the endpoint on a per-connection basis but unfortunately it only supports one endpoint at a time so you'd be changing it for all AWS hooks using the connection. To override it, edit the relevant connection and in the "Extra" field put:
{"host": "http://localhost:4572"}
I believe this will fix it?
I managed to make this work by referring to this guide. Basically, you need to create a connection using the Connection class and pass the credentials that you need; in my case I needed AWS_SESSION_TOKEN, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and REGION_NAME to make this work. Use this function as a python_callable in a PythonOperator, which should be the first part of the DAG (a wiring sketch follows the function).
import os
import json

from airflow.models.connection import Connection
from airflow.exceptions import AirflowFailException

def _create_connection(**context):
    """
    Sets the connection information about the environment using the Connection
    class instead of doing it manually in the Airflow UI
    """
    AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
    AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
    AWS_SESSION_TOKEN = os.getenv("AWS_SESSION_TOKEN")
    REGION_NAME = os.getenv("REGION_NAME")
    credentials = [
        AWS_SESSION_TOKEN,
        AWS_ACCESS_KEY_ID,
        AWS_SECRET_ACCESS_KEY,
        REGION_NAME,
    ]
    if not credentials or any(not credential for credential in credentials):
        raise AirflowFailException("Environment variables were not passed")
    extras = json.dumps(
        dict(
            aws_session_token=AWS_SESSION_TOKEN,
            aws_access_key_id=AWS_ACCESS_KEY_ID,
            aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
            region_name=REGION_NAME,
        ),
    )
    try:
        Connection(
            conn_id="s3_con",
            conn_type="S3",
            extra=extras,
        )
    except Exception as e:
        raise AirflowFailException(
            f"Error creating connection to Airflow :{e!r}",
        )
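A minimal sketch of wiring the callable above into a DAG, as described; the DAG and task ids are hypothetical, and the import path is for Airflow 2.x (on 1.10 use airflow.operators.python_operator and pass provide_context=True so **context is supplied):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="s3_connection_setup",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    create_connection = PythonOperator(
        task_id="create_s3_connection",
        python_callable=_create_connection,
    )
    # Tasks that rely on the "s3_con" connection would be declared downstream of this one.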