How do I activate a Data Pipeline when new files arrive on S3? For EMR scheduling, can the pipeline be triggered using SNS when new files arrive on S3?
You can execute a Data Pipeline without using SNS when files arrive in an S3 location:
Create an S3 event notification on the bucket that invokes a Lambda function (a boto3 sketch for wiring this up follows the Lambda code below).
Create the Lambda function (make sure the role you assign it has S3, Lambda, and Data Pipeline permissions).
Paste the code below into the Lambda function to activate the data pipeline (fill in your own pipeline ID).
import boto3

def lambda_handler(event, context):
    client = boto3.client('datapipeline', region_name='ap-southeast-2')
    data_pipeline_id = "df-09312983K28XXXXXXXX"  # replace with your pipeline ID
    try:
        # Verify the pipeline exists, then activate it
        client.describe_pipelines(pipelineIds=[data_pipeline_id])
        client.activate_pipeline(pipelineId=data_pipeline_id, parameterValues=[])
    except Exception as e:
        raise Exception("Pipeline {} not found or could not be activated: {}".format(data_pipeline_id, e))
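The S3 event notification from the first step can be created in the console or programmatically. Below is a minimal sketch using boto3; the bucket name, Lambda function ARN, and prefix filter are placeholders you would replace with your own values:
import boto3

s3 = boto3.client('s3')

# Placeholder bucket name and Lambda ARN -- replace with your own values.
s3.put_bucket_notification_configuration(
    Bucket='my-input-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'LambdaFunctionArn': 'arn:aws:lambda:ap-southeast-2:123456789012:function:activate-pipeline',
                'Events': ['s3:ObjectCreated:*'],
                # Optional: only fire for objects under a given prefix
                'Filter': {
                    'Key': {
                        'FilterRules': [{'Name': 'prefix', 'Value': 'incoming/'}]
                    }
                }
            }
        ]
    }
)
Note that S3 must also be allowed to invoke the function (a resource-based policy added via lambda add-permission, or created automatically when you set up the trigger in the Lambda console).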
Related
We have an AWS Glue DataBrew job that writes its output to a folder in an S3 bucket. A Java Lambda is then notified via the Put event notification, but the following sample code throws an exception:
S3EventNotification.S3EventNotificationRecord record = event.getRecords().get(0);
String s3Bucket = record.getS3().getBucket().getName();
String s3Key= record.getS3().getObject().getUrlDecodedKey();
//following throws exception --404 NoSuchKey
S3Object s3object = s3Client.getObject(s3Bucket , s3Key);
In the logs we see that the key is something like:
input_files/processed_file_22Dec2022_1671678897600/fdg629ae-4f91-4869-891c-79200772fb92/databrew-test-put-object.temp
So is it that the Lambda gets notified for a file that is still being copied into the S3 folder? When we upload the file manually using the console it works fine, but when the DataBrew job uploads it we see these issues.
I expect the file to be read by the Lambda function with the correct key.
Thanks
What is your trigger event type?
So is it that the Lambda gets the file which is still being copied into the S3 folder?
If you have a Put trigger, Lambda will get triggered when the object upload completes. S3 wouldn't create a temporary object and then delete it.
I haven't used AWS Glue DataBrew before, but perhaps it is DataBrew that is creating that temporary object? If that is the case, you need to handle it in your code.
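If DataBrew does write a temporary object first, one option is to simply skip those keys in the handler. A rough sketch (in Python for illustration, while the question uses Java; the .temp suffix check and the process_object helper are assumptions based on the key seen in the logs):
import urllib.parse

def lambda_handler(event, context):
    for record in event.get('Records', []):
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        # Assumption: DataBrew's intermediate objects end with ".temp";
        # skip them and only handle the final output objects.
        if key.endswith('.temp'):
            continue
        process_object(record['s3']['bucket']['name'], key)  # hypothetical helper
Alternatively, a suffix filter on the S3 event notification itself (matching only the real output extension) would keep the Lambda from being invoked for the temporary objects at all.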
We're using boto3 with Linode Object Storage, which is compatible with AWS S3 according to their documentation.
Everything seems to work well, except the cross-region copy operation.
When I download an object from the source region/bucket and then upload it to the destination region/bucket, everything works well. However, I'd like to avoid that unnecessary download/upload step.
I have a bucket named test-bucket in both regions, and I'd like to copy the object named test-object from the us-east-1 to the us-southeast-1 cluster.
Here is the example code I'm using:
from boto3.session import Session

sess = Session(
    aws_access_key_id='***',
    aws_secret_access_key='***'
)

s3_client_src = sess.client(
    service_name='s3',
    region_name='us-east-1',
    endpoint_url='https://us-east-1.linodeobjects.com'
)

# test-bucket and test-object already exist.
s3_client_trg = sess.client(
    service_name='s3',
    region_name='us-southeast-1',
    endpoint_url='https://us-southeast-1.linodeobjects.com'
)

copy_source = {
    'Bucket': 'test-bucket',
    'Key': 'test-object'
}

s3_client_trg.copy(CopySource=copy_source, Bucket='test-bucket', Key='test-object', SourceClient=s3_client_src)
When I call:
s3_client_src.list_objects(Bucket='test-bucket')['Contents']
It shows me that test-object exists, but when I run the copy it throws the following message:
An error occurred (NoSuchKey) when calling the CopyObject operation: Unknown
Any help is appreciated!
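For what it's worth, the download-and-re-upload workaround mentioned above can be done without touching the local disk by streaming the body between the two clients. A minimal sketch, assuming the object fits comfortably in memory (larger objects would call for a multipart upload):
# Read the object from the source cluster and write it to the target cluster.
obj = s3_client_src.get_object(Bucket='test-bucket', Key='test-object')
s3_client_trg.put_object(
    Bucket='test-bucket',
    Key='test-object',
    Body=obj['Body'].read(),
    ContentType=obj.get('ContentType', 'binary/octet-stream')
)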
I am trying to unit test a function that writes data to S3 and then reads the same data back from the same S3 location. I am trying to use moto and boto (2.x) to achieve that [1]. The problem is that the service returns that I am forbidden to access the key [2]. A similar problem (even though the error message is a bit different) is reported in the moto GitHub repository [3], but it is not resolved yet.
Has anyone ever successfully tested mocked s3 read/write in PySpark to share some insights?
[1]
import boto
import pytest
from boto.s3.key import Key
from moto import mock_s3
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

_test_bucket = 'test-bucket'
_test_key = 'data.csv'

@pytest.fixture(scope='function')
def spark_context(request):
    conf = SparkConf().setMaster("local[2]").setAppName("pytest-pyspark-local-testing")
    sc = SparkContext(conf=conf)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", 'test-access-key-id')
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", 'test-secret-access-key')
    request.addfinalizer(lambda: sc.stop())
    quiet_py4j(sc)  # helper defined elsewhere in the test suite
    return sc

spark_test = pytest.mark.usefixtures("spark_context")

@spark_test
@mock_s3
def test_tsv_read_from_and_write_to_s3(spark_context):
    spark = SQLContext(spark_context)
    s3_conn = boto.connect_s3()
    s3_bucket = s3_conn.create_bucket(_test_bucket)
    k = Key(s3_bucket)
    k.key = _test_key
    k.set_contents_from_string('')
    s3_uri = 's3n://{}/{}'.format(_test_bucket, _test_key)
    df = (spark
          .read
          .csv(s3_uri))
[2]
(...)
E py4j.protocol.Py4JJavaError: An error occurred while calling o33.csv.
E : org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/data.csv' - ResponseCode=403, ResponseMessage=Forbidden
(...)
[3]
https://github.com/spulec/moto/issues/1543
moto is a library that is used to mock AWS resources.
1. Create the resource:
If you try to access an S3 bucket which doesn't exist, AWS will return a Forbidden error.
Usually we need these resources created even before our tests run, so create a pytest fixture with autouse set to True:
import pytest
import boto3
from moto import mock_s3

@pytest.fixture(autouse=True)
def fixture_mock_s3():
    with mock_s3():
        conn = boto3.resource('s3', region_name='us-east-1')
        conn.create_bucket(Bucket='MYBUCKET')  # an empty test bucket is created
        yield
The above code creates a mock S3 bucket named "MYBUCKET". The bucket is empty.
The name of the bucket should be the same as that of the original bucket.
With autouse, the fixture is automatically applied to every test.
You can run your tests confidently, as they will not have access to the original bucket.
2. Define and run tests involving the resource:
Suppose you have code that writes a file to an S3 bucket:
def write_to_s3(filepath: str):
    s3 = boto3.resource('s3', region_name='us-east-1')
    s3.Bucket('MYBUCKET').upload_file(filepath, 'A/B/C/P/data.txt')
This can be tested the following way:
from botocore.exceptions import ClientError

def test_write_to_s3():
    dummy_file_path = f"{TEST_DIR}/data/dummy_data.txt"
    # The s3 bucket is created by the fixture and now lies empty;
    # test for emptiness
    s3 = boto3.resource('s3', region_name='us-east-1')
    bucket = s3.Bucket("MYBUCKET")
    objects = list(bucket.objects.all())
    assert objects == []
    # Now, let's write a file to s3
    write_to_s3(dummy_file_path)
    # the below assert statement doesn't throw any error
    assert s3.meta.client.head_object(Bucket='MYBUCKET', Key='A/B/C/P/data.txt')
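The ClientError import above becomes useful if you also want to assert the negative case. A small sketch, assuming the same autouse fixture is active:
import boto3
import pytest
from botocore.exceptions import ClientError

def test_missing_key_raises():
    s3_client = boto3.client('s3', region_name='us-east-1')
    # head_object on a key that was never uploaded should raise a 404 ClientError
    with pytest.raises(ClientError):
        s3_client.head_object(Bucket='MYBUCKET', Key='does/not/exist.txt')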
I am trying to upload content to Amazon S3 but I am getting this error:
boto3.exceptions.unknownapiversionerror: The 's3' resource does not an
API Valid API versions are: 2006-03-01
import boto3
boto3.resource('s3',**AWS_ACCESS_KEY_ID**,**AWS_PRIVATE_KEY**)
bucket = s3.Bucket( **NAME OF BUCKET**)
obj = bucket.Object(**KEY**)
obj.upload_fileobj(**FILE OBJECT**)
The error is caused by a DataNotFound-derived exception raised in the boto3.Session source code: because the credentials are passed as positional arguments, boto3 interprets them as the region_name and api_version parameters, and it cannot find API data for that (nonexistent) version. The developers perhaps didn't anticipate callers passing the arguments this way.
If you read the boto3 documentation examples, this is the correct way to upload data:
import boto3

# Pass the credentials as keyword arguments and keep a reference to the resource.
s3 = boto3.resource('s3',
                    aws_access_key_id=AWS_ACCESS_KEY_ID,
                    aws_secret_access_key=AWS_PRIVATE_KEY)
bucket = s3.Bucket(BUCKET_NAME)
obj = bucket.Object("prefix/object_key_name")
# You must pass the file object!
with open('filename', 'rb') as fileobject:
    obj.upload_fileobj(fileobject)
So what I want to do is set a GPIO pin on my RPi whenever an S3 bucket adds or deletes a file. I currently have a Lambda function set to trigger whenever this occurs. The problem now is getting the function to set the flag. What I currently have in my Lambda function is shown below, but nothing is coming through on my device shadow. My end goal is to have a folder on my RPi stay in sync with the bucket whenever a file is added or deleted, without any user input or a cron job.
import json
import boto3

def lambda_handler(event, context):
    client = boto3.client('iot-data', region_name='us-west-2')
    # Change topic, qos and payload as needed
    response = client.publish(
        topic='$aws/things/MyThing/shadow/update',
        qos=1,
        payload=json.dumps({"state": {"desired": {"switch": "on"}}})
    )
Go to the CloudWatch log for your Lambda function; what does it say there?
Since you are intending to update the shadow document, have you tried the function "update_thing_shadow"?
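For reference, a minimal sketch of what that could look like with the same iot-data client; the thing name and desired state mirror the publish call above and are assumptions:
import json
import boto3

def lambda_handler(event, context):
    client = boto3.client('iot-data', region_name='us-west-2')
    # Update the device shadow directly instead of publishing to the update topic
    response = client.update_thing_shadow(
        thingName='MyThing',
        payload=json.dumps({"state": {"desired": {"switch": "on"}}})
    )
    # The response payload is a streaming body containing the updated shadow state
    return response['payload'].read().decode('utf-8')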