We had a third party write a Python-based image thumbnail script that we set up to trigger on an S3 ObjectCreated event. After testing the script, we imported a collection of close to 5,000 images, but the sheer volume of image files filled up the Lambda function's temp space during the import, and only about 12% of the images ended up with thumbnails.
Now we need to create thumbnails for the other 88% manually. I have a PHP-based script I can run from EC2, but it's somewhat slow. It occurs to me that I could instead create them 'on demand' and avoid generating thumbnails for files that never got them during the import.
Some of the files may never be accessed by a customer again. The existing Lambda thumbnailer already has a slight delay that I account for with a JavaScript setTimeout retry loop; whenever a thumbnail is not found, I could first check whether the upload is recent (e.g. within the last 10 seconds) and, if it isn't, trigger the Lambda manually before starting the retry loop.
But to do this, I need to be able to invoke the Lambda function with parameters similar to those of the event trigger. It appears their script only reads the bucket name and key from the event:
bucket = event['Records'][0]['s3']['bucket']['name']
key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
Being unfamiliar with Lambda and still somewhat new to the SDK, I'm not sure how to invoke the function with a payload that includes those values for the Python script.
I can use either the PHP SDK or the JavaScript SDK (or even the CLI).
Any help is appreciated.
I think I figured it out: by copying the data structure the Python script references, I can create a bare-bones payload and invoke the function asynchronously (InvocationType 'Event'):
$lambda = $awsSvc->getAwsSdkCached()->createLambda();
// bucket = event['Records'][0]['s3']['bucket']['name']
// key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
$bucket = "mybucket";
$key = "somefolder/someimage.jpg";
$payload_json = sprintf('{"Records":[{"s3":{"bucket":{"name":"%s"},"object":{"key":"%s"}}}]}', $bucket, $key);
$params = array(
    'FunctionName' => 'ThumbnailGenerator',
    'InvocationType' => 'Event',
    'LogType' => 'Tail',
    'Payload' => $payload_json
);
$result = $lambda->invoke($params);
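Two small notes on the payload: real S3 notifications URL-encode the object key (which is why the Lambda calls unquote_plus), so keys containing spaces or special characters should be encoded the same way in a hand-built payload, and json_encode on a nested PHP array is less error-prone than sprintf for building the JSON. For comparison, here is a minimal boto3 sketch of the same invocation; the function name, bucket, and key are the same placeholders used above:

import json
import boto3

lambda_client = boto3.client('lambda')

bucket = 'mybucket'
key = 'somefolder/someimage.jpg'

# Bare-bones payload mimicking an S3 ObjectCreated notification; only the
# fields the thumbnailer actually reads are included.
payload = {
    'Records': [
        {
            's3': {
                'bucket': {'name': bucket},
                'object': {'key': key}
            }
        }
    ]
}

response = lambda_client.invoke(
    FunctionName='ThumbnailGenerator',
    InvocationType='Event',  # asynchronous, like the S3 trigger
    Payload=json.dumps(payload)
)
print(response['StatusCode'])  # 202 for an asynchronous invocation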
Related
I am building a Python Lambda function in AWS and wanted to add an S3 trigger to it. Following these instructions, I saw how to get the bucket and key that triggered the event:
import urllib.parse

def func(event):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
There is an example of such an object in the link, but I wasn't able to find a description of the entire event object anywhere in AWS's documentation.
Is there documentation for this object's structure? Where might I find it?
You can find documentation about the whole object in the S3 documentation:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-content-structure.html
I would also advise iterating over the records, because a single event can contain more than one:
for record in event['Records']:
    bucket = record['s3']['bucket']['name']
    key = record['s3']['object']['key']
    [...]
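For quick reference, an abridged example of the notification structure documented at that link (values are placeholders, and fields such as userIdentity, requestParameters, and sequencer are omitted):

{
    "Records": [
        {
            "eventVersion": "2.1",
            "eventSource": "aws:s3",
            "awsRegion": "us-east-1",
            "eventTime": "1970-01-01T00:00:00.000Z",
            "eventName": "ObjectCreated:Put",
            "s3": {
                "s3SchemaVersion": "1.0",
                "configurationId": "testConfigRule",
                "bucket": {
                    "name": "mybucket",
                    "arn": "arn:aws:s3:::mybucket"
                },
                "object": {
                    "key": "somefolder/someimage.jpg",
                    "size": 1024,
                    "eTag": "0123456789abcdef0123456789abcdef"
                }
            }
        }
    ]
}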
I have read a lot about different scenarios and questions concerning S3 eventual consistency and how to handle it so as not to get 404 errors, but my use case/requirement is a little unusual. What I'm doing is writing a bunch of files to a temp/transient folder in an S3 bucket (using a Spark job and making sure the job does not fail), then removing the main/destination folder if the previous step is successful, and finally copying the files from the temp folder to the main folder in the same bucket. Here is part of my code:
# first writing objects into the tempPrefix here using pyspark
...
# delete the main folder (old data) here
...
# copy files from the temp to the main folder
for obj in bucket.objects.filter(Prefix=tempPrefix):
    # this function makes sure the specific key is available for read
    # by calling HeadObject with retries - raising an exception otherwise
    waitForObjectToBeAvaiableForRead(bucketName, obj.key)
    copy_source = {
        "Bucket": bucketName,
        "Key": obj.key
    }
    new_key = obj.key.replace(tempPrefix, mainPrefix, 1)
    new_obj = bucket.Object(new_key)
    new_obj.copy(copy_source)
This seems to avoid any 404 (NoSuchKey) errors on immediate read after write. My question is: will bucket.objects.filter always give me the newly written objects/keys, or can eventual consistency affect that as well? I'm asking because the HeadObject call (in the waitForObjectToBeAvaiableForRead function) sometimes returns a 404 error for a key that bucket.objects.filter returned. In other words, bucket.objects returns a key that is not yet available for read.
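As an aside, the waitForObjectToBeAvaiableForRead helper is not shown in the question; boto3 ships a built-in waiter that performs the same HeadObject-with-retries loop, so a sketch of such a helper could look like this (the delay and attempt counts are arbitrary):

import boto3

s3_client = boto3.client('s3')

def wait_until_readable(bucket_name, key, delay=1, max_attempts=20):
    # Polls HeadObject until the key is readable; raises
    # botocore.exceptions.WaiterError if it never appears within
    # roughly delay * max_attempts seconds.
    waiter = s3_client.get_waiter('object_exists')
    waiter.wait(
        Bucket=bucket_name,
        Key=key,
        WaiterConfig={'Delay': delay, 'MaxAttempts': max_attempts}
    )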
When you delete an object in S3, AWS writes a "delete marker" for the object (this assumes that the bucket is versioned). The object appears to be deleted, but that is a sort of illusion created by AWS.
So, if you are writing objects over previously-existing-but-now-deleted objects then you are actually updating an object which results in "eventual consistency" rather than "strong consistency."
Some helpful comments from AWS docs:
A delete marker is a placeholder (marker) for a versioned object that was named in a simple DELETE request. Because the object was in a versioning-enabled bucket, the object was not deleted. The delete marker, however, makes Amazon S3 behave as if it had been deleted.

If you try to get an object and its current version is a delete marker, Amazon S3 responds with:

A 404 (Object not found) error
A response header, x-amz-delete-marker: true
Specific Answers
My question is will the bucket.objects.filter give me the newly written objects/keys always?
Yes, newly written objects/keys will be included. Note that the underlying ListObjects API returns at most 1,000 keys per request; bucket.objects.filter paginates through the results for you, while direct API calls need a paginator.
Can eventual consistency affect that as well?
Eventual consistency affects the availability of the latest version of an object rather than the presence of the object in the filter results. The 404 errors come from reading newly written objects whose keys were previously deleted (a delete marker exists and full consistency has not yet been reached).
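As an illustration of that page size: if you call the list API directly instead of going through bucket.objects.filter, use a paginator so that listings beyond 1,000 keys are not truncated (the bucket name and prefix below are placeholders):

import boto3

s3_client = boto3.client('s3')

paginator = s3_client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='mybucket', Prefix='temp/'):
    # 'Contents' is absent on empty pages, hence the default.
    for item in page.get('Contents', []):
        print(item['Key'])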
I have just switched from carrierwave_backgrounder to carrierwave_direct. I have carrierwave_direct set up and functioning. That is, the main file is being uploaded and can be displayed in the view. However, my uploader versions are not being created.
Following is my job:
class ProcessReceiptJob < ApplicationJob
  queue_as :process_receipt

  def perform(expense_id, key)
    expense = Expense.find expense_id
    uploader = expense.receipt
    expense.key = key
    expense.remote_receipt_url = uploader.direct_fog_url(with_path: true)
    expense.save!
    # expense.recreate_versions!
  end

  after_perform do |job|
    expense = Expense.find(job.arguments.first)
    expense.update_column :receipt_processing, false
  end
end
When exactly does carrierwave_direct process the versions, or rather, when is carrierwave instructed to process them? I'm assuming that loading the original image via expense.remote_receipt_url and then calling save! triggers the uploader to process the versions. Is that correct?
In any case, my original image is being uploaded via a background job; however, the versions are not being created/uploaded.
Do I need to call recreate_versions! even though the versions don't exist yet? Do I need to explicitly process the versions after pointing to the source file, or should that be handled automagically?
I was not saving the model after assigning :key BEFORE sending it to the background worker. Instead, I was passing the key to the worker as an argument and saving the model while processing the job. That was the problem. The docs mention that you need to save the model after assigning :key in the success action.
So I had to call update_attributes(key: params[:key]) and THEN enqueue my background job (where, incidentally, the model is saved again).
What I want to do is set a GPIO pin on my Raspberry Pi whenever a file is added to or deleted from an S3 bucket. I currently have a Lambda function set to trigger whenever this occurs; the problem now is getting the function to set the flag. What I currently have in my Lambda function is below, but nothing is coming through on my device shadow. My end goal is to have a folder on the Pi stay in sync with the bucket whenever a file is added or deleted, without any user input or a cron job.
import json
import boto3
def lambda_handler(event, context):
    client = boto3.client('iot-data', region_name='us-west-2')
    # Change topic, qos and payload
    response = client.publish(
        topic='$aws/things/MyThing/shadow/update',
        qos=1,
        payload=json.dumps({"state": {"desired": {"switch": "on"}}})
    )
Go to the CloudWatch Logs for your Lambda function. What does it say there?
Since you are intending to update the shadow document, have you tried the update_thing_shadow function?
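For illustration, a minimal sketch of the handler rewritten around update_thing_shadow rather than a raw publish to the shadow topic; the thing name, region, and desired state are carried over from the question and should be treated as placeholders:

import json
import boto3

def lambda_handler(event, context):
    client = boto3.client('iot-data', region_name='us-west-2')
    # Update the device shadow document directly; the payload follows the
    # standard shadow format with a 'desired' state.
    response = client.update_thing_shadow(
        thingName='MyThing',
        payload=json.dumps({"state": {"desired": {"switch": "on"}}})
    )
    # The response payload is a streaming body containing the accepted state.
    return response['payload'].read().decode('utf-8')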
I'm trying to organize a large number of CloudWatch alarms for maintainability, and the web console grays out the name field on an edit. Is there another method (preferably something scriptable) for updating the name of CloudWatch alarms? I would prefer a solution that does not require any programming beyond simple executable scripts.
Here's a script we use to do this for the time being:
import sys
import boto
def rename_alarm(alarm_name, new_alarm_name):
    conn = boto.connect_cloudwatch()

    def get_alarm():
        alarms = conn.describe_alarms(alarm_names=[alarm_name])
        if not alarms:
            raise Exception("Alarm '%s' not found" % alarm_name)
        return alarms[0]

    alarm = get_alarm()

    # work around boto comparison serialization issue
    # https://github.com/boto/boto/issues/1311
    alarm.comparison = alarm._cmp_map.get(alarm.comparison)

    alarm.name = new_alarm_name
    conn.update_alarm(alarm)

    # update actually creates a new alarm because the name has changed, so
    # we have to manually delete the old one
    get_alarm().delete()

if __name__ == '__main__':
    alarm_name, new_alarm_name = sys.argv[1:3]
    rename_alarm(alarm_name, new_alarm_name)
It assumes you're either on an ec2 instance with a role that allows this, or you've got a ~/.boto file with your credentials. It's easy enough to manually add yours.
Unfortunately it looks like this is not currently possible.
I looked around for the same solution, but it seems neither the console nor the CloudWatch API provides a rename feature.
Note: you can, however, copy the existing alarm with the same parameters and save it under a new name.
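Along those lines, a minimal boto3 sketch of the copy-and-delete approach (as an alternative to the legacy boto library used in the script above); it copies only the core alarm fields, so check the describe_alarms output for any additional fields your alarms use, and the alarm names are placeholders:

import boto3

cloudwatch = boto3.client('cloudwatch')

def copy_alarm(old_name, new_name):
    # Fetch the existing alarm definition.
    alarm = cloudwatch.describe_alarms(AlarmNames=[old_name])['MetricAlarms'][0]
    # Re-create it under the new name, copying the core fields.
    cloudwatch.put_metric_alarm(
        AlarmName=new_name,
        MetricName=alarm['MetricName'],
        Namespace=alarm['Namespace'],
        Statistic=alarm['Statistic'],
        Dimensions=alarm['Dimensions'],
        Period=alarm['Period'],
        EvaluationPeriods=alarm['EvaluationPeriods'],
        Threshold=alarm['Threshold'],
        ComparisonOperator=alarm['ComparisonOperator'],
        AlarmActions=alarm.get('AlarmActions', []),
        OKActions=alarm.get('OKActions', []),
        InsufficientDataActions=alarm.get('InsufficientDataActions', [])
    )
    # Delete the old alarm once the copy exists.
    cloudwatch.delete_alarms(AlarmNames=[old_name])

copy_alarm('old-alarm-name', 'new-alarm-name')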