Get notified when user uploads to an S3 bucket? [duplicate]

Possible Duplicate:
Notification of new S3 objects
We've got an app that stores user data on S3. The part of our app that handles the uploads is decoupled from the part that processes the data. In some cases, the user will be able to upload data directly to S3 without going through our app at all (this may happen if they have their own S3 account and supply us with credentials).
Is it possible to get notified whenever the contents of an S3 bucket change? It would be cool if somehow a message could get sent that says "this file was added/updated/deleted: foo".
Short of that, is there some timestamp somewhere I could poll that would tell the last time the bucket was updated?
If I can't do either of these things, then the only alternative is to crawl the entire bucket and look for changes. This will be slow and expensive.

Update 2014-11:
As Alan Illing points out in the comments, AWS now supports notifications from S3 to SNS, which can be forwarded automatically to SQS: http://aws.amazon.com/blogs/aws/s3-event-notification/
S3 can also send notifications to AWS Lambda to run your own code directly.
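For example, with boto3 you could wire a bucket up to an SNS topic roughly like this (the bucket name and topic ARN are placeholders; the topic's access policy must also allow S3 to publish to it):

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket name and SNS topic ARN -- replace with your own.
    s3.put_bucket_notification_configuration(
        Bucket="my-upload-bucket",
        NotificationConfiguration={
            "TopicConfigurations": [
                {
                    "TopicArn": "arn:aws:sns:us-east-1:123456789012:s3-uploads",
                    "Events": ["s3:ObjectCreated:*"],
                }
            ]
        },
    )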
Original response that predicted S3->SNS notifications:
If Amazon supported this, they would use SNS to send out notifications that an object has been added to a bucket. However, at the moment, the only bucket event supported by S3 and SNS is to notify you when Amazon S3 detects that it has lost all replicas of a Reduced Redundancy Storage (RRS) object and can no longer service requests for that object.
Here's the documentation on the SNS events supported by S3:
http://docs.amazonwebservices.com/AmazonS3/latest/dev/NotificationHowTo.html
Based on the way that the documentation is written, it looks like Amazon has ideas for other notification events to add (like perhaps your idea for finding out when new keys have been added).
Given that it isn't supported directly by Amazon, the S3 client that uploads the object to S3 will need to trigger the notification, or you will need to do some sort of polling.
Custom event notification for uploads to S3 could be done using SNS if you like to get near-real-time updates for processing, or it can be done through SQS if you like to let the notifications pile up and process them out of a queue at your own pace.
If you are polling, you could reduce the number of keys you need to request by having the client upload with a prefix of, say, "unprocessed/..." followed by the unique key. Your polling software can then query just S3 keys starting with that prefix. When it is ready to process, it could change the key to "processing/..." and then later to "processed/..." or whatever. Note that a "rename" in S3 is currently implemented as a copy followed by a delete.
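A rough polling-and-rename sketch with boto3, using the prefixes from above (the bucket name is a placeholder):

    import boto3

    s3 = boto3.client("s3")
    bucket = "my-upload-bucket"  # hypothetical bucket name

    # Poll only the keys that still need work.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix="unprocessed/"):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            new_key = "processing/" + key[len("unprocessed/"):]
            # S3 has no rename; "move" the object with a copy followed by a delete.
            s3.copy_object(Bucket=bucket, Key=new_key,
                           CopySource={"Bucket": bucket, "Key": key})
            s3.delete_object(Bucket=bucket, Key=key)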

Related

Monitor S3 - Send an alert if more than 5 minutes have passed since a last file was written

I have a program that uploads files to S3 every 5 minutes.
I need to monitor it. So I want to check every 10 minutes when the last file was uploaded, and if it was more than X minutes ago, send an alert (email) about it.
I understand that I need to use CloudWatch and Lambda. But I don't know how to do it.
Any help, please.
The following AWS products should help you build this:
AWS EventBridge (formerly known as CloudWatch Events)
AWS Lambda
AWS SES
Solution outline:
1. Create your Lambda function.
2. Create a scheduled event rule in EventBridge.
3. When creating the rule, use a rate of 10 minutes.
4. Set your Lambda from step 1 as the target of your rule.
5. When your Lambda is triggered, run your business logic to check when the last file was uploaded.
6. If you need to send an email, you can use AWS SES to send it to your recipients.
Important:
You need to allow AWS EventBridge to call your Lambda. If you do all of this in the AWS console, the required permissions should be set automatically. If you use CloudFormation, Terraform or SAM, you probably need to add those permissions to your Lambda.
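For illustration, the check-and-alert logic (steps 5 and 6) could look roughly like this in Python. The bucket name, email addresses, and threshold are placeholders; the sender address must be a verified SES identity, and listing the whole bucket every run is only practical for small buckets (otherwise restrict it to a prefix):

    import datetime
    import boto3

    s3 = boto3.client("s3")
    ses = boto3.client("ses")

    BUCKET = "my-upload-bucket"        # hypothetical names; adjust for your setup
    ALERT_FROM = "alerts@example.com"  # must be a verified SES identity
    ALERT_TO = "ops@example.com"
    THRESHOLD = datetime.timedelta(minutes=10)

    def handler(event, context):
        # Find the most recent LastModified timestamp in the bucket.
        newest = None
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET):
            for obj in page.get("Contents", []):
                if newest is None or obj["LastModified"] > newest:
                    newest = obj["LastModified"]

        now = datetime.datetime.now(datetime.timezone.utc)
        if newest is None or now - newest > THRESHOLD:
            ses.send_email(
                Source=ALERT_FROM,
                Destination={"ToAddresses": [ALERT_TO]},
                Message={
                    "Subject": {"Data": "S3 uploads have stopped"},
                    "Body": {"Text": {"Data": f"Last upload: {newest}"}},
                },
            )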

Microservices + CQRS implementation

I am working on implementing a microservice architecture using the CQRS pattern. I have a working implementation using API Gateway, Lambda and DynamoDB with one exception - the event sourcing.
Event Sourcing has the applications publishing a notification to an event stream that other services in the platform can consume. This notification represents an event that took place as part of the originating HTTP request. For instance, if the user makes an HTTP POST with a complete "check patient into hospital" model, then the Lambda will break that apart and publish multiple events in sequential order.
Patient Checked in (includes Patient Id, hospital id + visit id)
Room Assigned (includes room number, + visit id)
Patient tested (includes tested + visit id)
Patient checked-out (visit id)
The intent for this pattern is to provide an audit trail of all events that took place while the patient was in the hospital. This example (not what I'm actually building) would be stored in an event source that can be replayed at any time. If the VisitId was deleted across all services we could just replay the events one at a time, in order, and reproduce an exact copy of the original record. You consider all records immutable to achieve this. Each POST would push into the event source and then land in the database that would pull the data out during a HTTP GET request. It would also have subscribers that would take pieces of this data and do other things - such as a "Visit Survey" service that would listen to the Patient Checked Out event and prep a post-op survey.
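To make that concrete, the publishing side of my Lambda looks roughly like this (illustrative only; the topic ARN and payload fields here are made up):

    import json
    import boto3

    sns = boto3.client("sns")
    TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:visit-events"  # hypothetical

    def publish_visit_events(visit):
        # Break the single "check patient into hospital" request into ordered events.
        events = [
            {"type": "PatientCheckedIn", "patientId": visit["patientId"],
             "hospitalId": visit["hospitalId"], "visitId": visit["visitId"]},
            {"type": "RoomAssigned", "roomNumber": visit["roomNumber"],
             "visitId": visit["visitId"]},
            {"type": "PatientTested", "test": visit["test"],
             "visitId": visit["visitId"]},
            {"type": "PatientCheckedOut", "visitId": visit["visitId"]},
        ]
        for event in events:
            sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps(event))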
I've looked at several AWS services to provide this. I know about Kinesis Data Streams but I don't like the pricing structure nor do I want to deal with shards (no autoscaling). Since my entire platform is built on consumption based pricing (Dynamo, Lambda etc) I want to keep my event source the same way. This makes it easier for me to estimate a per-user cost as I just do math based on estimated requests per month, per user.
I've been using SNS for the stream itself, delivering the notifications, and it's been great. It's super fast and I haven't had any major issues while developing it. The issue, though, is that it is not suitable for a replay store - only for delivery of the event messages. For a replay store I thought Kinesis Firehose made a lot of sense... send to S3 + SNS at the same time. It turns out SNS isn't an available Firehose delivery destination. I can Put to S3 myself and then publish to SNS, but that seems like duplicate work in the code base when I can set up an S3 trigger to fire a Lambda and just have another small Lambda that reacts to the event landing in S3 and does the insert into DynamoDB. I've seen that this can be much slower, though, than just publishing through SNS. I'm also not sure about retry policies on the Put event. It does simplify retries, though, as I can just re-use the code in the triggered Lambda to replay all events in a bucket path.
I could just PutObject and then Publish to SNS within the same HTTP POST Lambda. If the SNS Publish fails, though, then I have an object in S3 that was never published. I'd have to write a different Lambda to handle the fixing and publishing. Not the end of the world - either way I have two Lambdas to deploy. I'm just not sure which way makes more sense in this pattern with AWS services.
Has anyone done something similar and have any recommendations? Am I working my way into a technical hole that will be difficult to manage later? I'm open to other paths as well if I can keep it to a consumption based pricing model. Thanks!
Event Sourcing has the applications publishing a notification to an event stream that other services in the platform can consume.
You'll want to be a little bit careful here -- there are at least two different definitions of "event sourcing" running around.
If you care about event sourcing, in the sense usually coupled with CQRS (Greg Young, et al), then your events are your book of record. The important complication this introduces is that your service needs to be able to lock the "event stream" when making changes to it (without that lock, you run into "lost edit" scenarios and have to clean up the mess).
So the "pointer to your current changes" needs to live in something that has transactions. DynamoDB should be fine for this (based on my memory of the event sourcing break out room at re:Invent 2017). In theory, you could have the lock in dynamo, which contains a pointer to an immutable document stored in S3. I haven't been able to persuade myself that the trade offs justify the complexity, but as best I can tell there's nothing in that architecture that violates physics and causality.
If your operations team isn't happy with Dynamo, another reasonable option is RDS; choose your preferred relational data engine, deploy an event storage schema to it, and off you go.
As for the pub sub part, I believe you to be on the right track with SNS. It's the right choice for "fanning out" messages from a publisher to multiple consumers. Yes, it doesn't support replay, but that's fine -- replay can happen by pulling events from the book of record. See the later parts of Greg Young's Polyglot Data talk. Yes, sometimes you will get messages on both the push channel and the pull channel, but that's fine; you already signed up for idempotent message handling when you decided a distributed architecture was a good idea.
Edit
Why the need to store a pointer in DynamoDB?
Because S3 doesn't offer you any locking, which means that on the unhappy path, where two copies of your logic are trying to write different versions of your data, you end up a victim of the lost edit problem.
You could manage the situation with optimistic locking - something analogous to HTTP's conditional PUT; but S3 (last time I checked) doesn't support conditional modification.
You could use S3 as an object store for immutable documents, but now you need some mechanism to determine which document in S3 is the "current" one. If you try to implement that in S3, you run into the same lost edit problem all over again.
So you need a different tool to handle that part of the problem; some tool that is suitable for "state succession". So DynamoDB fits there.
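For concreteness, a sketch of the kind of conditional write I mean, using boto3 (the table and attribute names are invented for the example):

    import boto3

    dynamodb = boto3.client("dynamodb")

    def advance_stream_pointer(stream_id, expected_version, new_version, new_head_key):
        # Hypothetical table "StreamPointers": one item per stream, holding the
        # current version and a pointer (e.g. an S3 key) to the head document.
        # The ConditionExpression makes this an optimistic lock: if another
        # writer advanced the stream first, this update fails instead of
        # silently losing their edit.
        dynamodb.update_item(
            TableName="StreamPointers",
            Key={"streamId": {"S": stream_id}},
            UpdateExpression="SET version = :new, headKey = :head",
            ConditionExpression="version = :expected",
            ExpressionAttributeValues={
                ":new": {"N": str(new_version)},
                ":head": {"S": new_head_key},
                ":expected": {"N": str(expected_version)},
            },
        )  # raises ConditionalCheckFailedException on a concurrent update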
If you are using DynamoDB for locking, can you also use it for event storage? I don't have enough laps to feel confident that I know the answer there. For small problems, I'm mostly confident that the answer is yes. For large problems...?
Possibly useful discussions:
Rich Hickey; The Language of the System
Kenneth Truyers; Git as a NoSql Database

Download all objects from S3 Bucket and send content to SQS

I am using Python boto3 to get all the objects in a bucket, but it returns the keys and not the content.
I have a service which reads messages from SQS (a duplicate of each message is also present in an S3 bucket) and does some operations. I have lost some SQS messages because of some failures and the SQS 14-day retention policy.
The files contain JSON data, with each file ranging from 4-8 KB.
Now I want to re-drive all the objects from S3 to SQS.
Is there a way to get the content of all files and then transfer them to SQS?
Turning John Rotenstein's comment into an answer:
Is there a way to get the content of all files and then transfer them to SQS?
No. You will have to write something yourself, probably in the same way that you stored it in the first place. There is no automated method to move data from S3 to SQS.
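A rough boto3 sketch of that re-drive, assuming each file is small enough (4-8 KB of JSON) to fit in a single SQS message; the bucket name and queue URL are placeholders:

    import boto3

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")

    BUCKET = "my-backup-bucket"   # hypothetical names
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            # Read each object's content and re-publish it as an SQS message
            # (the SQS message size limit is 256 KB).
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body.decode("utf-8"))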

How can I get notification about new S3 objects?

I have a scenario where we have many clients uploading to s3.
What is the best approach to knowing that there is a new file?
Is it realistic/a good idea for me to poll the bucket every few seconds?
UPDATE:
Since November 2014, S3 supports the following event notifications:
s3:ObjectCreated:Put – An object was created by an HTTP PUT operation.
s3:ObjectCreated:Post – An object was created by an HTTP POST operation.
s3:ObjectCreated:Copy – An object was created by an S3 copy operation.
s3:ObjectCreated:CompleteMultipartUpload – An object was created by the completion of an S3 multipart upload.
s3:ObjectCreated:* – An object was created by one of the event types listed above or by a similar object creation event added in the future.
s3:ReducedRedundancyObjectLost – An S3 object stored with Reduced Redundancy has been lost.
These notifications can be issued to Amazon SNS, SQS or Lambda. Check out the blog post that's linked in Alan's answer for more information on these new notifications.
Original Answer:
Although Amazon S3 has a bucket notification system in place, it does not support notifications for anything but the s3:ReducedRedundancyLostObject event (see the GET Bucket notification section in their API).
Currently the only way to check for new objects is to poll the bucket at a preset time interval or build your own notification logic in the upload clients (possibly based on Amazon SNS).
Push notifications are now built into S3:
http://aws.amazon.com/blogs/aws/s3-event-notification/
You can send notifications to SQS or SNS when an object is created via PUT or POST, or when a multipart upload is finished.
Your best option nowadays is using the AWS Lambda service. You can write a Lambda using Node.js (JavaScript), Java, or Python (more options will probably be added over time).
The Lambda service allows you to write functions that respond to events from S3, such as file uploads. It's cost-effective, scalable, and easy to use.
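A minimal handler could look like this in Python (illustrative only; your own processing logic goes where the print is):

    import urllib.parse

    def handler(event, context):
        # S3 invokes the function with one or more records describing the
        # objects that were created.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in the event payload.
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            print(f"New object: s3://{bucket}/{key}")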
You can implement a pub-sub mechanism relatively simply by using SNS, SQS, and AWS Lambda. See the steps below. Whenever a new file is added to the bucket, a notification can be raised and acted upon automatically.
Step 1
Simply configure the S3 bucket event notification to notify an SNS topic. You can do this from the S3 console (Properties tab).
Step 2
Make an SQS Queue subscribed to this topic. So whenever an object is uploaded to the S3 bucket a message will be added to the queue.
Step 3
Create an AWS Lambda function to read messages from the SQS queue. AWS Lambda supports SQS events as a trigger, so whenever a message appears in the SQS queue, Lambda will trigger and read the message. Once a message is successfully processed, it will be automatically deleted from the queue. Messages that can't be processed by Lambda (erroneous messages) will not be deleted, so they will pile up in the queue. To prevent this behavior, using a Dead Letter Queue (DLQ) is a good idea.
In your Lambda function, add your logic to handle what to do when users upload files to the bucket
Note: DLQ is nothing more than a normal queue.
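For illustration, the Lambda handler in step 3 could look roughly like this (assuming raw message delivery is not enabled on the SNS subscription, so each SQS body is the SNS envelope wrapping the S3 event):

    import json

    def handler(event, context):
        # Lambda receives a batch of SQS records. Each record body is the SNS
        # envelope, and its "Message" field holds the original S3 event.
        for sqs_record in event["Records"]:
            sns_envelope = json.loads(sqs_record["body"])
            s3_event = json.loads(sns_envelope["Message"])
            for s3_record in s3_event.get("Records", []):
                bucket = s3_record["s3"]["bucket"]["name"]
                key = s3_record["s3"]["object"]["key"]
                print(f"Uploaded: s3://{bucket}/{key}")
        # Returning normally lets Lambda delete the batch from the queue;
        # raising an exception leaves the messages for retry and, eventually,
        # the DLQ.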
Step 4
Debugging and analyzing the process
Make use of Amazon CloudWatch to log details. Each Lambda function creates logs under a log group. This is a good place to check if something went wrong.

Can someone explain to me what Amazon Web Services components are used in a normal web service?

The web service that I want to run on AWS has to store and retrieve user data, present it to the user via a website, and needs to be able to parse the sitemaps of a few thousand sites every 10 min or so. Which components of AWS, such as S3, EC2, and CloudFront, do I need to use? A short synopsis of the purpose of each component would be nice. :)
I particularly do not understand the purpose of the Simple Queue Service.
You might, for example, use EC2 (an on-demand, scalable VPS) to host the actual application and S3 (networked storage) to store the data. You would probably not need CloudFront (geographically optimized content mirroring).
We use SQS (Simple Queue Service) to queue tasks we want performed asynchronously, i.e. without making the user wait for them to complete. As it turns out SQS becomes incredibly expensive if your site has even modest traffic, so we'll be handling queueing on one of our own boxes soon.
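To illustrate the idea, here is a minimal boto3 sketch of queueing a task and a worker pulling it back off later (the queue URL and message shape are made up):

    import json
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/tasks"  # hypothetical

    # Web tier: enqueue the work and return to the user immediately.
    sqs.send_message(QueueUrl=QUEUE_URL,
                     MessageBody=json.dumps({"task": "parse_sitemap",
                                             "url": "https://example.com/sitemap.xml"}))

    # Worker: pull tasks off the queue at its own pace.
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        task = json.loads(msg["Body"])
        # ... perform the work, then remove the message from the queue ...
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])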
Another service you might want to look at is the Elastic Block Store (EBS), which provides persistent storage for an EC2 instance. The default storage that you get with an instance is not persisted if you shut down the instance, so I'd recommend storing all your critical data on EBS so that you can recover quickly if an instance goes down.
SimpleDB might also be useful for your service.
Have a look at the Wikipedia entry for AWS to learn more about each service.