I am using Python boto3 to get all the objects in a bucket, but it returns only the keys, not the content.
I have a service that reads messages from SQS (a duplicate of each message is also stored in an S3 bucket) and performs some operations. I have lost some SQS messages because of failures and the 14-day SQS retention limit.
The files contain JSON data, with each file ranging from 4 to 8 KB.
Now I want to re-drive all the objects from S3 to SQS.
Is there a way to get the content of all the files and then transfer them to SQS?
Turning John Rotenstein's comment into an answer:
Is there a way to get the content of all the files and then transfer them to SQS?
No. You will have to write something yourself, probably in the same way that you stored the data in the first place. There is no automated method to move data from S3 to SQS.
I created AWS cross-account log data sharing with subscriptions, by following this link.
After creating the Kinesis stream, I created a Kinesis Data Firehose delivery stream to save the logs to an S3 bucket.
The log files are being created in the S3 bucket, but in an unreadable (apparently encrypted) form.
On the sender side there is no KMS key ID.
How can I see the logs?
I am also not able to decode them manually as Base64.
Update:
I found that the logs stored in the S3 bucket have the Content-Type application/octet-stream; they become readable when I change the content type to text/plain.
Is there any way to set the content type at the bucket level, or to configure it in the Kinesis data stream or Firehose? Is there a way to set a default content type for an S3 folder?
The data you posted appears to be compressed (I'd need a short sample file to say for certain). If I were you, I'd look into the compression settings for the log stream.
This page describes the compression options available for the data: https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html
In a pinch, if you save that data to a file and give it the ".gz" extension, does the file become readable? (I'm not too hopeful, since the default compression scheme is Snappy rather than GZIP, and, though I might be mistaken, I think I see a ZIP preamble after some kind of header in your screenshot.)
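Instead of guessing by file extension, you can sniff the leading "magic" bytes of a downloaded object locally; a small sketch (gzip, zip, and framed-Snappy signatures only; raw Hadoop-style Snappy has no magic bytes, so it shows up as "unknown"):

```python
import gzip

def sniff_compression(data: bytes) -> str:
    """Guess the compression format from the leading magic bytes.

    Purely local; no AWS calls needed.
    """
    if data[:2] == b"\x1f\x8b":
        return "gzip"
    if data[:4] == b"PK\x03\x04":
        return "zip"
    if data[:10] == b"\xff\x06\x00\x00sNaPpY":
        return "snappy-framed"
    return "unknown"  # could be raw Snappy, encryption, or plain binary

def try_gunzip(data: bytes):
    """Return the decompressed text if the data is gzip, else None."""
    try:
        return gzip.decompress(data).decode("utf-8")
    except (OSError, UnicodeDecodeError):
        return None
```

If `sniff_compression` reports "unknown" and `try_gunzip` returns None, the data is likely Snappy-compressed or genuinely encrypted rather than just mislabeled by Content-Type.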
I have read for a while about how to load my S3 data into Redshift: the COPY command, Glue, etc.
My pipeline is almost entirely in NiFi, something like:
extract_data -> insert into S3 -> execute a Lambda process to transform the data or enrich it using Athena, in 2 or 3 stages, into another S3 bucket (let's call it the processed bucket).
Now I want to continue this pipeline by loading the data from the processed bucket and inserting it into Redshift; I have an empty table created for this.
The idea is to load some tables incrementally, and for others to delete all the data loaded that day and reload it.
Can anyone give me a hint of where to start?
Thank you!
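As a starting point for the COPY route mentioned in the question, here's a minimal sketch that just builds the statement; the table, S3 prefix, and IAM role are placeholders:

```python
def build_copy_sql(table: str, s3_prefix: str, iam_role: str) -> str:
    """Assemble a Redshift COPY statement for JSON files under an S3 prefix."""
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS JSON 'auto';"
    )

# Placeholder names for illustration only:
daily_load = build_copy_sql(
    "my_table",
    "s3://processed-bucket/2021/01/01/",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
```

For the tables that are reloaded daily, run a TRUNCATE (or a DELETE for that day's rows) and then the COPY inside the same transaction; for the incremental tables, COPY alone appends the new rows.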
When data lands in your "processed bucket", you can fire a Lambda function that triggers a flow in Apache NiFi by calling an HTTP webhook. To expose such a webhook, use one of the following processors:
ListenHTTP
Starts an HTTP Server and listens on a given base path to transform
incoming requests into FlowFiles. The default URI of the Service will
be http://{hostname}:{port}/contentListener. Only HEAD and POST
requests are supported. GET, PUT, and DELETE will result in an error
and the HTTP response status code 405.
HandleHttpRequest
Starts an HTTP Server and listens for HTTP Requests. For each request,
creates a FlowFile and transfers to 'success'. This Processor is
designed to be used in conjunction with the HandleHttpResponse
Processor in order to create a Web Service
So the flow would be ListenHTTP -> FetchS3Object -> Process -> PutSQL (with a Redshift connection pool). The Lambda function would send a POST to my-nifi-instance.com:PORT/my-webhook (ListenHTTP accepts only HEAD and POST), and ListenHTTP would create a FlowFile for the incoming request.
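A minimal sketch of the Lambda side, assuming a placeholder webhook URL and a JSON payload of my own choosing (NiFi only needs enough information for FetchS3Object to locate the object):

```python
import json
import urllib.request

NIFI_WEBHOOK = "http://my-nifi-instance.com:8081/contentListener"  # placeholder

def build_payload(record: dict) -> bytes:
    """Extract bucket/key from one S3 event record for FetchS3Object."""
    return json.dumps({
        "bucket": record["s3"]["bucket"]["name"],
        "key": record["s3"]["object"]["key"],
    }).encode("utf-8")

def lambda_handler(event, context):
    # ListenHTTP only accepts HEAD and POST, so we POST each record.
    for record in event.get("Records", []):
        req = urllib.request.Request(
            NIFI_WEBHOOK,
            data=build_payload(record),
            method="POST",
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)  # becomes a FlowFile in NiFi
```

The payload shape is an assumption; anything your NiFi flow can parse (even the raw S3 event) works equally well.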
Wondering if this is recommended.
With a master Lambda, subscribe to an SNS topic where I publish a message containing a list of source/destination pairs for S3, say 100 per message.
The master Lambda will then loop through those pairs and invoke a worker Lambda for each item in the list, which will copy the S3 object from source to destination.
I was originally trying to use SQS, but SQS is not an event source for Lambda. CloudWatch Events are too murky as to how you pass actual data as a payload.
So wondering if my approach above is valid and will hold up, or if there's a better alternative.
Thanks
This question already has answers here:
Possible Duplicate: Notification of new S3 objects
Closed 10 years ago.
We've got an app that stores user data on S3. The part of our app that handles the uploads is decoupled from the part that processes the data. In some cases, the user will be able to upload data directly to S3 without going through our app at all (this may happen if they have their own S3 account and supply us with credentials).
Is it possible to get notified whenever the contents of an S3 bucket change? It would be cool if somehow a message could get sent that says "this file was added/updated/deleted: foo".
Short of that, is there some timestamp somewhere I could poll that would tell the last time the bucket was updated?
If I can't do either of these things, then the only alternative is to crawl the entire bucket and look for changes. This will be slow and expensive.
Update 2014-11:
As Alan Illing points out in the comments, AWS now supports notifications from S3 to SNS, which can be forwarded automatically to SQS: http://aws.amazon.com/blogs/aws/s3-event-notification/
S3 can also send notifications to AWS Lambda to run your own code directly.
Original response that predicted S3->SNS notifications:
If Amazon supported this, they would use SNS to send out notifications that an object has been added to a bucket. However, at the moment, the only bucket event supported by S3 and SNS is to notify you when Amazon S3 detects that it has lost all replicas of a Reduced Redundancy Storage (RRS) object and can no longer service requests for that object.
Here's the documentation on the SNS events supported by S3:
http://docs.amazonwebservices.com/AmazonS3/latest/dev/NotificationHowTo.html
Based on the way that the documentation is written, it looks like Amazon has ideas for other notification events to add (like perhaps your idea for finding out when new keys have been added).
Given that it isn't supported directly by Amazon, the S3 client that uploads the object to S3 will need to trigger the notification, or you will need to do some sort of polling.
Custom event notification for uploads to S3 could be done using SNS if you like to get near-real-time updates for processing, or it can be done through SQS if you like to let the notifications pile up and process them out of a queue at your own pace.
If you are polling, you can reduce the number of keys you need to request by having the client upload with a prefix of, say, "unprocessed/" followed by the unique key. Your polling software can then query only the S3 keys starting with that prefix. When an object is ready to be processed, the key can be changed to "processing/..." and later to "processed/..." or whatever. Note that a rename in S3 is actually a copy followed by a delete.
I have a scenario where many clients upload to S3.
What is the best approach to knowing that there is a new file?
Is it realistic/a good idea for me to poll the bucket every few seconds?
UPDATE:
Since November 2014, S3 supports the following event notifications:
s3:ObjectCreated:Put – An object was created by an HTTP PUT operation.
s3:ObjectCreated:Post – An object was created by HTTP POST operation.
s3:ObjectCreated:Copy – An object was created by an S3 copy operation.
s3:ObjectCreated:CompleteMultipartUpload – An object was created by the completion of an S3 multi-part upload.
s3:ObjectCreated:* – An object was created by one of the event types listed above or by a similar object creation event added in the future.
s3:ReducedRedundancyObjectLost – An S3 object stored with Reduced Redundancy has been lost.
These notifications can be issued to Amazon SNS, SQS or Lambda. Check out the blog post that's linked in Alan's answer for more information on these new notifications.
Original Answer:
Although Amazon S3 has a bucket notification system in place, it does not support notifications for anything but the s3:ReducedRedundancyLostObject event (see the GET Bucket notification section in their API).
Currently the only way to check for new objects is to poll the bucket at a preset time interval or build your own notification logic in the upload clients (possibly based on Amazon SNS).
Push notifications are now built into S3:
http://aws.amazon.com/blogs/aws/s3-event-notification/
You can send notifications to SQS or SNS when an object is created via PUT or POST or a multi-part upload is finished.
Your best option nowadays is the AWS Lambda service. You can write a Lambda function in Node.js (JavaScript), Java, or Python (more options will probably be added over time).
The Lambda service lets you write functions that respond to S3 events such as file uploads. Cost-effective, scalable, and easy to use.
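A minimal sketch of such a handler for direct S3 triggers (the processing step is a placeholder print):

```python
import urllib.parse

def parse_s3_records(event: dict) -> list:
    """Return (bucket, key) pairs from an S3-triggered Lambda event.

    Keys arrive URL-encoded in the event (a space becomes '+').
    """
    out = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        out.append((bucket, key))
    return out

def lambda_handler(event, context):
    for bucket, key in parse_s3_records(event):
        print(f"New object: s3://{bucket}/{key}")  # replace with your processing
    return {"processed": len(event.get("Records", []))}
```

The URL-decoding step matters: without it, any object whose key contains spaces or special characters will fail a subsequent `get_object` lookup.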
You can implement a pub-sub mechanism relatively simply by using SNS, SQS, and AWS Lambda; please see the steps below. Whenever a new file is added to the bucket, a notification is raised and acted upon automatically.
Please see attached diagram explaining the basic pub-sub mechanism
Step 1
Simply configure the S3 bucket event notification to notify an SNS topic. You can do this from the S3 console (Properties tab)
Step 2
Make an SQS Queue subscribed to this topic. So whenever an object is uploaded to the S3 bucket a message will be added to the queue.
Step 3
Create an AWS Lambda function to read messages from the SQS queue. AWS Lambda supports SQS events as a trigger, so whenever a message appears in the queue, Lambda is triggered and reads it. Once a message is successfully processed, it is automatically deleted from the queue. Messages that Lambda cannot process (erroneous messages) are not deleted and will pile up in the queue; to prevent this, using a Dead Letter Queue (DLQ) is a good idea.
In your Lambda function, add your logic to handle what to do when users upload files to the bucket
Note: a DLQ is nothing more than a normal queue.
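Steps 1-3 can be sketched as a single Lambda handler. Note the double unwrapping: because the queue is subscribed to an SNS topic, each SQS record's body carries an SNS envelope whose Message field holds the actual S3 event:

```python
import json

def extract_s3_keys(sqs_event: dict) -> list:
    """Unwrap SQS body -> SNS envelope -> S3 event records."""
    keys = []
    for sqs_record in sqs_event.get("Records", []):
        envelope = json.loads(sqs_record["body"])      # SNS envelope
        s3_event = json.loads(envelope["Message"])     # actual S3 event
        for rec in s3_event.get("Records", []):
            keys.append((rec["s3"]["bucket"]["name"],
                         rec["s3"]["object"]["key"]))
    return keys

def lambda_handler(event, context):
    for bucket, key in extract_s3_keys(event):
        print(f"Handle upload s3://{bucket}/{key}")  # your logic here
    # Raising an exception instead of returning would leave the messages
    # on the queue; after maxReceiveCount failed receives, SQS moves
    # them to the configured DLQ.
```

If you enable raw message delivery on the SNS subscription, the envelope disappears and the SQS body is the S3 event directly, so only one `json.loads` is needed.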
Step 4
Debugging and analyzing the process
Make use of AWS CloudWatch to log details. Each Lambda function writes logs under its own log group; this is a good place to check if something went wrong.