Automatic sync S3 to local - amazon-s3

I want to auto-sync my local folder with an S3 bucket. I mean, when a file changes in S3, that file should automatically update in the local folder.
I tried using a scheduled task and the AWS CLI, but I think there is a better way to do that.
Do you know of an app or a better solution?
Hope you can help me.

#mgg, you can mount the S3 bucket on the local server using s3fs; this way you can sync your local changes to the S3 bucket.

You could execute code (Lambda functions) that responds to events in a given bucket (such as a file being created, changed, or deleted). That way you could have a simple HTTP service that receives a POST or GET request from the Lambda and updates your local data accordingly.
Read more:
Tutorial, Using AWS Lambda with Amazon S3
Working with Lambda Functions
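As a rough illustration, here is a minimal sketch of such a Lambda handler in Python; the endpoint URL is a placeholder for whatever publicly reachable service you run to update the local folder.

import json
import urllib.request

ENDPOINT = "https://example.com/s3-sync"  # placeholder for your own service

def handler(event, context):
    # Each record describes one S3 event (ObjectCreated:*, ObjectRemoved:*, ...)
    for record in event.get("Records", []):
        payload = {
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
            "eventName": record["eventName"],
        }
        req = urllib.request.Request(
            ENDPOINT,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        urllib.request.urlopen(req, timeout=10)
    return {"status": "ok"}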
The other approach (I don't recommend this) is to have some code "polling" for changes in the bucket and then reflecting those changes locally. At first glance it looks easier to implement, but it gets complicated when you try to handle more than just creation events.
And of course, on each cycle your polling component has to check all elements in your local directory against all elements in the bucket, which is a performance-killing approach.

Related

Read bucket object in CDK

In Terraform, to read an object from an S3 bucket at deployment time, I can use a data source:
data "aws_s3_bucket_object" "example" { }
Is there a similar concept in CDK? I've seen various methods of uploading assets to s3, as well as importing an existing bucket, but not getting an object from the bucket. I need to read a configuration file from the bucket that will affect further deployment.
It's important to remember that CDK itself is not a deployment tool. It can deploy, but the code you write in a CDK stack is the definition of your resources, not a method of deployment.
So, you can do one of a few things.
Use the SDK for your language to make a call to the S3 bucket and load the data directly. This is perfectly acceptable and a well-understood way to gather information you need before deployment: each time the stack synthesizes (which it does before every cdk deploy), that code will run and pull your data.
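A minimal sketch of this first option, assuming boto3 and placeholder bucket/key names, reading the configuration before the resources are defined:

import json
import boto3

s3 = boto3.client("s3")
# Placeholder names; this runs at synth time, before anything is deployed.
obj = s3.get_object(Bucket="my-config-bucket", Key="config.json")
config = json.loads(obj["Body"].read())
# ...use values from `config` while defining your constructs...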
Use CodePipeline to set up a proper pipeline, and give it two sources: one your version control repo and the second your S3 bucket:
https://docs.aws.amazon.com/codebuild/latest/userguide/sample-multi-in-out.html
The preferred way: drop the JSON file and use Parameter Store instead. CDK contains modules that will create a token version of the parameter at synth time, and when it deploys it will resolve that reference against the Systems Manager Parameter Store:
https://docs.aws.amazon.com/cdk/v2/guide/get_ssm_value.html
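A sketch of that preferred option in a CDK v2 Python stack; the parameter name here is purely illustrative:

from aws_cdk import Stack, aws_ssm as ssm
from constructs import Construct

class MyStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # A token at synth time, resolved against Parameter Store at deploy time.
        config_value = ssm.StringParameter.value_for_string_parameter(
            self, "/myapp/config/setting"
        )
        # ...use config_value when defining other resources...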
If your parameters change after deployment, you can handle that as part of your CDK stack pretty easily (using CloudFormation outputs). If they change in the middle of a deployment, you really need to be using CodePipeline to manage those steps instead of just CDK.
Because remember: the cdk deploy command is just a convenience. It will execute everything and has no way to pause in the middle and execute specific steps (other than very basic "this resource depends on that resource" ordering).

Reading file inside S3 from EC2 instance

I would like to use AWS Data Pipeline to start an EC2 instance and then run a Python script that is stored in S3.
Is it possible? I would like to make a single ETL step using a Python script.
Is it the best way?
Yes, it is possible and relatively straightforward using a ShellCommandActivity.
From the details you have provided so far, it seems to be the best way, as Data Pipeline provisions the EC2 instance for you on demand and shuts it down afterwards.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-shellcommandactivity.html
There is also a tutorial that you can follow to get acclimated to the ShellCommandActivity of Data Pipeline.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-gettingstartedshell.html
Yes, you can upload and back up your data directly to S3:
http://awssolution.blogspot.in/2015/10/how-to-backup-share-and-organize-data.html

Event-driven Elastic Transcoder?

Is there a way to setup a transcoding pipeline on AWS such that it automatically transcodes any new files uploaded to a particular S3 bucket, and places them in another bucket?
I know there is a REST API, and that in theory the uploader could also issue a REST request to the transcoder after it has uploaded the file, but for a variety of reasons, this isn't really an option.
This can now be accomplished using AWS Lambda.
Lambda basically allows you to trigger/run scripts based on events. You could easily create a Lambda script that runs as soon as a new file is uploaded to a designated S3 bucket. The script would then start a transcoding job for that newly uploaded video file.
This is literally one of the example use cases provided in the Lambda documentation.
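A minimal sketch of such a handler in Python with boto3; the pipeline ID and preset ID are placeholders you would replace with your own:

import urllib.parse
import boto3

transcoder = boto3.client("elastictranscoder")
PIPELINE_ID = "1111111111111-abcde1"  # placeholder: your Elastic Transcoder pipeline
PRESET_ID = "1351620000001-000010"    # placeholder: e.g. a system preset

def handler(event, context):
    for record in event["Records"]:
        # Object keys in S3 event notifications arrive URL-encoded.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        transcoder.create_job(
            PipelineId=PIPELINE_ID,
            Input={"Key": key},
            Outputs=[{"Key": "transcoded/" + key, "PresetId": PRESET_ID}],
        )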

Using data present in S3 inside EMR mappers

I need to access some data during the map stage. It is a static file, from which I need to read some data.
I have uploaded the data file to S3.
How can I access that data while running my job in EMR?
If I just specify the file path as:
s3n://<bucket-name>/path
in the code, will that work?
Thanks
The s3n:// URL scheme is for Hadoop to read S3 files. If you want to read an S3 file in your map program, you either need to use a library that handles the s3:// URL format, such as JetS3t (https://jets3t.s3.amazonaws.com/toolkit/toolkit.html), or access the S3 object via HTTP.
A quick search for an example program brought up this link.
https://gist.github.com/lucastex/917988
You can also access the S3 object over HTTP or HTTPS. This may require making the object public or configuring additional security. You can then access it using the HTTP/URL classes Java supports natively.
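For illustration, a quick Python sketch of the plain-HTTPS approach, assuming the object has been made public (the URL is a placeholder):

import urllib.request

# Virtual-hosted-style URL for a public object; replace bucket and key with your own.
url = "https://my-bucket.s3.amazonaws.com/path/to/data.txt"
with urllib.request.urlopen(url) as resp:
    data = resp.read().decode("utf-8")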
Another good option is to use S3DistCp as a bootstrap step to copy the S3 file to HDFS before your map step starts. http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
What I ended up doing:
1) Wrote a small script that copies my file from s3 to the cluster
hadoop fs -copyToLocal s3n://$SOURCE_S3_BUCKET/path/file.txt $DESTINATION_DIR_ON_HOST
2) Created a bootstrap step for my EMR job that runs the script from 1).
This approach doesn't require making the S3 data public.

Allowing users to download files as a batch from AWS s3 or Cloudfront

I have a website that allows users to search for music tracks and download the ones they select as MP3s.
I have the site on my server and all of the MP3s on S3, distributed via CloudFront. So far so good.
The client now wishes for users to be able to select a number of music tracks and then download them all in bulk, as a batch, instead of one at a time.
Usually I would place all the files in a zip and then present the user with a link to that new zip file to download. In this case, as the files are on S3, that would require me to first copy all the files from S3 to my web server, process them into a zip, and then serve the download from my server.
Is there any way I can create a zip on S3 or CloudFront, or is there some way to batch/group files into a zip?
Maybe I could set up an EC2 instance to handle this?
I would greatly appreciate some direction.
Best
Joe
I am afraid you won't be able to create the batches without additional processing. Firing up an EC2 instance might be an option to create a batch per user.
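To sketch what that additional processing could look like (in Python with boto3; the bucket name, keys, and output key are placeholders): pull the selected objects, zip them, and put the zip back on S3 so the user gets a single download link.

import io
import zipfile
import boto3

s3 = boto3.client("s3")
BUCKET = "my-music-bucket"                     # placeholder
keys = ["tracks/one.mp3", "tracks/two.mp3"]    # placeholder: the user's selection

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    for key in keys:
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        zf.writestr(key.split("/")[-1], body)

# Upload the finished zip so the user downloads one file.
buffer.seek(0)
s3.put_object(Bucket=BUCKET, Key="batches/user-123.zip", Body=buffer.getvalue())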
I am facing the exact same problem. So far the only thing I was able to find is the AWS CLI's s3 sync command:
https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
In my case, I am using Rails + its Paperclip addon, which means that I have no way to easily download all of a user's images in one go, because the files are scattered across a lot of subdirectories.
However, if you can group your user's files in a better way, say like this:
/users/<ID>/images/...
/users/<ID>/songs/...
...etc., then you can solve your problem right away with:
aws s3 sync s3://<your_bucket_name>/users/<user_id>/songs /cache/<user_id>
Do bear in mind you'll have to give your server the proper credentials so the S3 CLI tools can work without prompting for them.
And that should sort you out.
Additional discussion here:
Downloading an entire S3 bucket?
S3 works with one HTTP request per object.
So the answer is to use threads to achieve the same thing.
The Java API uses TransferManager for this:
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/transfer/TransferManager.html
You can get great performance with multiple threads.
There is no bulk download, sorry.
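A rough sketch of the threaded approach in Python with boto3 (TransferManager does something similar on the Java side); the bucket and keys are placeholders:

import concurrent.futures
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"                        # placeholder
keys = ["songs/a.mp3", "songs/b.mp3"]       # placeholder: the selected objects

def download(key):
    local_path = key.replace("/", "_")
    s3.download_file(BUCKET, key, local_path)
    return local_path

# One thread per in-flight request; each object is still a single HTTP GET.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for path in pool.map(download, keys):
        print("downloaded", path)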