Copying files in S3 is slow from the CLI

When I do aws s3 sync $PREFIX/some-dir $PREFIX/other-dir the transfer is quite slow. It appears limited by my network bandwidth.
However, when I do a copy operation through the S3 web console, it is much faster.
It seems like awscli is using my local machine as an intermediary for copying files, even though the source and destination are in the same bucket. Is there a way to make awscli as fast as the web console?
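(For comparison, a server-side copy can be expressed directly against the S3 API. The sketch below uses boto3 rather than the CLI, with a made-up bucket and keys; the CopyObject call runs inside S3, so the object's bytes are not routed through the client machine.)

import boto3  # assumes boto3 is installed and AWS credentials are configured

s3 = boto3.client("s3")

# Illustrative bucket/keys; the real names come from $PREFIX above.
bucket = "my-bucket"
s3.copy_object(
    Bucket=bucket,
    Key="other-dir/file.dat",
    CopySource={"Bucket": bucket, "Key": "some-dir/file.dat"},
)
# For objects over 5 GB, boto3's managed s3.copy() performs a multipart copy instead.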

Related

Heroku to AWS S3 bucket connection time

I'm moving my backend to Heroku, and it recommends storing files on S3, so I'm a bit concerned about performance. Each request to my API will fetch files (small ones), and I need this to be as fast as possible.

Airflow best practice: s3_to_sftp_operator instead of running aws cli?

What would be the best solution to transfer files between S3 and an EC2 instance using Airflow?
After some research I found there is an s3_to_sftp_operator, but I know it's good practice to execute tasks on the external systems instead of on the Airflow instance...
I'm thinking about running a BashOperator that executes the AWS CLI on the remote EC2 instance, since that respects the principle above.
Do you have any production best practices to share for this case?
The s3_to_sftp_operator is going to be the better choice unless the files are large. Only if the files are large would I consider a BashOperator with SSH onto a remote machine. As for what "large" means, I would just test with the s3_to_sftp_operator, and if the performance of everything else on Airflow isn't meaningfully impacted, stay with it. I'm regularly downloading and opening ~1 GiB files with PythonOperators in Airflow on 2 vCPU Airflow nodes with 8 GiB RAM. It doesn't make sense to do anything more complex for files that small.
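For reference, a rough sketch of the operator route, assuming the apache-airflow-providers-amazon package (the import path and argument names differ between Airflow versions, so treat the details as illustrative); the DAG name, bucket, key, and remote path are made-up placeholders:

from datetime import datetime

from airflow import DAG
# Import path assumes the Amazon provider package; older releases used airflow.contrib.
from airflow.providers.amazon.aws.transfers.s3_to_sftp import S3ToSFTPOperator

with DAG(
    dag_id="s3_to_sftp_example",        # illustrative DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Connection IDs fall back to the provider defaults (AWS/SFTP connections
    # configured in Airflow); bucket, key, and remote path are placeholders.
    S3ToSFTPOperator(
        task_id="copy_report_to_sftp",
        s3_bucket="my-bucket",
        s3_key="exports/report.csv",
        sftp_path="/upload/report.csv",
    )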
The best solution would be not to transfer the files at all, and most likely to get rid of the EC2 instance while you are at it.
If you have a task that needs to run on some data in S3, then just run that task directly in Airflow.
If you can't run that task in Airflow because it needs a lot of compute or some unusual code that Airflow won't run, then have the EC2 instance read S3 directly.
If you're using Airflow to orchestrate the task because the task is watching the local filesystem on the EC2 instance, then just trigger the task and have the task read S3.
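As an illustration of running the work directly in Airflow, a PythonOperator can read the object straight from S3 with boto3; everything here (DAG name, bucket, key) is a made-up placeholder:

from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def process_s3_object():
    # Read the input straight from S3; nothing is staged on an EC2 box first.
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket="my-bucket", Key="incoming/data.csv")["Body"].read()
    # ... do the actual processing of `body` here ...


with DAG(
    dag_id="process_s3_directly",       # illustrative DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="process", python_callable=process_s3_object)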

Copy objects from S3 to Google Cloud Storage using the AWS CLI

Is it possible to access Google Cloud Storage using the AWS CLI?
Google Cloud Platform supports copying files from S3 to Google Cloud Storage using gsutil with the following command:
gsutil -m cp -R s3://bucketname gs://bucketname
But I need to do this with the AWS CLI instead of gsutil.
I am not aware of any solution from the AWS side, but unless you have a special reason not to use gsutil or other Google solution, you may consider using Google Cloud Storage Transfer Service instead. This service is recommended when transferring data from Amazon S3 buckets.
Compared with simply using gsutil or other CLI tools, Cloud Storage Transfer Service has several nice features, like the ability to schedule one-time or recurring transfers with advanced filters. You can also indicate whether the source objects should be deleted after transferring them, and even synchronize the destination bucket with the source, deleting objects in the destination that don't have a corresponding object in the source.
You can schedule transfers from the GCP Console or using the XML and JSON API.
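As a rough sketch of the JSON API route (field names follow the storagetransfer v1 API as documented, but check the current reference; the project, bucket names, and keys below are placeholders), a one-time transfer job could be created with the Google API Python client:

from googleapiclient.discovery import build  # google-api-python-client

# All identifiers below are placeholders.
transfer_job = {
    "description": "one-off S3 to GCS copy",
    "status": "ENABLED",
    "projectId": "my-gcp-project",
    "schedule": {
        # Same start and end date means the job runs once.
        "scheduleStartDate": {"year": 2024, "month": 1, "day": 1},
        "scheduleEndDate": {"year": 2024, "month": 1, "day": 1},
    },
    "transferSpec": {
        "awsS3DataSource": {
            "bucketName": "my-s3-bucket",
            "awsAccessKey": {
                "accessKeyId": "AWS_ACCESS_KEY_ID",
                "secretAccessKey": "AWS_SECRET_ACCESS_KEY",
            },
        },
        "gcsDataSink": {"bucketName": "my-gcs-bucket"},
    },
}

service = build("storagetransfer", "v1")
print(service.transferJobs().create(body=transfer_job).execute())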

How to replicate Amazon EBS to S3?

We have a site where users upload files, some of them quite large. We've got multiple EC2 instances and would like to load balance them. Currently, we store the files on an EBS volume for fast access. What's the best way to replicate the files so they can be available on more than one instance?
My thought is that some automatic replication process that uploads the files to S3, and then automatically downloads them to other EC2 instances would be ideal.
EBS snapshots won't work because they replicate the entire volume, and we need to be able to replicate the directories of individual customers on demand.
You could write a shell script that spawns s3cmd to sync your local filesystem with an S3 bucket whenever a new file is uploaded (or deleted). It would look something like:
s3cmd sync ./ s3://your-bucket/
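If you would rather trigger that from your application code than from a standalone shell script, a thin Python wrapper around the same s3cmd call (the directory and bucket below are placeholders) could be invoked after each upload or deletion:

import subprocess

def sync_uploads(local_dir="./uploads", bucket_url="s3://your-bucket/"):
    # Spawn s3cmd to mirror the local directory into the bucket;
    # --delete-removed also removes objects whose local files were deleted.
    subprocess.run(
        ["s3cmd", "sync", "--delete-removed", local_dir, bucket_url],
        check=True,
    )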
Depends on what OS you are running on your EC2 instances:
If you are running *nix the classic choice might be to run rsync and just sync between instances.
On Windows you could still use rsync, or else SyncToy from Microsoft is a simple free option. Otherwise there are probably hundreds of commercial applications in this space...
There isn't really any need to add S3 to the mix unless you want to store the files there for some other reason (like backup). If you do want to sync to S3 then I would suggest one of the S3 client apps like CloudBerry or JungleDisk, which both have sync functionality...
If you are running Windows it's also worth considering DFS (Distributed File System), which provides replication and is part of Windows Server...
The best way is to use the Amazon CloudFront service. All of the replication is managed as part of AWS. Content is served from several different availability zones, but this does not require you to have EBS volumes in those zones.
Amazon CloudFront delivers your static and streaming content using a global network of edge locations. Requests for your objects are automatically routed to the nearest edge location, so content is delivered with the best possible performance.
http://aws.amazon.com/cloudfront/
Two ways:
Forget EBS: transfer the files to S3, use S3 as your file manager rather than EBS, add CloudFront, and use the common link everywhere.
Mount the S3 bucket on any machine.
1. Amazon CloudFront is a web service for content delivery. It delivers your static and streaming content using a global network of edge locations.
http://aws.amazon.com/cloudfront/
2. You can mount an S3 bucket on your Linux machine. See below:
s3fs (http://code.google.com/p/s3fs/wiki/InstallationNotes) - this did work for me. It uses a FUSE filesystem plus rsync to sync the files in S3. It keeps a copy of all filenames in the local system and makes them look like files/folders.
That way you can share the S3 bucket between different machines.

Hadoop upload files from local machine to amazon s3

I am working on a Java MapReduce app that has to provide an upload service for pictures from the user's local machine to an S3 bucket.
The thing is, the app must run on an EC2 cluster, so I am not sure how I can refer to the user's local machine when copying the files. The method copyFromLocalFile(..) needs a path on the local machine, which in this case will be the EC2 cluster...
I'm not sure if I stated the problem correctly, can anyone understand what I mean?
Thanks
You might also investigate s3distcp: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
Apache DistCp is an open-source tool you can use to copy large amounts of data. DistCp uses MapReduce to copy in a distributed manner—sharing the copy, error handling, recovery, and reporting tasks across several servers. S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services, particularly Amazon Simple Storage Service (Amazon S3). Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
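On an EMR cluster, S3DistCp is normally submitted as a step; here is a hedged sketch using boto3 (the cluster ID, buckets, and paths are placeholders, and the exact jar and arguments depend on the EMR release):

import boto3

emr = boto3.client("emr")

# Placeholder cluster ID and locations; adjust for your EMR release.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "s3distcp-copy",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src", "s3://source-bucket/images/",
                    "--dest", "hdfs:///images/",
                ],
            },
        }
    ],
)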
You will need to get the files from the user's machine to at least one node before you will be able to use them in a MapReduce job.
The FileSystem and FileUtil functions refer to paths either on the HDFS or the local disk of one of the nodes in the cluster.
It cannot reference the user's local system. (Maybe if you did some ssh setup... maybe?)