Network connectivity between Amazon EC2 instances and Amazon S3

What is the network bandwidth between Amazon EC2 instances and Amazon S3? I am trying to figure out how long it would take me to copy data from Amazon S3 to Amazon EC2 (and vice versa).

This isn't published information, but ... it's fast.
On smaller instance classes, total Ethernet bandwidth available to the instance can easily be consumed by requests to S3, implying that the limitation isn't the connection to S3.
Provisioning a VPC endpoint for S3 access might also improve throughput to S3.
Bottom line, benchmark it. You will, of course, want to use a bucket that's provisioned in the same region as the instance, for both cost and performance reasons. Data transfer between EC2 and S3 is not billed within a region.
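
If it helps to make "benchmark it" concrete, here is a minimal sketch using boto3, run from the instance you care about; the bucket name, object key, and local path are placeholders for a test object you have already uploaded:

    # Rough S3 download-throughput check; bucket, key, and path are placeholders.
    import time
    import boto3

    s3 = boto3.client("s3")
    bucket, key, local_path = "my-test-bucket", "testfile.bin", "/tmp/testfile.bin"

    size_bytes = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

    start = time.time()
    s3.download_file(bucket, key, local_path)   # pull the object onto local disk
    elapsed = time.time() - start

    print(f"downloaded {size_bytes / 1e6:.0f} MB in {elapsed:.1f}s "
          f"({size_bytes * 8 / elapsed / 1e6:.0f} Mbit/s)")

Repeating the run a few times (and trying upload_file in the other direction) gives a more realistic picture than a single sample.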

Related

How to set up AWS S3 bucket as persistent volume in on-premise k8s cluster

Since NFS has a single point of failure, I am thinking of building a storage layer using S3 or Google Cloud Storage as a PersistentVolume in my local k8s cluster.
After a lot of Google searching, I still cannot find a way. I have tried using s3 FUSE to mount the bucket locally and then creating a PV by specifying the hostPath. However, a lot of my pods (for example Airflow, Jenkins) complain about having no write permission, or say "version being changed".
Could someone help me figure out the right way to mount an S3 or GCS bucket as a PersistentVolume from a local cluster without using AWS or GCP?
S3 is not a file system and is not intended to be used in this way.
I do not recommend using S3 this way because, in my experience, FUSE drivers are very unstable; under real I/O load you can easily wedge the mounted disk and end up stuck in a "Transport endpoint is not connected" nightmare for you and your infrastructure users. It can also lead to high CPU usage and memory leaks.
Useful crosslinks:
How to mount S3 bucket on Kubernetes container/pods?
Amazon S3 with s3fs and fuse, transport endpoint is not connected
How stable is s3fs to mount an Amazon S3 bucket as a local directory
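
If the application can be changed, the usual alternative to mounting is to talk to S3 through the SDK from the pod's own code rather than through a FUSE mount. A minimal sketch with boto3, where the bucket name and keys are placeholders and credentials are assumed to come from a mounted Secret or the node's IAM role:

    # Minimal sketch: use the S3 API from application code instead of a FUSE mount.
    # "my-bucket" and the object keys are placeholders; credentials come from the
    # environment (for example a mounted Secret or an instance role).
    import boto3

    s3 = boto3.client("s3")

    # write an object
    s3.put_object(Bucket="my-bucket", Key="jobs/output.txt", Body=b"hello from a pod")

    # read it back
    body = s3.get_object(Bucket="my-bucket", Key="jobs/output.txt")["Body"].read()
    print(body.decode())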

S3 to Redshift traffic go over internet?

We have data pipelines built to move data constantly from S3 to Redshift. I understand data is transferred to Redshift using the COPY command over HTTP/SSL. My question is whether this traffic stays within the VPC's internal network or goes over the internet.
What if I'm transferring from one S3 bucket to another S3 bucket in a different region, does that go over the internet?
Amazon Redshift has its own "back-end" connection to Amazon S3. The connection does not go via the VPC.
See: Can not copy data from s3 to redshift cluster in a private subnet
When transferring between Amazon S3 buckets in different regions, traffic will use Amazon-operated networks if they are available. (I'm not sure how data is transferred between two regions where there is no direct Amazon network connection.) However, the traffic is always encrypted.
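
For context, the COPY that triggers this back-end transfer typically looks like the sketch below, here issued through the Redshift Data API with boto3; the cluster, database, table, bucket, and role names are all placeholders. The cluster fetches the objects from S3 itself, regardless of where the statement is issued from:

    # Hypothetical COPY of CSV data from S3 into Redshift via the Redshift Data API.
    # Cluster, database, user, table, bucket, and role ARN are all placeholders.
    import boto3

    rsd = boto3.client("redshift-data")

    sql = """
        COPY analytics.events
        FROM 's3://my-data-bucket/events/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
        FORMAT AS CSV;
    """

    resp = rsd.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )
    print(resp["Id"])  # statement id; poll describe_statement() for completion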

Upload Speed of Files from EC2 to S3

How can we improve the upload speed of files from EC2 to S3 when the EC2 machine and the S3 bucket are in different regions?
I have created a file of 1 GB and I need to upload it to S3. The EC2 machine and the S3 bucket are located in different regions (but the same country): both are in the US, one in the East region and one in the West.
Can anyone please assist?
Unfortunately, there is not much you can do here. You are going to be limited to the available bandwidth between each datacenter.
You do have the option of moving your instance or bucket, to the other region. How much work this will be will depend on how much data you have in your bucket, or how you are currently using your instance.
A similar question is this.
There is also a separate service, AWS Data Pipeline, which provides reliable data transfer between S3 and EC2.
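
One thing that can help make better use of the cross-region link for a single 1 GB file is a parallel multipart upload. A sketch using boto3's transfer manager; the bucket, key, and file path are placeholders, and the tuning values are only a starting point:

    # Parallel multipart upload of a ~1 GB file; bucket/key/path are placeholders
    # and the tuning numbers are a starting point, not a recommendation.
    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,   # switch to multipart above 64 MB
        multipart_chunksize=64 * 1024 * 1024,   # 64 MB parts
        max_concurrency=16,                     # upload parts in parallel
    )

    s3.upload_file("/data/bigfile.bin", "my-west-bucket", "uploads/bigfile.bin",
                   Config=config)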

Hadoop upload files from local machine to amazon s3

I am working on a Java MapReduce app that has to be able to provide an upload service for some pictures from the local machine of the user to an S3 bucket.
The thing is, the app must run on an EC2 cluster, so I am not sure how I can refer to the local machine when copying the files. The method copyFromLocalFile(..) needs a path on the local machine, which in this case is the EC2 cluster...
I'm not sure if I stated the problem correctly, can anyone understand what I mean?
Thanks
You might also investigate s3distcp: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
Apache DistCp is an open-source tool you can use to copy large amounts of data. DistCp uses MapReduce to copy in a distributed manner—sharing the copy, error handling, recovery, and reporting tasks across several servers. S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services, particularly Amazon Simple Storage Service (Amazon S3). Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
You will need to get the files from the user's machine to at least one node before you will be able to use them in a MapReduce job.
The FileSystem and FileUtil functions refer to paths either on the HDFS or the local disk of one of the nodes in the cluster.
It cannot reference the user's local system. (Maybe if you did some ssh setup... maybe?)
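
One pattern that fits this constraint is to run the upload on the user's machine (where the pictures actually are), push them to S3 with the SDK, and have the job on the cluster read the resulting s3:// paths. A rough sketch, with the bucket, prefix, and local directory as placeholders:

    # Rough sketch: push local pictures to S3 from the user's machine so the
    # cluster job can read them via s3:// paths. All names are placeholders.
    import os
    import boto3

    s3 = boto3.client("s3")
    local_dir, bucket, prefix = "/home/user/pictures", "my-upload-bucket", "incoming/"

    for name in os.listdir(local_dir):
        path = os.path.join(local_dir, name)
        if os.path.isfile(path):
            s3.upload_file(path, bucket, prefix + name)
            print(f"uploaded {name} to s3://{bucket}/{prefix}{name}")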

Using Amazon cloud - do I need a NA location?

If I have my files hosted on Amazon S3, why would I need to use a cloud for North America? Wouldn't it just download from S3?
S3 has multiple regions. Currently there are four. Each S3 bucket is in a specific region. If you're using EC2 you will get the lowest prices (free bandwidth) and best performance (latency and bandwidth) if you use an S3 bucket in the same region as your EC2 instance.
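
As an illustration, the region is chosen when the bucket is created; a small sketch with boto3, where the bucket name and region are placeholders you would match to your instance:

    # Create a bucket in the same region as the EC2 instance (names are placeholders).
    import boto3

    region = "us-west-2"                      # match your instance's region
    s3 = boto3.client("s3", region_name=region)

    s3.create_bucket(
        Bucket="my-app-bucket",
        CreateBucketConfiguration={"LocationConstraint": region},
    )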