We have data pipelines built to move data constantly from S3 to Redshift. I understand the data is transferred to Redshift using the COPY command over HTTP/SSL. My question is whether this traffic stays within the VPC's internal network or goes over the internet.
What if I'm transferring from one S3 bucket to another S3 bucket in a different region? Does that traffic go over the internet?
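For context, a COPY of this kind, issued here from Java over JDBC, might look roughly like the following sketch; the cluster endpoint, credentials, table, bucket, and IAM role are all hypothetical placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class RedshiftCopyExample {
    public static void main(String[] args) throws SQLException {
        // Hypothetical cluster endpoint and credentials; requires the Amazon Redshift
        // JDBC driver on the classpath.
        String url = "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev";
        try (Connection conn = DriverManager.getConnection(url, "my_user", "my_password");
             Statement stmt = conn.createStatement()) {
            // Redshift fetches the objects from S3 itself; only this SQL statement
            // travels over the JDBC connection.
            stmt.execute(
                "COPY analytics.events "
              + "FROM 's3://my-pipeline-bucket/events/' "
              + "IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole' "
              + "CSV GZIP");
        }
    }
}
```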
Amazon Redshift has its own "back-end" connection to Amazon S3. The connection does not go via the VPC.
See: Can not copy data from s3 to redshift cluster in a private subnet
When transferring between Amazon S3 buckets in different regions, the traffic will use Amazon-operated networks where they are available. (I'm not sure how data is transferred between two regions that have no direct Amazon network connection.) However, the traffic is always encrypted.
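As an illustration, a bucket-to-bucket copy can be done entirely server-side with the S3 CopyObject API, so the object bytes never pass through your client at all (objects over 5 GB need a multipart copy instead). A minimal sketch with the AWS SDK for Java v1, with hypothetical bucket and key names:

```java
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class CrossRegionCopyExample {
    public static void main(String[] args) {
        // Client is pointed at the destination bucket's region (hypothetical buckets and keys).
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion(Regions.EU_WEST_1)
                .build();

        // S3 performs the copy itself; only the request and response travel between
        // this client and S3, not the object data.
        s3.copyObject("my-us-east-1-bucket", "data/file.json",
                      "my-eu-west-1-bucket", "data/file.json");
    }
}
```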
Since NFS has a single point of failure, I am thinking of building a storage layer that uses S3 or Google Cloud Storage as a PersistentVolume in my local k8s cluster.
After a lot of Google searching, I still cannot find a way to do this. I have tried using s3fs (FUSE) to mount the bucket locally and then creating a PV that points at it via hostPath. However, a lot of my pods (for example Airflow and Jenkins) complain about missing write permissions or errors like "version being changed".
Could someone help me figure out the right way to mount an S3 or GCS bucket as a PersistentVolume from a local cluster, without running on AWS or GCP?
S3 is not a file system and is not intended to be used in this way.
I do not recommend using S3 this way, because in my experience FUSE drivers are very unstable, and under real I/O load you will easily wreck your mounted disk and get stuck in a "Transport endpoint is not connected" nightmare for you and your infrastructure users. It can also lead to high CPU usage and RAM leaks.
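If the goal is just to let pods read and write objects, a more robust pattern than a FUSE mount is usually to talk to S3 through its API from the application itself. A minimal sketch with the AWS SDK for Java v1, with a hypothetical bucket and key:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3ApiInsteadOfMount {
    public static void main(String[] args) {
        // Credentials come from the default provider chain (env vars, instance profile, etc.).
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Write an artifact the way a pod would, then read it back,
        // instead of pretending the bucket is a POSIX filesystem.
        s3.putObject("my-artifacts-bucket", "jobs/run-42/result.json", "{\"status\":\"ok\"}");
        String body = s3.getObjectAsString("my-artifacts-bucket", "jobs/run-42/result.json");
        System.out.println(body);
    }
}
```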
Useful crosslinks:
How to mount S3 bucket on Kubernetes container/pods?
Amazon S3 with s3fs and fuse, transport endpoint is not connected
How stable is s3fs to mount an Amazon S3 bucket as a local directory
I am working on a project that has a requirement to store scientific data on AWS S3 as the raw data for the beginning of a data lake. We are planning to use JSON for the application data and S3 object metadata to persist application metadata (JSON schema) and process metadata. At the moment, on site, S3 is the only service from the AWS cloud that we have available to us.
The client would like a publish environment where they can get the raw data back as files. We would like to avoid building a custom catalog and security infrastructure.
I don't see anything about Apache Atlas that will connect directly to AWS S3, but we could put Apache Hive on top of AWS S3 and then put Apache Atlas and Ranger on top of that. I'm not sure whether that is a way to publish the raw data from S3, or whether it even works, since Hive is more of a processing environment.
Is it possible to use Apache Atlas and Ranger on top of AWS S3 directly?
What is the network bandwidth between Amazon EC2 instances and Amazon S3? I am trying to figure out how long it would take me to copy data from Amazon S3 to Amazon EC2 (and vice versa).
This isn't published information, but ... it's fast.
On smaller instance classes, total Ethernet bandwidth available to the instance can easily be consumed by requests to S3, implying that the limitation isn't the connection to S3.
Provisioning a VPC endpoint for S3 access might also improve throughput to S3.
Bottom line, benchmark it. You will, of course, want to use a bucket that's provisioned in the same region as the instance, for both cost and performance reasons. Data transfer between EC2 and S3 is not billed within a region.
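A rough way to benchmark it, offered here as a sketch rather than a definitive harness, is to time a large GetObject from an instance in the same region and compute MiB/s; the bucket and key below are hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;

public class S3ThroughputProbe {
    public static void main(String[] args) throws IOException {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        long start = System.nanoTime();
        long bytes = 0;
        S3Object object = s3.getObject("my-benchmark-bucket", "testdata/1gb.bin");
        byte[] buffer = new byte[8 * 1024 * 1024];
        try (InputStream in = object.getObjectContent()) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                bytes += n;   // drain the stream so we measure the full transfer time
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        double mib = bytes / (1024.0 * 1024.0);
        System.out.printf("Read %.0f MiB in %.1f s (%.1f MiB/s)%n", mib, seconds, mib / seconds);
    }
}
```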
How can we improve the upload speed of files from EC2 to S3 when the EC2 machine and the S3 bucket are in different regions?
I have created a 1 GB file and need to upload it to S3. The EC2 machine and the S3 bucket are located in different regions (but in the same country).
Both are in the US, but one is in the East region and the other in the West.
Can anyone please assist with this?
Unfortunately, there is not much you can do here. You are going to be limited to the available bandwidth between the two datacenters.
You do have the option of moving your instance or bucket to the other region. How much work this will be depends on how much data you have in your bucket, or on how you are currently using your instance.
A similar question is this.
There is also a separate service, AWS Data Pipeline, which provides reliable data transfer between S3 and EC2.
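One thing that can at least help saturate whatever cross-region bandwidth is available is to let the SDK split the 1 GB file into parallel multipart uploads; this won't raise the bandwidth ceiling described above, but it avoids being limited by a single connection. A minimal sketch with the AWS SDK for Java v1 TransferManager, with a hypothetical bucket, key, and file path:

```java
import java.io.File;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;

public class CrossRegionUploadExample {
    public static void main(String[] args) throws InterruptedException {
        // Point the client at the destination bucket's region (us-west-2 here, hypothetically).
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion(Regions.US_WEST_2)
                .build();

        // TransferManager automatically uses multipart uploads with several parallel parts.
        TransferManager tm = TransferManagerBuilder.standard()
                .withS3Client(s3)
                .build();
        try {
            Upload upload = tm.upload("my-west-bucket", "exports/bigfile-1gb.dat",
                                      new File("/data/bigfile-1gb.dat"));
            upload.waitForCompletion();
            System.out.println("Upload finished: " + upload.getState());
        } finally {
            tm.shutdownNow();
        }
    }
}
```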
I am working on a Java MapReduce app that has to provide an upload service for pictures from the user's local machine to an S3 bucket.
The thing is, the app must run on an EC2 cluster, so I am not sure how I can refer to the local machine when copying the files. The copyFromLocalFile(..) method needs a path on the local machine, which in this case would be the EC2 cluster...
I'm not sure if I stated the problem correctly, can anyone understand what I mean?
Thanks
You might also investigate s3distcp: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
Apache DistCp is an open-source tool you can use to copy large amounts of data. DistCp uses MapReduce to copy in a distributed manner—sharing the copy, error handling, recovery, and reporting tasks across several servers. S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services, particularly Amazon Simple Storage Service (Amazon S3). Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3.
You will need to get the files from the user's machine to at least one node before you will be able to use them in a MapReduce job.
The FileSystem and FileUtil functions refer to paths either on the HDFS or the local disk of one of the nodes in the cluster.
It cannot reference the user's local system. (Maybe if you did some ssh setup... maybe?)
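Once a picture is on one of the nodes (for example, received by a small upload service running there), the Hadoop FileSystem API can push it from that node's local disk to the bucket. A minimal sketch, assuming the s3a connector is configured on the cluster and using a hypothetical bucket name:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadFromNodeToS3 {
    public static void main(String[] args) throws Exception {
        // Picks up fs.s3a.* settings (credentials, endpoint) from the cluster configuration.
        Configuration conf = new Configuration();
        FileSystem s3 = FileSystem.get(URI.create("s3a://my-upload-bucket"), conf);

        // The source path is on this node's local disk, i.e. the file must already
        // have been transferred from the user's machine to the node.
        s3.copyFromLocalFile(new Path("/tmp/uploads/photo-123.jpg"),
                             new Path("s3a://my-upload-bucket/photos/photo-123.jpg"));
        s3.close();
    }
}
```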