I'm moving my backend to Heroku and noticed that it recommends storing files on S3, so I'm a bit concerned: does S3 have the performance I need? Each request to my API will be fetching files (small ones), and I need it to be as fast as possible.
Related
When I run aws s3 sync $PREFIX/some-dir $PREFIX/other-dir, the transfer is quite slow; it appears to be limited by my network bandwidth.
However, when I do a "copy" operation through the S3 web console, it is much faster.
It seems like the AWS CLI is using my local machine as an intermediary for copying files, even though the source and destination are in the same bucket. Is there a way to make the CLI as fast as the web console?
Since NFS has a single point of failure, I am thinking of building a storage layer using S3 or Google Cloud Storage as a PersistentVolume in my local k8s cluster.
After a lot of Googling, I still cannot find a way. I have tried using an S3 FUSE mount to expose the bucket locally and then creating a PV that points at the mount via hostPath. However, a lot of my pods (for example Airflow and Jenkins) complain about missing write permissions, or report errors like "version being changed".
Could someone help me figure out the right way to mount an S3 or GCS bucket as a PersistentVolume from a local cluster that is not running on AWS or GCP?
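For reference, what I tried looks roughly like this (a sketch only; the bucket name, mount point, and capacity are placeholders, and an s3fs mount with kubectl on the node is assumed):

mkdir -p /mnt/s3-data
s3fs my-bucket /mnt/s3-data -o passwd_file=${HOME}/.passwd-s3fs -o allow_other   # mount the bucket locally via FUSE
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-hostpath-pv
spec:
  capacity:
    storage: 10Gi          # arbitrary figure; S3 has no real capacity limit
  accessModes:
    - ReadWriteMany
  hostPath:
    path: /mnt/s3-data     # the s3fs mount point from above
EOF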
S3 is not a file system and is not intended to be used in this way.
I do not recommend using S3 this way, because in my experience FUSE drivers are very unstable; under heavy I/O you can easily break the mounted disk and end up stuck in a "Transport endpoint is not connected" nightmare for you and your infrastructure users. It can also lead to high CPU usage and memory leaks.
Useful crosslinks:
How to mount S3 bucket on Kubernetes container/pods?
Amazon S3 with s3fs and fuse, transport endpoint is not connected
How stable is s3fs to mount an Amazon S3 bucket as a local directory
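If a FUSE mount does get stuck in the "Transport endpoint is not connected" state, recovery usually means forcing the unmount and mounting again. A sketch, assuming the bucket is mounted at /mnt/s3:

fusermount -u /mnt/s3      # unmount the wedged FUSE mount
sudo umount -l /mnt/s3     # lazy unmount as a fallback if the above fails, then remount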
So I am starting to use the MaxCDN CDN service.
Right now I have set my S3 bucket endpoint as the origin, so when there is a miss in the CDN it pulls the file from S3.
Would it be smart to enable CloudFront for this bucket and have MaxCDN pull the files from CloudFront, so it would be a "double CDN"?
Like MaxCDN (miss?) -> CloudFront (miss?) -> S3.
Am I right in my assumption that this is useless, because if there is a miss in MaxCDN it will probably also miss in CloudFront?
It seems a little unusual, but it's probably worth trying.
A sometimes overlooked benefit of pairing S3 with CloudFront is that, even when the request results in a cache miss (as well as for uploads, where relevant), the traffic between S3 (or EC2, any origin server inside AWS) and CloudFront is transported on a high-performance IP network owned and managed by Amazon.
The data still has to travel from S3 to the third-party CDN's requesting edge node, but running the requests through CloudFront will tend to make the traffic spend more of that distance on the global Amazon network than it would if the requests were made to S3 directly. This should mitigate some of the vagaries and variables involved in traversing large distances across the public Internet, and should translate into some kind of performance advantage, though it may be subtle and difficult to quantify; beyond that, it's hard to speculate.
CloudFront in the middle also changes the pricing you will pay for download bandwidth. If you connect the remote CDN directly to S3 as its origin, your per-gigabyte AWS bandwidth charges for transport are based only on the S3 region, and this charge differs among regions -- but your bucket is only in one region, and that region's charges will apply to all transfers.
But download transport between S3 and CloudFront is free, so your download bandwidth charges are based on the CloudFront edge location the remote CDN uses to fetch the content. This price also varies by region, and you can further control price/performance with the CloudFront pricing classes. In some cases, the bandwidth charges from CloudFront are lower than the charge for accessing S3 directly, but in other cases, they are higher.
Whether cache misses on the remote CDN are likely to correspond with cache misses on CloudFront depends heavily on the global distribution of your site visitors, on the number, distribution, and routing of the remote CDN's edge locations, and on how MaxCDN handles eviction of unpopular objects (and how CloudFront does, as well). MaxCDN appears to have fewer edge locations than CloudFront, so the odds of a MaxCDN cache miss coinciding with a CloudFront cache miss seem fairly high.
Still, it's easy enough to try it, and you can analyze the CloudFront stats and logs to determine how it's interacting.
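If you do try it, putting CloudFront in front of the bucket can be done with a single CLI call (a sketch; the bucket name is a placeholder), and MaxCDN's origin would then be pointed at the *.cloudfront.net domain name the command returns:

aws cloudfront create-distribution --origin-domain-name your-bucket.s3.amazonaws.com   # creates a distribution with default settings and this S3 origin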
We have a site where users upload files, some of them quite large. We've got multiple EC2 instances and would like to load balance them. Currently, we store the files on an EBS volume for fast access. What's the best way to replicate the files so they can be available on more than one instance?
My thought is that some automatic replication process that uploads the files to S3, and then automatically downloads them to other EC2 instances would be ideal.
EBS snapshots won't work because they replicate the entire volume, and we need to be able to replicate the directories of individual customers on demand.
You could write a shell script that spawns s3cmd to sync your local filesystem with an S3 bucket whenever a new file is uploaded (or deleted). It would look something like:
s3cmd sync ./ s3://your-bucket/
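A fuller sketch of that idea, assuming inotify-tools is installed and /var/uploads is the directory your app writes to (both are placeholders):

#!/bin/bash
# Re-sync the upload directory to S3 whenever files are created, modified, or deleted.
# Assumes s3cmd has already been configured (s3cmd --configure).
WATCH_DIR=/var/uploads
BUCKET=s3://your-bucket/

while inotifywait -r -e create -e modify -e delete "$WATCH_DIR"; do
    s3cmd sync --delete-removed "$WATCH_DIR/" "$BUCKET"
done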
Depends on what OS you are running on your EC2 instances:
There isn't really any need to add S3 to the mix unless you want to store them there for some other reason (like backup).
If you are running *nix, the classic choice might be to run rsync and just sync between instances (there's a sketch below).
On Windows you could still use rsync, or SyncToy from Microsoft is a simple free option; otherwise there are probably hundreds of commercial applications in this space...
If you do want to sync to S3 then I would suggest one of the S3 client apps like CloudBerry or JungleDisk, which both have sync functionality...
If you are running Windows it's also worth considering DFS (Distributed File System) which provides replication and is part of Windows Server...
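For the rsync option above, a minimal sketch (the host name, user, and paths are placeholders) that pushes the upload directory to a peer instance over SSH:

rsync -az --delete /var/uploads/ deploy@peer-instance:/var/uploads/   # -a preserves attributes, -z compresses, --delete mirrors deletions

This could be run from cron or triggered after each upload; with more than two instances you would run it against each peer.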
The best way is to use the Amazon CloudFront service. All of the replication is managed as part of AWS. Content is served from several different edge locations, without requiring you to have EBS volumes in those locations.
Amazon CloudFront delivers your static and streaming content using a global network of edge locations. Requests for your objects are automatically routed to the nearest edge location, so content is delivered with the best possible performance.
http://aws.amazon.com/cloudfront/
Two ways:
Forget EBS: transfer the files to S3, use S3 rather than EBS as your file store, add CloudFront, and use the common link everywhere.
Mount the S3 bucket on all the machines.
1. Amazon CloudFront is a web service for content delivery. It delivers your static and streaming content using a global network of edge locations.
http://aws.amazon.com/cloudfront/
2. You can mount the S3 bucket on your Linux machine. See below:
s3fs (http://code.google.com/p/s3fs/wiki/InstallationNotes) - this did work for me. It uses a FUSE file system plus rsync to sync the files to S3. It keeps a copy of all filenames in the local system and makes them look like regular files and folders.
That way you can share the S3 bucket on different machines.
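A sketch of that mount, assuming s3fs is installed; the access key, bucket name, and mount point are placeholders:

echo 'AKIAXXXXXXXX:your-secret-key' > ${HOME}/.passwd-s3fs   # s3fs credential file, format ACCESS_KEY:SECRET_KEY
chmod 600 ${HOME}/.passwd-s3fs                               # s3fs requires restrictive permissions
mkdir -p ${HOME}/s3-shared
s3fs your-bucket ${HOME}/s3-shared -o passwd_file=${HOME}/.passwd-s3fs -o allow_other

Run the same mount on each machine that needs access to the shared files.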
This may be a silly question, but seeing as transfers between EC2 and S3 are free as long as they stay within the same region, why isn't it possible to stream all transfers to and from S3 through EC2 and make the transfers completely free?
Specifically, I'm looking at Heroku, a Ruby on Rails hosting service that runs on EC2, where bandwidth is free. They already address uploads, and specifically note that these are free to S3 if streamed through Heroku. However, I was wondering why the same trick wouldn't work in reverse, such that any files requested are streamed through EC2?
If it is possible, would it be difficult to setup? I can't seem to find any discussion of this concept on Google.
The transfer is free, but it still costs money to store data on S3... Or am I missing something?