Amazon S3: when/why [closed] - amazon-s3

Closed 10 years ago.
So, I have a dedicated server. I host about dozen or so small sites.
Is there a real benefit in using S3(or Mosso) for my image and static file hosting? My server has more than enough disk space, or am I completely missing the point of S3?
I keep reading about how wonderful and cheap it is, and I ask myself "self, why aren't you using this" and the reply is always "why?"

If you're running within the included storage and bandwidth of your server and your needs are being served well, you are already doing the simplest thing that works for you, and that is where you should always start. Off the top of my head I can think of a few reasons why you may want to move some storage to S3 in the future:
Your storage or bandwidth needs grow beyond what you have and S3 is cheaper than upgrading your current solution
You move to a multiple-dedicated-server solution for failover/performance reasons and want to be able to store your assets in a single shared location
Your bandwidth needs are highly variable (so you can avoid a monthly fee when you're not getting traffic) [Thanks Jim, from the comments]

If you run an entire website off of a single machine, and that machine is more than enough to handle your site, then kudos, images are not a bottleneck that needs solving right now. Forget about S3 for now.
However, as your server gets busier, you will want your server to be spending all of its time doing server things. Transferring static content like flat HTML files and images is an easy, dumb job, and wasting precious active connections, bandwidth, and CPU cycles on them is no good. By switching to S3, your server can concentrate on doing what's important, which is whatever your program actually DOES.
S3 also has the benefit of being geographically distributed and attached to what's probably a fatter pipe than your server's, which means the images will show up slightly more quickly on your clients' machines, so that's an added bonus.
S3 is also backed up, which means that it makes for a pretty nice place to store pretty much any private data under the sun, in addition to stuff that you want to serve to others (although don't confuse the permissions settings between those two things -- in fact, you may want to use separate accounts entirely).
S3 is also nigh-infinite, which means that if you want to let users upload files to your site (profile images, attachments, etc), S3 is a great choice so that you don't have to constantly worry if your server is going to run out of disk space (obligatory $$$ warning here).
But like I said at the top, if you're a one-server setup with a handful of users, none of this really matters. It's a tool like any other, and it may not be something you need yet.

It's simply a matter of doing the numbers: given a certain amount of traffic for a set of files, you can calculate exactly how much hosting those files on S3 would cost you, and you should be able to do the same for your current provider. If the number is lower for S3, there you have your reason.
An added benefit is that S3 scales pretty much linearly with traffic and you pay only for what you actually use, whereas most providers charge you a flat fee no matter how little traffic you actually have, and some will gouge you badly if you ever exceed the maximum traffic included in the flat fee.
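As a rough sketch of "doing the numbers", the comparison might look like the following. Every rate and fee here is a made-up placeholder, not a real quote from Amazon or any hosting provider, and the function names are only illustrative; substitute the figures from your own price sheets.

```python
def usage_based_cost(storage_gb, transfer_gb, storage_rate, transfer_rate):
    """Monthly cost when you pay only for what you store and transfer."""
    return storage_gb * storage_rate + transfer_gb * transfer_rate


def flat_fee_cost(transfer_gb, flat_fee, included_gb, overage_rate):
    """Monthly cost of a flat-fee plan that charges extra past its included traffic."""
    overage_gb = max(0.0, transfer_gb - included_gb)
    return flat_fee + overage_gb * overage_rate


# All prices below are hypothetical placeholders chosen only for illustration.
for transfer_gb in (10, 500, 5000):
    pay_per_use = usage_based_cost(storage_gb=5, transfer_gb=transfer_gb,
                                   storage_rate=0.03, transfer_rate=0.10)
    hosted = flat_fee_cost(transfer_gb, flat_fee=50.0,
                           included_gb=1000, overage_rate=0.50)
    print(f"{transfer_gb:>5} GB/month: usage-based ${pay_per_use:.2f}, "
          f"flat-fee ${hosted:.2f}")
```

With placeholder numbers like these, the usage-based plan wins at very low traffic and again once the flat-fee plan starts charging overage, which is the pattern described above.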
Better speed and availability could be an additional benefit.
Basically, if you have a site that could potentially incur wildly disparate traffic, then using S3 for its images and other static files means that if you're hit by the Slashdot effect, the site has a much better chance of staying reachable, and you have a much better chance of avoiding nasty surprises concerning excess traffic fees.

The advantages of Amazon S3 are reliability, scalability, speed and cost. Here is some info on each.
Reliability: Amazon stores your data in multiple data centers. If there was a disaster and one data center was destroyed your content would continue to be served from the second data center. It’s very unlikely that data you upload to Amazon would ever be lost.
Scalability: If one of your web sites becomes popular and millions of people visit the site, your web server will not be able to handle the load. In comparison when you upload your files to Amazon they are stored in multiple locations. If the load on your content grows your files are automatically replicated to more servers so your files will always be available.
Speed: Amazon has a service called CloudFront that works in conjunction with Amazon S3. When you activate CloudFront on your S3 content your content is moved to edge locations. These are servers that make your content available for high speed transfer.
Cost: With Amazon S3 you only pay for what you use. If you have a few files that get little traffic you will only pay a few cents a month.
SprightlySoft has a blog post which gives even more reasons why Amazon S3 is great. Read it at http://sprightlysoft.com/blog/?p=8

If you're hosting a high-traffic site, the bandwidth cost (and latency issues) of hosting images yourself makes S3 and other services like Akamai attractive. For a low-traffic site, it probably isn't an issue.

I'd say that there's no reason if your base hosting plan provides enough space/bandwidth. Where I think it's useful is when your file transfers become enough that you have to look at buying an add-on of storage/bandwidth from the provider -- in that case, S3 may be a viable alternative. But if I'm paying $X/month and not using all of the storage, there's no upside to it.
On the other hand, if your capacity planning calls for you to someday exceed the provider's limits, S3 may be a good solution from the start so you don't have files being served from multiple places.

I would second the mention of "redundancy": you can count on any content that's in S3 being distributed to multiple data centers and, in effect, being almost always accessible to anyone with a functioning network connection.
Cost may be another factor: data transfer rates for S3 are quite competitive.
And speed is the last one: you can access data VERY fast from S3. But that's more of an issue for data other than browser-viewable images.

For small sites, S3 or Mosso may not be that reasonable for image hosting, but if you have any video files (.wmv, .flv, etc...) or large downloads (app distributions, etc..), I'd still put them on S3 or Mosso to save potential bandwidth spikes if for some odd reason, your content becomes wildly popular.

You write:
My server has more than enough disk space, or am I completely missing the point of S3?
You are not missing the point if what you have on your server is write-once, read-less-than-once stuff, such as disaster-recovery backups (which you hope will be read-never), because transfer times will not matter. The point of S3 is delivery speed.
First, S3 distributes your content geographically. End users benefit from shorter paths.
Second, S3 can act as a BitTorrent seed, which not only conserves your bandwidth, it means your most popular content will be distributed faster because it can take advantage of the ad-hoc swarm. There are reports on the AWS Discussion Forums that S3 support of the BitTorrent protocol is "very, very spotty." I have not tested it myself.

Many of you won't have this problem, but if you (and your web server) are located in Australia (read: the 3rd world of the Internet), you run into the issue that S3 does not have geographically close locations, which means there will be a higher latency on your images and other static content. Scalable: yes. Fast: no.

From what I hear, besides low cost, the main advantage is the ease of backup from an EC2 setup.
Link..
http://groups.drupal.org/node/2383

Speed might be the only benefit. If your dedicated server is simply networked through your ISP (which may well throttle upstream speeds even if downstream speeds are high) then you might find that your sites are often slow to load. If so, then S3 or another dedicated server provider can help. Other than that, I can think of absolutely no reason why Amazon's service would be more appropriate for you - especially with simple, static sites.

It's not really directly related to your actual hosting of web sites, but it's certainly an important part of it, especially if the sites don't belong to you alone -- S3 is a great backup solution. There are tools such as duplicity that can automatically and efficiently back things up onto S3 for you, and it's extremely cheap for this purpose. I back up a fairly large amount of data for less than $1/month.

Besides the fat-pipe and local-delivery arguments for S3, there is also the matter that a single server does not function optimally when it's acting as both a DB server and a file server. If you're running any sort of DB, I would suggest offloading all your static files to S3. The cost is trivial and you will see pretty big performance gains on page load.

Related

Static files as API GET targets

I'm creating a RESTful backend API for eventual use by a phone app, and am toying with the idea of making some of the API read functions nothing more than static files, created and periodically updated by my server-side code, that the app will simply GET directly.
Is this a good idea?
My hope is to significantly reduce the CPU and memory load on the server by not requiring any code to run at all for many of the API calls. However, there could potentially be a huge number of these files (at least one per user of the phone app, which will be a public app listed in the app stores that I naturally hope will get lots of downloads) and I'm wondering if that alone will lead to latency issues I'm trying to avoid.
Here are more details:
It's an Apache server
The hardware is a hosting provider's VPS with about 1gb memory and 20gb free disk space
The average file size (in terms of content and not disk footprint) will probably be < 1kb
I imagine my server-side code might update a given user's data once a day or so at most.
The app will probably do GETs on these files just a few times a day. (There's no real-time interaction going on.)
I might password protect the directory the files will be in at the .htaccess level, though there's no personal or proprietary information in any of the files, so maybe I don't need to, but if I do, will that make a difference in terms of the main question of feasibility and performance?
Thanks for any help you can give me.
This is generally a good thing to do: anything that can be static rather than dynamic is a win for performance and cost (it's why we do caching!), but the main issue is with authorization (which you'll still need to do for each incoming request).
You might also want to consider using a cloud service for storage of the static data (e.g., Amazon S3 or Google Cloud Storage). There are neat ways to provide temporary authorized URLs that you can pass to users so that they can read the data for a short time and then must re-authorize to continue having access.
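To illustrate the idea of a temporary authorized URL, here is a minimal sketch using only the Python standard library. It is not the signing scheme that Amazon S3 or Google Cloud Storage actually use (those services have their own signed-URL features); the secret key, the parameter names `expires` and `sig`, and both function names are assumptions made up for this example.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical server-side secret


def sign_url(path, ttl_seconds, now=None):
    """Return path plus an expiry timestamp and an HMAC signature over both."""
    expires = int(now if now is not None else time.time()) + ttl_seconds
    message = f"{path}:{expires}".encode()
    signature = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires, 'sig': signature})}"


def verify_url(path, expires, signature, now=None):
    """Accept only if the signature matches and the link has not expired."""
    current = now if now is not None else time.time()
    message = f"{path}:{int(expires)}".encode()
    expected = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing
    return hmac.compare_digest(expected, signature) and current <= int(expires)


print(sign_url("/data/user42.json", ttl_seconds=300))
```

The app would request a signed link after authenticating, use it until it expires, and then ask for a new one. Because the expiry time is part of the signed message, a client cannot extend its own access by editing the query string.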

Good idea to host data that will be downloaded internationally using S3?

I don't have any experience regarding server hosting performance and how slow it gets so I wanted to ask this question.
My situation is, I want to host a ~1MB data file that needs to be downloaded by clients occasionally (once every 2-3 days). Of course I would like to minimize costs as long as it does not hurt user experience too much. I have data to indicate that I have clients globally.
I wanted to ask what the ballpark figure would be for the amount of time it would take to download a file of this size from other parts of the world (data is hosted in the US). Does anyone have any idea, for instance, how long it would take to download a 1MB file from locations such as Japan?
In case people are wondering, I personally would consider it OK if it takes under 10s to download in most parts of the world.
The first thing to do when you don't know how well something works... is to try it. Create buckets in all of the regions, store a file, and then download it and see.
The official AWS-centric answer for global content distribution is to connect a CloudFront distribution to an S3 bucket, and set things up so that your content is downloaded from S3 via CloudFront. This will tend to improve download speeds more when the user is distant from the bucket, even if the content isn't cached at a CloudFront edge, because most of the distance the download has to travel, it will be traveling on the AWS "Edge Network," a global network connecting CloudFront to the AWS regions, with fewer unknowns than the Internet at large between here and wherever.
I have a global client base, but -- for example -- my shopping pages' catalog images are stored in S3 in Oregon (us-west-2), but with links pointing to CloudFront.
Interestingly, the pricing for using both services together sometimes works out a little bit less expensive than using only S3. A possible explanation for this is that edge network egress traffic represents a lower cost to AWS and the rates are set accordingly. It's not a major difference, but once you understand the pricing tables, you'll see it.
1MB in 10s equals 800kbps. I'd be very surprised if any reputable hosting provider couldn't keep up with that speed of delivery. Looking at Akamai rankings (2015)*, in Japan (as in your example) the average user's speed is 15Mbps: your file would then be downloaded in 0.53 seconds.
( *Looking at the rankings, keep in mind that in countries where fast internet infra is yet to be ubiquitous, the "average speed" will be an average of fast corporate pipes and other premium links, with actual mainstream users having substantially slower speeds.)
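The arithmetic behind those two figures can be checked with a short sketch. It assumes decimal units (1 MB = 8 megabits = 8,000 kilobits) and ignores latency, protocol overhead, and connection setup, so real downloads will be somewhat slower; the function names are just illustrative.

```python
def required_kbps(size_megabytes, seconds):
    """Throughput in kilobits per second needed to move the file in the given time."""
    return size_megabytes * 8 * 1000 / seconds


def download_seconds(size_megabytes, link_mbps):
    """Transfer time in seconds at a given link speed, ignoring latency and overhead."""
    return size_megabytes * 8 / link_mbps


print(required_kbps(1, 10))               # 800.0
print(round(download_seconds(1, 15), 2))  # 0.53
```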
Then in most cases, this will be up to the user's connection speed, and further, their ISP's international links, which can be much slower than their national or regional pipes. More so in countries with less developed internet infrastructure, where operators are cutting costs and corners.
In deciding if you need to deploy S3 or other CDN solutions, or no extra solutions at all, you'll have to start by mapping out your user demographics. If there's a substantial sector from far-away countries with weak net infra, it makes sense. Otherwise, it seems likely that your target speed of 1MB/10s would be met even without a special means of delivery.
If you have some but not substantial traffic from countries/regions where you reckon int'l traffic might be slower, and if you want to eliminate extra costs, I figure your users will survive even if it takes 15-20 seconds once in a blue moon as their speeds fluctuate. (This is opinion-based relative to how picky your users are!) In such a case, I'd only bother with a CDN if I wanted to improve speeds across the board, e.g. for all requests for static resources, not just a single file requested every couple of days. Would make a more substantial contribution towards the general user experience.

Microsoft Azure Blob Storage Upload Performance

I am running an Azure web role, which is storing very small blobs into Azure storage. (Blob upload is being done from the server, not from the browser.) I have searched stack overflow and the rest of the internet for tips on optimizing blob storage performance, and I believe I've checked and implemented all of the usual suspects: uploading async, allowing unlimited outgoing web connections (which now seems to be the default setting on web roles and no longer needs to be explicitly set in web.config or in code).
Tweaking the number of concurrent uploads I allow makes some difference, but regardless of what I've tried, I seem to max out at around 1,000 blob uploads per second. This is when running in the Azure web role, in the same region as the storage account (East US). My rate when running this from home over a good internet connection isn't much less, ~700 blobs/sec, which seems to tell me that it's not the network latency that's limiting the rate, it's the actual processing time of the storage service.
I wouldn't normally consider these rates horrible for this kind of a service, but I've read that Microsoft boasts a rate of ~20,000 storage transactions per second, so I've been a little disappointed with these results.
I'd like to get some feedback from those who have really tried to push the limits of blob storage. Does ~1000 small uploads per second sound about right? Or is there possibly something else I should be doing to improve this? I'll post the code if I need to, but I'd rather not receive speculative answers, I'd like to hear from developers who can either confirm that my results are reasonable, or that they've seen much higher throughput.
I should add that I'm currently running this in a small web role. I've tried it also in a medium web role, and didn't see any significant difference.
EDIT:
After a few days of development and testing, my upload rate seemed to suddenly increase. Not by a lot, but maybe by another ~200 per second. In looking around the web, I noticed a comment in the Azure documentation stating "A storage account scales automatically as usage increases." So I'm wondering if it really is capable of much higher rates, but will not automatically scale up until it sees a sustained period of high volume. Some confirmation of that would also be greatly appreciated.
Depending on how small your requests are, the problem might be caused by Nagle's algorithm (see "Nagle's Algorithm is Not Friendly towards Small Requests"), although usually I see that with queue and table operations. Try disabling Nagle's algorithm and let me know if that makes any difference. As an FYI, you have to disable it prior to establishing the connection; otherwise the change will not take effect.
Jason
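The question concerns a .NET web role, where the switch is normally made through the framework's own settings, not on a raw socket. Purely to show what "disabling Nagle" means at the socket level, here is a small Python sketch that sets the standard `TCP_NODELAY` option before any connection is made, in line with the advice above. The commented-out `connect` call is a placeholder, not a real endpoint.

```python
import socket

# Create a TCP socket and disable Nagle's algorithm before connecting.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back to confirm it took effect.
nagle_disabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print("Nagle disabled:", nagle_disabled)

# sock.connect((host, port)) would follow here, with a real host and port.
sock.close()
```

With Nagle's algorithm off, small writes are sent immediately instead of being held back to be coalesced with later data, which matters most for workloads made of many tiny requests.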

Simple DB and Amazon S3 performance tools

Are there any benchmark tools that I can use to test Amazon SimpleDB performance and Amazon S3 performance?
Help needed, please.
It's going to depend on your usage and whether you're running in EC2 or not. There are some benchmarks somewhere for S3 access from EC2, but your mileage will vary with object sizes, the SDK library you're using and where you're accessing from.
Roll your own tests and then you'll know that you're testing something close to your end goal...
You need to write your own code that approximates what you want to do.
Having said that: In my experience, S3 is about as fast as your connection. You may have to upload/download more than one item at a time to hit your local bandwidth limit, but you can get there.
Listing performance is also pretty good on S3, but the results are uncompressed XML, so they are a little large. If you want to do 'something' to, say, a million files, you need to run several requests in parallel. This goes for SimpleDB too. The number of requests 'in flight' that works best is a mix of ping, bandwidth, AWS service response and other factors.
SimpleDB on the other hand I find to be pretty slow for many tasks. It totally depends on your needs, though. Selecting a record and getting back the attributes when you know the db item name is usually ping time limited, but searching with the %like% operator is usually quite slow (seconds is easy to hit).
Add to this that it's all much faster if you are running on EC2 vs a local machine, and also add in the delay/bandwidth if your app is in, say, Singapore and you are trying to use the US Standard location to store everything. There is just too much to figure in.
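Since the advice is to write your own test and to vary the number of requests in flight, a minimal harness might look like the sketch below. `fetch_object` is a hypothetical stand-in that only sleeps to mimic network latency; in a real benchmark you would replace its body with an actual S3 or SimpleDB call from whichever SDK you use.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_object(key):
    """Hypothetical stand-in for one S3 or SimpleDB request; sleeps to mimic latency."""
    time.sleep(0.02)
    return key, len(key)


def fetch_all(keys, in_flight):
    """Run fetch_object over keys with at most in_flight requests at once."""
    with ThreadPoolExecutor(max_workers=in_flight) as pool:
        return list(pool.map(fetch_object, keys))


keys = [f"object-{i}" for i in range(20)]
for in_flight in (1, 4, 16):
    start = time.perf_counter()
    results = fetch_all(keys, in_flight)
    elapsed = time.perf_counter() - start
    print(f"{in_flight:>2} in flight: {len(results)} objects in {elapsed:.2f}s")
```

Running the same loop against the real service, from the machine and region you actually plan to use, shows where adding more parallel requests stops helping.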

Planning the development of a scalable web application

We have created a product that potentially will generate tons of requests for a data file that resides on our server. Currently we have a shared hosting server that runs a PHP script to query the DB and generate the data file for each user request. This is not efficient, but it has not been a problem so far. We want to move to a more scalable system, so we're looking into EC2. Our main concerns are being able to handle high amounts of traffic when they occur, and to provide low latency to users downloading the data files.
I'm not 100% sure on how this is all going to work yet but this is the idea:
We use an EC2 instance to host our admin panel and to generate the files that are being served to app users. When any admin makes a change that affects these data files (which are downloaded by users), we make a copy over to S3 using CloudFront. The idea here is to get data cached and waiting on S3 so we can keep our compute times low, and to use CloudFront to get low latency for all users requesting the files.
I am still learning the system and wanted to know if anyone had any feedback on this idea or insight in to how it all might work. I'm also curious about the purpose of projects like Cassandra. My understanding is that simply putting our application on EC2 servers makes it scalable by the nature of the servers. Is Cassandra just about keeping resource usage low, or is there a reason to use a system like this even when on EC2?
CloudFront: http://aws.amazon.com/cloudfront/
EC2: http://aws.amazon.com/ec2/
Cassandra: http://cassandra.apache.org/
Cassandra is a non-relational database engine, and if this is what you need, you should first evaluate Amazon's SimpleDB: a non-relational database engine built on top of S3.
If the file only needs to be updated based on time (daily, hourly, ...) then this seems like a reasonable solution. But you may consider placing a load balancer in front of 2 EC2 images, each running a copy of your application. This would make it easier to scale later and safer if one instance fails.
Some other services you should read up on:
http://aws.amazon.com/elasticloadbalancing/ -- Amazon's load balancer solution.
http://aws.amazon.com/sqs/ -- Used to pass messages between systems, in your DA (distributed architecture). For example if you wanted the systems that create the data file to be different than the ones hosting the site.
http://aws.amazon.com/autoscaling/ -- Allows you to adjust the number of instances online based on traffic
Make sure to have a good backup process with EC2: snapshot your OS drive often and place any volatile data (e.g. database files) on an EBS volume. EC2 doesn't fail often, but when it does you don't have access to the hardware; if you have an up-to-date snapshot you can just kick a new instance online.
Depending on the datasets, Cassandra can also significantly improve response times for queries.
There is an excellent explanation of the data structure used in NoSQL solutions that may help you see if this is an appropriate solution:
WTF is a Super Column