Should I persist images on EBS or S3?

I am migrating my Java, Tomcat, MySQL server to AWS EC2.
I have already attached an EBS volume for storing MySQL data. In my web application, people may upload images, so I need to persist them. There are two alternatives in my mind:
Save uploaded images to EBS volume.
Use the S3 service.
The following are my notes; please be skeptical about them, as my expertise is not in servers but in software development.
EBS plus: S3 storage is more expensive ($0.15/GB > $0.10/GB).
S3 plus: Serving static files from EBS may affect my web server's performance negatively. Is this true? Does serving images affect server performance notably? With S3, my server will not be responsible for serving static files.
S3 plus: Serving static files from EBS incurs I/O cost, though it will probably be minor.
EBS plus: People say EBS is faster.
S3 plus: People say S3 is safer for persistence.
EBS plus: No need to learn an API; it is straightforward to save the images to an EBS volume.
In short, I cannot decide and would be happy if you could guide me.
Thanks

The price comparison is not quite right:
S3 charges are $0.14 per GB USED, whereas EBS charges are $0.10 per GB PROVISIONED (the size of your EBS volume), whether you use it or not. As a result, S3 may or may not be cheaper than EBS.
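As a quick illustration using those prices (a made-up usage pattern): if you provision a 100 GB EBS volume but only hold 50 GB of images, EBS costs 100 × $0.10 = $10/month while S3 costs 50 × $0.14 = $7/month; once the volume is nearly full, the comparison flips in EBS's favor.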

I'm currently using S3 for a project and it's working extremely well.
EBS means you need to manage a volume plus the machines to attach it to. You need to add space as it fills up and perform backups (not saying you shouldn't back up your S3 data, just that it's not as critical).
It also makes it harder to scale: when you want to add machines, you either need to move the images off to a separate machine or clone them across all of them. This also means you're adding a bottleneck: you'll have to manage your own upload process that either uploads to all machines or has a single machine managing it.
I recommend S3: it's set-and-forget. Any number of machines can be performing uploads in parallel, and you don't really need to notify other machines about the upload.
In addition, you can use Amazon CloudFront as a cheap CDN in front of the images instead of downloading directly from S3.
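To give a feel for the "learn the API" cost mentioned in the question, here is a minimal sketch of what an upload might look like with the AWS SDK for Java (v1); the bucket name and key handling are placeholders, not anything from the question:

```java
import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class ImageUploader {

    // Credentials come from the default provider chain
    // (environment variables, instance profile, etc.).
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    /** Uploads one image; any number of app servers can call this in parallel. */
    public void upload(File image, String key) {
        // "my-image-bucket" is a placeholder bucket name.
        s3.putObject("my-image-bucket", key, image);
    }
}
```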

I have architected solutions on AWS for stock photography sites which store millions of images spanning terabytes of data, and I would like to share some best practices in AWS for your requirement:
P1) Store the original image files using the S3 Standard storage class.
P2) Store reproducible images such as thumbnails using the S3 Reduced Redundancy Storage (RRS) option to save costs (see the sketch after this list).
P3) Metadata about the images, including the S3 URL, can be stored in Amazon RDS or Amazon DynamoDB depending upon the query complexity. If your queries are complex, it is also common practice to store the metadata in Amazon CloudSearch or Apache Solr.
P4) Deliver your thumbnails to users with low latency using Amazon CloudFront.
P5) Queue your image conversions through either SQS or RabbitMQ on Amazon EC2.
P6) If you are planning to use EBS, note that a volume cannot be shared across EC2 instances. Ideally you can use GlusterFS as a common storage pool for all your images; multiple Auto Scaled Amazon EC2 instances can then connect to it and read/write images.
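A rough sketch of P1 and P2 with the AWS SDK for Java (v1); the bucket names are placeholders, not recommendations:

```java
import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.StorageClass;

public class ImageStore {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    public void storeOriginal(File original, String key) {
        // P1: originals go to the default (Standard) storage class.
        s3.putObject("photo-originals", key, original);          // placeholder bucket
    }

    public void storeThumbnail(File thumb, String key) {
        // P2: thumbnails are reproducible, so Reduced Redundancy is enough.
        s3.putObject(new PutObjectRequest("photo-thumbs", key, thumb)  // placeholder bucket
                .withStorageClass(StorageClass.ReducedRedundancy));
    }
}
```

The only difference between the two calls is the storage class set on the request; everything else about the upload path stays the same.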

You already outlined the advantages and disadvantages of both.
If you are planning to store terabytes of images, with storage requirements increasing day after day, S3 will probably be your best bet as it is built especially for these kinds of situations. You get unlimited storage space, without having to worry about sharding your data over many EBS volumes.
The recurring cost of S3 is that, per gigabyte, it comes out roughly 50% more expensive than EBS (using the prices you quoted). You will also have to learn the API and implement it in your application, but that is a one-off expense which I think you should be able to absorb very quickly.

Do you expect the images to last indefinitely?
The Amazon EBS FAQ is pretty clear; the annual failure rate is not "essentially zero"; they quote 0.1% to 0.5%. It's better than the disk under your desk, but it would need some kind of backup.

Related

AWS S3 alternatives for private cloud

Right now we have a requirement to migrate from AWS to a private data center. We need to find a potential storage alternative to AWS S3.
Currently S3 is used in the following way:
Overall storage size is 10 TB;
Min/Avg/Max object size is 0.5/2/100 MB;
We have N app instances that simultaneously write/read objects at approximately 50 writes/sec and 30 reads/sec;
This storage should be redundant (highly available), fault tolerant, and scalable.
A naive implementation could be to store this data on:
Simple NFS storage with some replication functionality added;
A NoSQL DB (for example Cassandra). However, Cassandra will require a number of instances to support this storage (it's not recommended to store more than 1 TB on one Cassandra node; see Cassandra capacity planning).
What solution would you recommend for such a scenario?
Using MinIO is your best bet if you want private cloud storage. It is AWS S3 compatible, meaning that applications that use AWS S3 can be migrated to MinIO seamlessly. They have a tutorial on how to connect a MinIO server with the AWS CLI; you can test it against the publicly hosted MinIO server at https://play.min.io:9000. Please refer to AWS CLI with MinIO Server.
You can have a highly available storage system using MinIO's distributed setup. Beware that dynamic expansion is not a feature of the MinIO distributed setup: if you want to expand your cluster, you end up spinning up a new cluster with your desired number of servers/disks and then migrating your data from the old one to the new one.
I find it much easier to use than HDFS. In addition, a lot of technologies outside the Hadoop ecosystem lack HDFS integration. For example, Docker Registry has no built-in HDFS storage driver, but it does have an S3 driver, so you can use MinIO as its object storage.
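Because MinIO speaks the S3 API, the same AWS SDK for Java (v1) client can be pointed at it just by overriding the endpoint; the credentials below are placeholders for your own MinIO keys:

```java
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class MinioClientExample {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                // Point the S3 client at the MinIO server instead of AWS.
                .withEndpointConfiguration(
                        new EndpointConfiguration("https://play.min.io:9000", "us-east-1"))
                // MinIO is usually addressed with path-style URLs (bucket in the path).
                .withPathStyleAccessEnabled(true)
                // Placeholder credentials; use your own MinIO access/secret keys.
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")))
                .build();

        s3.listBuckets().forEach(b -> System.out.println(b.getName()));
    }
}
```

The application code above the client (putObject, getObject, and so on) does not need to change when you move between AWS S3 and MinIO.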
There are a bunch of options for an S3-compatible private cloud service. If you like open-source solutions, the OpenStack Swift and Cassandra options mentioned in the other answers are good ones. Note that no matter what you use, you will probably end up setting up a cloud with multiple nodes; that is the inevitable price of redundancy and availability. There are some good commercial and economical products as well, such as the one from Cloudian.
If you need an object store, I can recommend Elliptics (documentation available in English).
As far as I know, it doesn't have limits on disk storage.
In our Cassandra setup we are using SSD disks (for better performance) of less than 200-500 GB. The ring size would depend on your requirements (read/write latency, replication factor, time to live).
50 writes/sec, 30 reads/sec
This is really quite easy for Cassandra, comparing it with our setup.
In that case it depends more on the time to live of your objects.
Generally, for a distributed setup you could also look at GlusterFS.
You can use OpenStack Swift
Swift is a highly available, distributed, eventually consistent object/blob store. Organizations can use Swift to store lots of data efficiently, safely, and cheaply.
Learn more at https://docs.openstack.org/swift/latest/
and https://oldhenhut.com/2016/05/31/s3-vs-swift/

Using Glacier as back end for web crawling

I will be doing a crawl of several million URLs from EC2 over a few months and I am thinking about where I ought to store this data. My eventual goal is to analyze it, but the analysis might not be immediate (even though I would like to crawl it now for other reasons) and I may want to eventually transfer a copy of the data out for storage on a local device I have. I estimate the data will be around 5TB.
My question: I am considering using Glacier for this, with the idea that I will run a multithreaded crawler that stores the crawled pages locally (on EBS) and then use a separate thread that combines, compresses, and shuttles that data to Glacier. I know transfer speeds to Glacier are not necessarily good, but since there is no online element of this process, it seems feasible (especially since I could always increase the size of my local EBS volume in case I'm crawling faster than I can store to Glacier).
Is there a flaw in my approach or can anyone suggest a more cost-effective, reliable way to do this?
Thanks!
Redshift seems more relevant than Glacier. Glacier is all about freeze / thaw and you'll have to move the data prior to doing any analysis.
Redshift is more about adding the data into a large, inexpensive, data warehouse and running queries over it.
Another option is to store the data in EBS and leave it there. When you're done with your crawling, take a snapshot to push the volume into S3 and decommission the volume and EC2 instance. Then, when you're ready to do the analysis, just create a volume from the snapshot.
The upside of this approach is that it's all file access (no formal data store) which may be easier for you.
Personally, I would probably push the data into Redshift. :-)
--
Chris
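If you take the snapshot route, here is a minimal sketch of kicking one off with the AWS SDK for Java (v1); the volume ID is a placeholder:

```java
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.CreateSnapshotRequest;
import com.amazonaws.services.ec2.model.CreateSnapshotResult;

public class CrawlArchiver {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // "vol-0123456789abcdef0" is a placeholder volume ID.
        CreateSnapshotResult result = ec2.createSnapshot(
                new CreateSnapshotRequest("vol-0123456789abcdef0", "crawl data snapshot"));
        System.out.println("Snapshot started: " + result.getSnapshot().getSnapshotId());

        // Once the snapshot completes you can detach/delete the volume and
        // recreate a new volume from the snapshot when you're ready to analyze.
    }
}
```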
If your analysis will not be immediate, then you can adopt one of the following two approaches:
Approach 1) Amazon EC2 crawler -> store on EBS disks -> move frequently to Amazon S3 -> archive regularly to Glacier (see the lifecycle-rule sketch at the end of this answer). You can keep your last X days of data in Amazon S3 and use it for ad hoc processing as well.
Approach 2) Amazon EC2 crawler -> store on EBS disks -> move frequently to Amazon Glacier. Retrieve when needed and do the processing on EMR or other processing tools.
If you need frequent analysis:
Approach 3) Amazon EC2 crawler -> store on EBS disks -> move frequently to Amazon S3 -> analyze through EMR or other tools, store the processed results in S3/DB/MPP, and move the raw files to Glacier.
Approach 4) If your data is structured: Amazon EC2 crawler -> store on EBS disks -> load into Amazon Redshift, and move the raw files to Glacier.
Additional tips:
If you can retrieve the data again (from the source), then you can use ephemeral disks for your crawlers instead of EBS.
Amazon has introduced the Data Pipeline service; check whether it fits your needs for data movement.
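For the "archive regularly to Glacier" step in Approaches 1 and 3, one option is to let S3 do the moving for you via a lifecycle rule rather than pushing to Glacier yourself. A hedged sketch with the AWS SDK for Java (v1); the bucket name, key prefix, and 30-day threshold are placeholders:

```java
import java.util.Arrays;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration.Rule;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration.Transition;
import com.amazonaws.services.s3.model.StorageClass;

public class GlacierLifecycle {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Archive crawled pages to Glacier 30 days after they land in S3.
        Rule archiveRule = new Rule()
                .withId("archive-crawl-data")
                .withPrefix("crawl/")                       // placeholder key prefix
                .withStatus(BucketLifecycleConfiguration.ENABLED)
                .addTransition(new Transition()
                        .withDays(30)
                        .withStorageClass(StorageClass.Glacier));

        s3.setBucketLifecycleConfiguration("my-crawl-bucket",   // placeholder bucket
                new BucketLifecycleConfiguration().withRules(Arrays.asList(archiveRule)));
    }
}
```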

S3 or EBS for storing data in flat files

I have flat files in which I store data and retrieve it, instead of storing it in a database. This is temporary and may last for a couple of months. I was wondering if I should be using EBS or S3. EBS is mainly used for I/O and S3 for content delivery, but S3 is a pay-as-you-use model whereas with EBS you pay for the volume provisioned, right?
Please guide me: which one is better?
S3 sounds like it's more appropriate for your use case.
S3 is object storage. Think of it as an Amazon-run file server. (Objects are not exactly equal to files, but it's close enough here.) You tell S3 to put a file, it'll store it. You tell S3 to get a file, it'll return it. You tell S3 to delete it, it's gone. This is easy to work with and very scalable.
EBS is block storage. Think of it as an Amazon-run external hard drive. You can plug an EBS volume into an EC2 virtual machine, or you can access it over the Internet via AWS Storage Gateway. Like an external hard drive, you can only plug it into one computer at a time. The size is set up front, and while there are ways to grow and shrink it, you're paying for all the bits all the time. It's also much more complex than S3, since it has to provide strong consistency guarantees for the entire volume, instead of just on a file-by-file basis.
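To make the put/get/delete point concrete, a minimal sketch with the AWS SDK for Java (v1); the bucket name and paths are placeholders:

```java
import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;

public class FlatFileStore {

    private static final String BUCKET = "my-flat-files"; // placeholder bucket

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    public void put(String key, File file) {
        s3.putObject(BUCKET, key, file);                   // store it
    }

    public void get(String key, File target) {
        // Downloads the object straight into a local file.
        s3.getObject(new GetObjectRequest(BUCKET, key), target);
    }

    public void delete(String key) {
        s3.deleteObject(BUCKET, key);                      // it's gone
    }
}
```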
To build on the good answer from willglynn: if you are interacting with the data regularly, or need more file-system-like access, you might consider EBS more strongly.
If the amount of data is relatively small and you read and write to the data store regularly, you might consider something like ElastiCache for in-memory storage, which would likely be superior performance-wise to using S3 or EBS.
Similarly, you might look at DynamoDB for document-type storage, especially if you need to be able to search/filter across your data objects.
Point 1) You can use either S3 or EBS for this. If you want reduced latency and your file sizes are bigger, then EBS is the better option.
Point 2) If you want lower costs, then S3 is a better option.
From what you describe, S3 will be the most cost-effective and likely easiest solution.
Pros to S3:
1. You can access the data from anywhere. You don't need to spin up an EC2 instance.
2. Crazy data durability numbers.
3. Nice versioning story around buckets.
4. Cheaper than EBS
Pros to EBS
1. Handy to have the data on a file system in EC2. That lets you do normal processing with the Unix pipeline.
2. Random Access patterns work as you would expect.
3. It's a drive. Everyone knows how to deal with files on drives.
If you want to get away from a flat file, DynamoDB provides a nice set of interfaces for putting lots and lots of rows into a table, then running operations against those rows.
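If you do end up moving from flat files to DynamoDB, here is a rough sketch with the Document API from the AWS SDK for Java (v1); the table name and attributes are made up for illustration:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;

public class RecordStore {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        DynamoDB dynamoDB = new DynamoDB(client);

        // "records" and its attributes are placeholders for your own schema.
        Table table = dynamoDB.getTable("records");
        table.putItem(new Item()
                .withPrimaryKey("id", "row-0001")
                .withString("payload", "{\"field\":\"value\"}")
                .withLong("createdAt", System.currentTimeMillis()));

        Item fetched = table.getItem("id", "row-0001");
        System.out.println(fetched.toJSONPretty());
    }
}
```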

On what factors does the download speed of assets from Amazon S3 depend?

How fast can we download files from Amazon S3? Is there an upper limit (with the available bandwidth distributed between all the requests from the same user), or does it only depend on my Internet connection's download speed? I couldn't find it in their SLA.
What other factors does it depend on? Do they throttle the data transfer rate at some level to prevent abuse?
This has been addressed in the recent Amazon S3 team post Amazon S3 Performance Tips & Tricks:
First: for smaller workloads (<50 total requests per second), none of the below applies, no matter how many total objects one has! S3 has a bunch of automated agents that work behind the scenes, smoothing out load all over the system, to ensure the myriad diverse workloads all share the resources of S3 fairly and snappily. Even workloads that burst occasionally up over 100 requests per second really don't need to give us any hints about what's coming...we are designed to just grow and support these workloads forever. S3 is a true scale-out design in action.
S3 scales to both short-term and long-term workloads far, far greater than this. We have customers continuously performing thousands of requests per second against S3, all day every day. [...] We worked with other customers through our Premium Developer Support offerings to help them design a system that would scale basically indefinitely on S3. Today we're going to publish that guidance for everyone's benefit.
[emphasis mine]
You may want to read the entire post to gain more insight into the S3 architecture and resulting challenges for really massive workloads (i.e., as stressed by the S3 team, it won't apply at all for most use cases).
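One factor that is under your control is client-side parallelism: a single GET is bounded by per-connection throughput, so fetching many objects (or, for one large object, several byte ranges) concurrently usually raises aggregate throughput until your own bandwidth becomes the limit. A hedged sketch with the AWS SDK for Java (v1); the bucket, keys, local paths, and pool size are placeholders:

```java
import java.io.File;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;

public class ParallelDownloader {

    private static final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    public static void download(List<String> keys) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8); // placeholder pool size
        for (String key : keys) {
            pool.submit(() -> s3.getObject(
                    new GetObjectRequest("my-asset-bucket", key),  // placeholder bucket
                    new File("/tmp/" + key)));                     // placeholder local path
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```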

S3 to EC2 Performance for fetching large numbers of small files

I have a large collection of data chunks sized 1kB (in the order of several hundred million), and need a way to store and query these data chunks. The data chunks are added, but never deleted or updated. Our service is deployed on the S3, EC2 platform.
I know Amazon SimpleDB exists, but I want a solution that is platform agnostic (in case we need to move out of AWS for example).
So my question is: what are the pros and cons of these two options for storing and retrieving data chunks, and how would the performance compare?
Store the data chunks as files on S3 and GET them when needed
Store the data chunks on a MySQL Server cluster
Would there be that much of a performance difference?
I tried using S3 as a sort of "database" using tiny XML files to hold my structured data objects, and relying on the S3 "keys" to look up these objects.
The performance was unacceptable, even from EC2 - the latency to S3 is just too high.
Running MySQL on an EBS device will be an order of magnitude faster, even with so many records.
Do you need to provide access to these data chunks directly to the users of your application? If not, then S3 and HTTP GET requests are overkill. Bearing in mind also that S3 is a secured service (every request must be authenticated), the overhead of a GET request for just 1 KB of data will be considerable.
A MySQL server cluster would be a better idea, but to run it on EC2 you need to employ Elastic Block Store. Finally, do not rule out SimpleDB; it is perhaps the best solution for your problem. Design your system carefully and you will be able to easily migrate to other database systems (distributed or relational) in the future.