AWS S3 alternatives for private cloud - amazon-s3

Right now we have a requirement to migrate from AWS to private Data Center. We need to find out potential alternative storage instead of AWS S3.
Currently S3 is used in the following way:
Overall storage size is 10TB;
Min/Avg/Max object size is 0.5/2/100 Mb;
We have N App instances that simultaneously writes/reads
objects approximately 50 writes/sec, 30 reads/sec;
This storage should be redundant (Highly Available), Fault Tolerant, Scalable;
The naive implementation could be store this data on:
Simple NFS storage and add some replication functionality;
Just store mentioned objects in NoSQL DB (as example in Cassandra). However Cassandra will require a number of instances to support this storage (It's nor recommended to store > 1TB pn 1 Cassandra node Cassandra capacity planning)
What solution would you recommend for such scenario ?

Using MinIO is your best bet if you want to have a private cloud storage. It is AWS S3 compatible meaning that applications use AWS S3 can be migrated to MinIO seamlessly. They have a tutorial how to connect MinIO server with AWS CLI. You can test it against the public hosted MinIO server https://play.min.io:9000. Please refer to AWS CLI with MinIO Server.
You can have highly available storage system using MinIO distributed setup. Beware that the dynamic expansion is not a feature of MinIO distributed setup. If you want to expand your cluster you end up spinning a new cluster with your desired number of servers/disks and then you have to migrate your data from old one to new one.
I find it much more easier to use than HDFS. In addition to this, there are a lot of technologies outside Hadoop ecosystem lack HDFS integration. For example, Docker Registry lacks built in HDFS storage driver. However, it has a S3 driver so you can use MinIO as it's object storage.

There're a bunch of options as of S3-compatible private cloud service. if you like open source solutions, the above open stack and Cassandra are good ones. Note that usually no matter what you use, probably you end up setting up a cloud with multiple nodes and this is inevitable to exchange for redundancy and availability. There're some good commercial and economic products as well such as the one from Cloudian

If you need object store I could recommend elliptics (in english).
As I know, it doesn't has limits on disk store.
In case for Cassandra we are using SSD disks (for better performance) < 200-500 Gb. Ring size would be depend from your requirements (read/write latency, replication rate, time to life).
50 writes/sec, 30 reads/sec
This is really quite easy for Cassandra, as I can compare with our setup.
In that case it more depends from time to life for your objects.
Generally, in case for distributed network you also could look at GlusterFS.

You can use OpenStack Swift
Swift is a highly available, distributed, eventually consistent object/blob store. Organizations can use Swift to store lots of data efficiently, safely, and cheaply.
Learn More on : https://docs.openstack.org/swift/latest/
And https://oldhenhut.com/2016/05/31/s3-vs-swift/

Related

Is there any EMRFS SDK to access S3 from EC2 Machine?

I've read this and I conclude that EMRFS is only available if I am using AWS EMR machine.
I am asking this because I am interested with EMRFS's read-after-write-consistency for s3.
I just would like to put new input to this question: there is an in-progress community effort that provides a consistent S3 model in Hadoop: S3Guard: Improved Consistency for S3A.
As the description mentioned:
This issue proposes S3Guard, a new feature of S3A, to provide an option for a stronger consistency model than what is currently offered. The solution coordinates with a strongly consistent external store to resolve inconsistencies caused by the S3 eventual consistency model.
For more information, please refer to the design doc.
This will be part of Hadoop distribution in the next release probably Hadoop 3.0.
UPDATE: Steve just kindly backport that to Hadoop 2.9.
It would take a bit more manual configuration, but you can get a similar setup as EMRFS + EMR's consistent view by using the existing open-source NativeS3FileSystem alongside Netflix's s3mper, which uses the same DynamoDB-backed configuration as EMRFS.
If you are looking just for read after write consistency then you can just use S3 as-is (all regions support read after write consistency) with EMR. The catch is for US-Standard buckets just set fs.s3n.endpoint=s3-external-1.amazonaws.com in EMR and use that same endpoint on all non-EMR applications.

Amazon S3 vs Window Azure Blob Storage

I went through the docs of both Azure and Amazon S3, but I confused about few things -
Both of these try to solve the same question i.e, storage on the cloud. Now My Question here is that when to use what? i.e., when is it preferred to use Azure and when Amazon S3 is preferred. I googled about it hard and couldn't find any substantial resource for the same. I would really appreciate if some one could enlighten me regarding the same.
EDIT:
I want to consider following params as the base for choosing my cloud provider -
1) Latency
2) Scalability
3) Size of each file
4) Cost & Performance
5) Files which are can be accessed quite randomly.
These are few params I have considered. It would be great if you can provide additional Params to consider.
There are many studies online. You should evaluate it by yourself based on your workload and scenario.
One of many reports, says that Azure is good at small files: http://www.nasuni.com/resource/the-state-of-cloud-storage-in-2013
Under my understanding, this is because Azure blob is designed to be an unified storage, so it optimize for small block access. Conversely, S3 is originate from web storage.
On the other hand, S3 is good at scale up, since Azure has limitation per account.

AWS s3 standard vs. reduced redundancy storage?

Does anyone know of any real-world analysis on data loss using these two AWS s3 storage options? I know from the AWS docs (via Quora) that one is 99.9999999% guarenteed and the other is only 99.99% gaurenteed but I'm looking for data from a non-AWS source.
Anecdotes or something more thorough would both be great. I apologize if this isn't the right SE site for this question. Feel free to suggest a place to migrate it.
I guess it depends on the data you're storing if you really need 99.999999999% level of durability …
If you keep copies of your data locally and are just using S3 as a convenient place to store data that is actively being accessed by services within the AWS infrastructure, RRS might be the right choice for you :)
In my case, I keep fresh files on the normal durability level till I created a local backup and then move them to RRS, which saves you quite a bit a money.

Planning the development of a scalable web application

We have created a product that potentially will generate tons of requests for a data file that resides on our server. Currently we have a shared hosting server that runs a PHP script to query the DB and generate the data file for each user request. This is not efficient and has not been a problem so far but we want to move to a more scalable system so we're looking in to EC2. Our main concerns are being able to handle high amounts of traffic when they occur, and to provide low latency to users downloading the data files.
I'm not 100% sure on how this is all going to work yet but this is the idea:
We use an EC2 instance to host our admin panel and to generate the files that are being served to app users. When any admin makes a change that affects these data files (which are downloaded by users), we make a copy over to S3 using CloudFront. The idea here is to get data cached and waiting on S3 so we can keep our compute times low, and to use CloudFront to get low latency for all users requesting the files.
I am still learning the system and wanted to know if anyone had any feedback on this idea or insight in to how it all might work. I'm also curious about the purpose of projects like Cassandra. My understanding is that simply putting our application on EC2 servers makes it scalable by the nature of the servers. Is Cassandra just about keeping resource usage low, or is there a reason to use a system like this even when on EC2?
CloudFront: http://aws.amazon.com/cloudfront/
EC2: http://aws.amazon.com/cloudfront/
Cassandra: http://cassandra.apache.org/
Cassandra is a non-relational database engine and if this is what you need, you should first evaluate Amazon's SimpleDB : a non-relational database engine built on top of S3.
If the file only needs to be updated based on time (daily, hourly, ...) then this seems like a reasonable solution. But you may consider placing a load balancer in front of 2 EC2 images, each running a copy of your application. This would make it easier to scale later and safer if one instance fails.
Some other services you should read up on:
http://aws.amazon.com/elasticloadbalancing/ -- Amazons load balancer solution.
http://aws.amazon.com/sqs/ -- Used to pass messages between systems, in your DA (distributed architecture). For example if you wanted the systems that create the data file to be different than the ones hosting the site.
http://aws.amazon.com/autoscaling/ -- Allows you to adjust the number of instances online based on traffic
Make sure to have a good backup process with EC2, snapshot your OS drive often and place any volatile data (e.g. a database files) on an EBS block. EC2 doesn't fail often but when it does you don't have access to the hardware, and if you have an up to date snapshot you can just kick a new instance online.
Depending on the datasets, Cassandra can also significantly improve response times for queries.
There is an excellent explanation of the data structure used in NoSQL solutions that may help you see if this is an appropriate solution to help:
WTF is a Super Column

Should I persist images on EBS or S3?

I am migrating my Java,Tomcat, Mysql server to AWS EC2.
I have already attached EBS volume for storing MySql data. In my web application people may upload images. So I should persist them. There are 2 alternatives in my mind:
Save uploaded images to EBS volume.
Use the S3 service.
The followings are my notes, please be skeptic about them, as my expertise is not on servers, but software development.
EBS plus: S3 storage is more expensive. (0.15 $/Gb > 0.1$/Gb)
S3 plus: Serving statics from EBS may influence my web server's performance negatively. Is this true? Does Serving images affect server performance notably? For S3 my server will not be responsible for serving statics.
S3 plus: Serving statics from EBS may result I/O cost, probably it will be minor.
EBS plus: People say EBS is faster.
S3 plus: People say S3 is more safe for persistence.
EBS plus: No need to learn API, it is straight forward to save the images to EBS volume.
Namely I can not decide, will be happy if you guide.
Thanks
The price comparison is not quite right:
S3 charges are $0.14 per GB USED, whereas EBS charges are $0.10 per GB PROVISIONED (the size of your EBS volume), whether you use it or not. As a result, S3 may or may not be cheaper than EBS.
I'm currently using S3 for a project and it's working extremely well.
EBS means you need to manage a volume + machines to attach it to. You need to add space as it's filling up and perform backups (not saying you shouldn't back up your S3 data, just that it's not as critical).
It also makes it harder to scale: when you want to add additional machines, you either need to pull off the images to a separate machine or clone the images across all. This also means you're adding a bottleneck: you'll have to manage your own upload process that will either upload to all machines or have a single machine managing it.
I recommend S3: it's set and forget. Any number of machines can be performing uploads in parallel and you don't really need to notify other machines about the upload.
In addition, you can use Amazon Cloudfront as a cheap CDN in front of the images instead of directly downloading from S3.
I have architected solutions on AWS for Stock photography sites which stores millions of images spanning TB's of data, I would like to share some of the best practice in AWS for your requirement:
P1) Store the Original Image file in S3 Standard option
P2) Store the reproducible images like thumbs etc in the S3 Reduced Redundancy option (RRS) to save costs
P3) Meta data about images including the S3 URL can be stored in Amazon RDS or Amazon DynamoDB depending upon the query complexity. Query the entries from Amazon RDS. If your query is complex it is also common practice to Store the meta data in Amazon CloudSearch or Apache Solr.
P4) Deliver your thumbs to users with low latency using Amazon CloudFront.
P5) Queue your image conversion either thru SQS or RabbitMQ on Amazon EC2
P6) If you are planning to use EBS, then they are not scalable with your EC2. So ideally you can use GlusterFS as your common storage pool for all your images. Multiple Amazon EC2 in Auto Scaled mode can still connect to it and access/write images.
You already outlined the advantages and disadvantages of both.
If you are planning to store terabytes of images, with storage requirements increasing day after day, S3 will probably be your best bet as it is built especially for these kinds of situations. You get unlimited storage space, without having to worry about sharding your data over many EBS volumes.
The recurrent cost of S3 is that it comes 50% more expensive than EBS. You will also have to learn the API and implement it in your application, but that is a one-off expense which I think you should be able to absorb very quickly.
Do you expect the images to last indefinitely?
The Amazon EBS FAQ is pretty clear; the annual failure rate is not "essentially zero"; they quote 0.1% to 0.5%. It's better than the disk under your desk, but it would need some kind of backup.