S3 redundancy, 3+ availability zone guarantee

According to the S3 FAQ:
"Amazon S3 Standard, S3 Standard-Infrequent Access, and S3 Glacier
storage classes replicate data across a minimum of three AZs to
protect against the loss of one entire AZ. This remains true in
Regions where fewer than three AZs are publicly available."
I'm not clear on what this means. Suppose you store your data in a region with fewer than three AZs that are "publicly available." Does that mean that Amazon will store your data in an AZ within that region that is not publicly available, if necessary? Or that it will store your data in an AZ in another region to make up the difference?

S3 will store your data in an AZ that is not publicly available. The same is also true for DynamoDB, and possibly other services as well.
Source:
I want to say I heard it at a re:Invent session. I’ll try to find a link to some documentation.

This means that even if you store your data in a Region where fewer than three AZs are publicly available, Amazon S3 still replicates it across a total of at least three AZs (including both public and non-public ones).

Related

What are pros and cons of using AWS S3 vs Cassandra as an image store?

Which DB is better for storing images in a photo-sharing application?
We don't recommend storing images directly in Cassandra. Most companies (household names you'd know very well and are likely using) store images/videos/media on an object store like AWS S3 or Google Cloud Storage.
Only the metadata of the media are stored in Cassandra for very fast retrieval -- S3 URL/URI, user info, media info, etc.
The advantage of using Cassandra is that it can be deployed to a hybrid combination of public clouds so you're not tied to one vendor. Being able to distribute your Cassandra nodes across clouds means that you can get as close as possible to your users. Cheers!
AWS S3 is an object storage service that works very well for unstructured data. It offers effectively unlimited storage, with individual objects limited to 5 TB, and it is well suited to storing large objects.
DynamoDB is a NoSQL, low-latency database suitable for semi-structured data. DynamoDB use cases usually involve storing a large number of small records with millisecond latency; the DynamoDB record size limit is 400 KB.
For a photo-sharing application, you need both S3 and DynamoDB. S3 acts as the storage, and DynamoDB is your database, which lists all galleries, files, timestamps, captions, users, etc. (see the sketch below).
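As a rough illustration of that split (a sketch only; the bucket name, table name, and attribute names below are made up, not taken from any particular application), the upload path could look like this with boto3:

```python
# Hypothetical sketch of the S3-for-bytes / DynamoDB-for-metadata split.
# Bucket name, table name, and attribute names are illustrative only.
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
photos_table = boto3.resource("dynamodb").Table("photos")  # hypothetical table


def upload_photo(user_id: str, caption: str, local_path: str) -> str:
    photo_id = str(uuid.uuid4())
    key = f"{user_id}/{photo_id}.jpg"

    # The large object goes to S3 (objects can be up to 5 TB).
    s3.upload_file(local_path, "photo-sharing-media", key)  # hypothetical bucket

    # Only small metadata (well under DynamoDB's 400 KB item limit) goes to DynamoDB.
    photos_table.put_item(Item={
        "photo_id": photo_id,
        "user_id": user_id,
        "caption": caption,
        "s3_key": key,
        "uploaded_at": datetime.now(timezone.utc).isoformat(),
    })
    return photo_id
```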
You can store photos in Amazon S3 and the photos' metadata in some other database.
Amazon S3 is well suited to objects of large size as well.

Replication between two storage accounts in different regions, must be read/writeable and zone redundant

We are setting up an active/active configuration using either front door or traffic manager as our front end. Our services are located in both Central and East US 2 paired regions. There is an AKS cluster in each region. The AKS clusters will write data to a storage account located in their region. However, the files in the storage accounts must be the same in each region. The storage accounts must be zone redundant and read/writeable in each region at all times, thus none of the Microsoft replication strategies work. This replication must be automatic, we can't have any manual process to do the copy. I looked at Data Factory but it seems to be regional, so I don't think that would work, but it's a possibility....maybe. Does anyone have any suggestions on the best way to accomplish this task?
I have tested this in my environment.
Replication between two storage accounts can be implemented using a Logic App.
In the Logic App, we can create two workflows: one for replicating data from storage account 1 to storage account 2, and the other for replicating data from storage account 2 to storage account 1.
I have tried this to replicate blob data between storage accounts in different regions.
The workflow is: when a blob is added or modified in storage account 1, the blob is copied to storage account 2.
Trigger: When a blob is added or modified (properties only) (V2) (use the connection for storage account 1)
Action: Copy blob (V2) (use the connection for storage account 2)
In a similar way, we can create another workflow for replicating data from storage account 2 to storage account 1.
Now the data will be replicated between the two storage accounts.
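If you prefer code over a Logic App, the same per-blob copy step can also be sketched with the Azure Blob SDK for Python (this is only an illustration of the copy action, not the Logic App itself; the connection strings and names below are placeholders):

```python
# Illustrative only: copy one blob from storage account 1 to storage account 2
# using azure-storage-blob. Connection strings and names are placeholders.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobSasPermissions, BlobServiceClient, generate_blob_sas

src = BlobServiceClient.from_connection_string("<storage-account-1-connection-string>")
dst = BlobServiceClient.from_connection_string("<storage-account-2-connection-string>")


def replicate_blob(container: str, blob_name: str) -> None:
    # Short-lived read SAS so the destination account can pull the source blob.
    sas = generate_blob_sas(
        account_name=src.account_name,
        container_name=container,
        blob_name=blob_name,
        account_key=src.credential.account_key,
        permission=BlobSasPermissions(read=True),
        expiry=datetime.now(timezone.utc) + timedelta(hours=1),
    )
    source_url = f"{src.get_blob_client(container, blob_name).url}?{sas}"
    # Server-side copy into the same container name on the destination account.
    dst.get_blob_client(container, blob_name).start_copy_from_url(source_url)
```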

AWS S3 alternatives for private cloud

Right now we have a requirement to migrate from AWS to a private data center. We need to find a potential alternative storage option to replace AWS S3.
Currently S3 is used in the following way:
Overall storage size is 10TB;
Min/Avg/Max object size is 0.5/2/100 MB;
We have N app instances that simultaneously write/read objects, at approximately 50 writes/sec and 30 reads/sec;
This storage should be redundant (Highly Available), Fault Tolerant, Scalable;
A naive implementation could be to store this data on:
Simple NFS storage with some replication functionality added;
A NoSQL DB (for example, Cassandra). However, Cassandra would require a number of instances to support this storage (it's not recommended to store more than 1 TB on one Cassandra node; see Cassandra capacity planning).
What solution would you recommend for such a scenario?
Using MinIO is your best bet if you want private cloud storage. It is AWS S3 compatible, meaning that applications using AWS S3 can be migrated to MinIO seamlessly. They have a tutorial on how to connect the MinIO server with the AWS CLI. You can test it against the publicly hosted MinIO server at https://play.min.io:9000. Please refer to AWS CLI with MinIO Server.
You can have a highly available storage system using a MinIO distributed setup. Beware that dynamic expansion is not a feature of the MinIO distributed setup: if you want to expand your cluster, you end up spinning up a new cluster with your desired number of servers/disks and then migrating your data from the old one to the new one.
I find it much easier to use than HDFS. In addition, a lot of technologies outside the Hadoop ecosystem lack HDFS integration. For example, Docker Registry lacks a built-in HDFS storage driver, but it has an S3 driver, so you can use MinIO as its object storage.
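Because MinIO speaks the S3 API, existing boto3 code usually only needs its endpoint pointed at the MinIO server. A minimal sketch (the endpoint URL, credentials, and bucket name below are placeholders for your own deployment):

```python
# Point standard boto3 S3 calls at a MinIO endpoint instead of AWS.
# Endpoint URL, credentials, and bucket name are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.example.internal:9000",
    aws_access_key_id="MINIO_ACCESS_KEY",
    aws_secret_access_key="MINIO_SECRET_KEY",
)

s3.create_bucket(Bucket="app-objects")
s3.upload_file("report.pdf", "app-objects", "reports/report.pdf")
print(s3.list_objects_v2(Bucket="app-objects")["KeyCount"])
```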
There are a bunch of options for an S3-compatible private cloud service. If you like open-source solutions, OpenStack and Cassandra (mentioned in the other answers) are good ones. Note that, no matter what you use, you will probably end up setting up a cluster with multiple nodes; that is the unavoidable price of redundancy and availability. There are some good commercial and economical products as well, such as the one from Cloudian.
If you need an object store, I could recommend Elliptics (documentation available in English).
As far as I know, it doesn't have limits on disk storage.
In the case of Cassandra, we are using SSD disks (for better performance) of 200-500 GB each. The ring size would depend on your requirements (read/write latency, replication factor, time to live).
50 writes/sec, 30 reads/sec
This is really quite easy for Cassandra, comparing it with our own setup.
In that case it depends more on the time to live of your objects.
Generally, for a distributed setup you could also look at GlusterFS.
You can use OpenStack Swift
Swift is a highly available, distributed, eventually consistent object/blob store. Organizations can use Swift to store lots of data efficiently, safely, and cheaply.
Learn more at: https://docs.openstack.org/swift/latest/
And https://oldhenhut.com/2016/05/31/s3-vs-swift/
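For completeness, here is a minimal sketch with python-swiftclient (the auth URL, credentials, and names are placeholders, and real deployments typically authenticate through Keystone rather than the v1 auth shown here):

```python
# Minimal python-swiftclient sketch; auth URL, user, key, and names are placeholders.
import swiftclient

conn = swiftclient.Connection(
    authurl="https://swift.example.internal/auth/v1.0",
    user="account:user",
    key="secret",
)

conn.put_container("images")
with open("photo.jpg", "rb") as f:
    conn.put_object("images", "photo.jpg", contents=f, content_type="image/jpeg")

headers, body = conn.get_object("images", "photo.jpg")
print(headers["content-type"], len(body))
```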

what is better Amazon EBS or S3 for streaming and uploading video

What is better to use for a subscription-based streaming channel like Netflix: EC2 instances with EBS, or Amazon S3?
150GB upload per month, 250GB streaming per month, no peak time, with viewers based around Australia, India, North America, Europe, Brazil
and 80TB of storage that needs to migrate to the cloud?
For scalability and worldwide presence, the definite answer (using only AWS services) is:
Store videos on Amazon S3
Serve videos through Amazon CloudFront
Amazon CloudFront has presence in 70+ locations around the world and will handle the video streaming protocols for you. Mark content as private and have your application determine whether users are entitled to view videos. You can then generate pre-signed URLs that permit access to a given video for a limited period of time. See: Serving Private Content through CloudFront
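The pre-signed URL step can be sketched with botocore's CloudFrontSigner (an illustration only; the key pair ID, private key path, and distribution domain below are placeholders):

```python
# Sketch of generating a CloudFront signed URL; the key pair ID, key file path,
# and distribution domain are placeholders, not real values.
from datetime import datetime, timedelta

from botocore.signers import CloudFrontSigner
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding


def rsa_signer(message: bytes) -> bytes:
    # Sign the CloudFront policy with the private key matching the public key
    # registered with the distribution (SHA-1/PKCS1v15 is what CloudFront expects).
    with open("cloudfront_private_key.pem", "rb") as f:  # placeholder path
        private_key = serialization.load_pem_private_key(f.read(), password=None)
    return private_key.sign(message, padding.PKCS1v15(), hashes.SHA1())


signer = CloudFrontSigner("K2XXXXXXXXXXXX", rsa_signer)  # placeholder key pair ID

# The URL is valid for one hour; afterwards CloudFront rejects the request.
signed_url = signer.generate_presigned_url(
    "https://dxxxxxxxxxxxx.cloudfront.net/videos/episode-01.mp4",  # placeholder
    date_less_than=datetime.utcnow() + timedelta(hours=1),
)
print(signed_url)
```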
In comparison, using Amazon EC2 + Amazon EBS is a poor choice because:
You would need to scale-out additional instances based upon your load
You would need to run instances in multiple regions to be closer to your users (hence lower latency)
You would need to replicate all videos to every server rather than storing a single copy of each video
Please note that your largest cost will be Data Transfer (see Amazon CloudFront Pricing). Your quoted figure of "250GB streaming per month" seems extremely low -- my family alone uses that much bandwidth per month!

S3: Duplicate buckets

What is the easiest way to duplicate an entire Amazon S3 bucket to a bucket in a different account?
Ideally, we'd like to duplicate the bucket nightly to a different account in Amazon's European data center for backup purposes.
One thing to consider is that you might want to have whatever is doing this running in an Amazon EC2 VM. If you have your backup running outside of Amazon's cloud then you pay for the data transfer both ways. If you run in an EC2 VM, you pay no bandwidth fees (although I'm not sure if this is true when going between the North American and European stores) - only for the wall time that the EC2 instance is running (and whatever it costs to store the EC2 VM, which should be minimal I think).
Cool, I may look into writing a script to host on EC2. The main purpose of the backup is to guard against human error on our side -- if a user accidentally deletes a bucket or something like that.
If you're worried about deletion, you should probably look at S3's new Versioning feature.
I suspect there is no "automatic" way to do this. You'll just have to write a simple app that moves the files over. Depending on how you track the files in S3 you could move just the "changes" as well.
On a related note, I'm pretty sure Amazon does a darn good job of backing up the data, so I don't think you necessarily need to worry about data loss, unless your backup is for archival purposes or you want to safeguard against accidentally deleting files.
You can make an application or service that is responsible for creating two instances of AmazonS3Client, one for the source and the other for the destination; the source AmazonS3Client loops over the source bucket and streams objects in, while the destination AmazonS3Client streams them out to the destination bucket.
Note: this doesn't work for cross-account syncing, but this works for cross-region on the same account.
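A rough boto3 sketch of that two-client pattern (bucket names and regions are placeholders; per the note above, this keeps both buckets in the same account but in different regions):

```python
# Rough sketch of the two-client pattern: list the source bucket and stream each
# object into the destination bucket. Bucket names and regions are placeholders;
# as noted above, both buckets stay in the same account, just different regions.
import boto3

source = boto3.client("s3", region_name="us-east-1")
dest = boto3.client("s3", region_name="eu-west-1")

SRC_BUCKET = "my-production-bucket"  # placeholder
DST_BUCKET = "my-backup-bucket-eu"   # placeholder

paginator = source.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC_BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # Stream the body from the source and upload it to the destination
        # without writing anything to local disk.
        body = source.get_object(Bucket=SRC_BUCKET, Key=key)["Body"]
        dest.upload_fileobj(body, DST_BUCKET, key)
        print("copied", key)
```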
For simply copying everything from one bucket to another, you can use the AWS CLI (https://aws.amazon.com/premiumsupport/knowledge-center/move-objects-s3-bucket/): aws s3 sync s3://SOURCE_BUCKET_NAME s3://NEW_BUCKET_NAME
In your case, you'll need the --source-region flag: https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
If you are moving an enormous amount of data, you can optimize how quickly it happens by finding ways to split the transfers into different groups: https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/
There are a variety of ways to run this nightly. One example is the AWS Instance Scheduler (personally unverified): https://docs.aws.amazon.com/solutions/latest/instance-scheduler/appendix-a.html