What are the pros and cons of using AWS S3 vs Cassandra as an image store? - amazon-s3

Which DB is better for storing images in a photo-sharing application?

We don't recommend storing images directly in Cassandra. Most companies (household names whose services you probably use) store images, videos, and other media in an object store such as AWS S3 or Google Cloud Storage.
Only the media metadata (S3 URL/URI, user info, media info, etc.) is stored in Cassandra for very fast retrieval.
The advantage of using Cassandra is that it can be deployed to a hybrid combination of public clouds so you're not tied to one vendor. Being able to distribute your Cassandra nodes across clouds means that you can get as close as possible to your users. Cheers!

AWS S3 is an object storage service that works very well for unstructured data. It offers virtually unlimited storage, with the size of a single object limited to 5 TB. S3 is suitable for storing large objects.
DynamoDB is a low-latency NoSQL database suited to semi-structured data. Typical DynamoDB use cases involve storing a large number of small records with millisecond latency; the DynamoDB record size limit is 400 KB.
For a photo-sharing application, you need both S3 and DynamoDB. S3 acts as the storage, and DynamoDB is your database, which lists all galleries, files, timestamps, captions, users, etc.
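A minimal sketch of this split, assuming a boto3-style S3 client is passed in (only its `put_object` call is used) and a plain dict stands in for the metadata table. The function name `store_photo`, the key layout, and the record fields are hypothetical, chosen only for illustration.

```python
import uuid
from datetime import datetime, timezone


def store_photo(s3_client, metadata_table, bucket, user_id, image_bytes, caption=""):
    """Upload the image bytes to S3 and record only the metadata in the database."""
    photo_id = str(uuid.uuid4())
    key = f"photos/{user_id}/{photo_id}.jpg"  # hypothetical key layout

    # The keyword arguments follow the boto3 S3 client's put_object call.
    s3_client.put_object(
        Bucket=bucket, Key=key, Body=image_bytes, ContentType="image/jpeg"
    )

    # The metadata record is small: it points at the object instead of containing it.
    record = {
        "photo_id": photo_id,
        "user_id": user_id,
        "s3_uri": f"s3://{bucket}/{key}",
        "caption": caption,
        "size_bytes": len(image_bytes),
        "uploaded_at": datetime.now(timezone.utc).isoformat(),
    }
    metadata_table[photo_id] = record  # stand-in for a DynamoDB or Cassandra write
    return record
```

Listing a gallery then only touches the metadata store; the image bytes are fetched from S3 by URI when they are actually displayed.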

You can store the photos in Amazon S3 and the photos' metadata in some other database.
Amazon S3 is also well suited to large objects.

Related

How do you full text search in an amazon s3 bucket?

What are the options for building a solution on the AWS native platform that can do full-text search across one or more Amazon S3 buckets?
We have a process that will store 100+ text files daily, ranging from 100 KB to 150 MB, which we need to retain for 1-2 years. We want to be able to run full-text searches over them.
The Amazon S3 management console does not search inside objects. It is purely filtering on the filename ("Key") of the objects in the bucket.
If you wish to search inside objects, then you will need to implement other services such as Amazon Kendra or Elasticsearch that will read and index objects.
Amazon S3 is a "Simple Storage Service". It provides highly scalable and reliable storage, but any higher-level functions such as search need to be implemented "on top" of S3. Just think of S3 as a huge, amazingly powerful hard disk that is connected to the Internet. (Sort of.)
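To illustrate what "reading and indexing objects" means, here is a toy inverted index in Python. It is only a sketch of the idea, not how Kendra or Elasticsearch work internally; the object keys and contents are made up and stand in for text that would be read from S3.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of object keys whose text contains it."""
    index = defaultdict(set)
    for key, text in documents.items():
        for word in tokenize(text):
            index[word].add(key)
    return index


def search(index, query):
    """Return the keys of objects that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical object keys and contents, standing in for files stored in S3.
documents = {
    "logs/2020-01-01.txt": "Payment failed for order 1234",
    "logs/2020-01-02.txt": "Payment succeeded for order 5678",
}
index = build_index(documents)
print(search(index, "payment failed"))  # {'logs/2020-01-01.txt'}
```

The index lives outside S3; S3 only stores the original files, and the search service returns keys that you then fetch from the bucket.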

Is there a benefit to using multiple buckets on Amazon S3 versus consolidating into a single bucket?

We currently serve up downloadable content (mp3, pdf, mp4, zip files, etc) in a single S3 bucket called media.domainname.com.
We have a separate bucket that stores all the various video encodings for our iOS app: app.domainname.com.
We're investigating moving all of our images to S3 as well in order to ease the server load and prep us for moving to a load balanced server setup.
That said, is it better/more efficient to move our images to a separate bucket i.e., images.domainname.com? Or is it a better practice to create an images subfolder in the media bucket, like media.domainname.com/images?
What are the pros/cons of either method?
The primary benefits of using separate buckets are that you can assign separate policies to each:
Reduced redundancy to save on costs.
Versioning of changed contents.
Automatic archival to Glacier
Separate permissions
The only downside that I can think of is that it means you'd have to manage all these things separately across multiple buckets.

S3 or EBS for storing data in flat files

I have flat files in which I store and retrieve data instead of using a database. This is temporary and may last for a couple of months. I was wondering whether I should be using EBS or S3. EBS is mainly used for I/O and S3 for content delivery, but S3 is a pay-as-you-go model, whereas with EBS you pay for the volume you provision?
Please advise: which one is better?
S3 sounds like it's more appropriate for your use case.
S3 is object storage. Think of it as an Amazon-run file server. (Objects are not exactly equal to files, but it's close enough here.) You tell S3 to put a file, it'll store it. You tell S3 to get a file, it'll return it. You tell S3 to delete it, it's gone. This is easy to work with and very scalable.
EBS is block storage. Think of it as an Amazon-run external hard drive. You can plug an EBS volume into an EC2 virtual machine, or you can access it over the Internet via AWS Storage Gateway. Like an external hard drive, you can only plug it into one computer at a time. The size is set up front, and while there are ways to grow and shrink it, you're paying for all the bits all the time. It's also much more complex than S3, since it has to provide strong consistency guarantees for the entire volume, instead of just on a file-by-file basis.
To build on the good answer from willglynn: if you are interacting with the data regularly, or need more file-system-like access, you might consider EBS more strongly.
If the amount of data is relatively small and you read and write to the data store regularly, you might consider something like ElastiCache for in-memory storage, which would likely perform better than S3 or EBS.
Similarly, you might look at DynamoDB for document-type storage, especially if you need to be able to search/filter across your data objects.
Point 1) You can use either S3 or EBS for this. If you want lower latency and the files are larger, then EBS is the better option.
Point 2) If you want lower costs, then S3 is a better option.
From what you describe, S3 will be the most cost-effective and likely easiest solution.
Pros to S3:
1. You can access the data from anywhere. You don't need to spin up an EC2 instance.
2. Crazy data durability numbers.
3. Nice versioning story around buckets.
4. Cheaper than EBS
Pros to EBS
1. Handy to have the data on a file system in EC2. That lets you do normal processing with the Unix pipeline.
2. Random Access patterns work as you would expect.
3. It's a drive. Everyone knows how to deal with files on drives.
If you want to get away from a flat file, DynamoDB provides a nice set of interfaces for putting lots and lots of rows into a table, then running operations against those rows.

What are the data size limitations when using the GET,PUT methods to get and store objects in an Amazon S3 cloud?

What is the maximum size of data that can be sent using the GET and PUT methods to store and retrieve data from Amazon S3? I would also like to know where I can learn more about the APIs available for storage in Amazon S3, other than the documentation that is already provided.
The PUT method is addressed in the respective Amazon S3 FAQ How much data can I store?:
The total volume of data and number of objects you can store are
unlimited. Individual Amazon S3 objects can range in size from 1 byte
to 5 terabytes. The largest object that can be uploaded in a single
PUT is 5 gigabytes. For objects larger than 100 megabytes, customers
should consider using the Multipart Upload capability. [emphasis mine]
As mentioned, Uploading Objects Using Multipart Upload API is recommended for objects larger than 100MB already, and required for objects larger than 5GB.
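The thresholds from the FAQ can be turned into a small decision helper. This is a sketch only: the function name `choose_upload_method` is hypothetical, and the limits are written in decimal units as an assumption, since the FAQ text quoted above does not say whether it means decimal or binary units.

```python
# Decimal units are an assumption; the quoted FAQ only says MB, GB, and TB.
MB = 1000 ** 2
GB = 1000 ** 3
TB = 1000 ** 4

MAX_OBJECT_SIZE = 5 * TB          # largest object S3 will store
MAX_SINGLE_PUT = 5 * GB           # largest object that fits in one PUT
MULTIPART_RECOMMENDED = 100 * MB  # above this, multipart upload is recommended


def choose_upload_method(size_bytes):
    """Pick an upload strategy for an object of the given size."""
    if size_bytes > MAX_OBJECT_SIZE:
        raise ValueError("object exceeds the 5 TB S3 object size limit")
    if size_bytes > MAX_SINGLE_PUT:
        return "multipart (required)"
    if size_bytes > MULTIPART_RECOMMENDED:
        return "multipart (recommended)"
    return "single PUT"


print(choose_upload_method(10 * MB))   # single PUT
print(choose_upload_method(200 * MB))  # multipart (recommended)
print(choose_upload_method(6 * GB))    # multipart (required)
```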
The GET method is essentially unlimited. Please note that S3 supports the BitTorrent protocol out of the box, which (depending on your use case) might ease working with large files considerably, see Using BitTorrent with Amazon S3:
Amazon S3 supports the BitTorrent protocol so that developers can save
costs when distributing content at high scale. [...]

Should I persist images on EBS or S3?

I am migrating my Java, Tomcat, MySQL server to AWS EC2.
I have already attached an EBS volume for storing MySQL data. In my web application people may upload images, so I need to persist them. There are two alternatives in my mind:
Save uploaded images to EBS volume.
Use the S3 service.
The following are my notes. Please be skeptical about them, as my expertise is in software development, not servers.
EBS plus: S3 storage is more expensive ($0.15/GB > $0.10/GB).
S3 plus: Serving static files from EBS may negatively affect my web server's performance. Is this true? Does serving images affect server performance notably? With S3, my server will not be responsible for serving static files.
S3 plus: Serving static files from EBS may incur I/O costs, though they will probably be minor.
EBS plus: People say EBS is faster.
S3 plus: People say S3 is safer for persistence.
EBS plus: No need to learn an API; it is straightforward to save the images to an EBS volume.
In short, I cannot decide and would be happy for any guidance.
Thanks
The price comparison is not quite right:
S3 charges are $0.14 per GB USED, whereas EBS charges are $0.10 per GB PROVISIONED (the size of your EBS volume), whether you use it or not. As a result, S3 may or may not be cheaper than EBS.
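A quick worked comparison, using the two prices quoted in this answer and assuming both are per-GB monthly storage rates. Request, I/O, and transfer charges are ignored, and the function names are hypothetical.

```python
S3_PRICE_PER_GB = 0.14   # per GB actually stored (figure quoted in this answer)
EBS_PRICE_PER_GB = 0.10  # per GB provisioned (figure quoted in this answer)


def monthly_cost(used_gb, provisioned_gb):
    """Return (s3_cost, ebs_cost) for the same data set."""
    s3_cost = used_gb * S3_PRICE_PER_GB
    ebs_cost = provisioned_gb * EBS_PRICE_PER_GB
    return s3_cost, ebs_cost


def break_even_used_gb(provisioned_gb):
    """Amount of stored data at which S3 costs the same as the EBS volume."""
    return provisioned_gb * EBS_PRICE_PER_GB / S3_PRICE_PER_GB


# Example: a 100 GB EBS volume that holds only 50 GB of images.
s3_cost, ebs_cost = monthly_cost(used_gb=50, provisioned_gb=100)
print(f"S3: ${s3_cost:.2f}, EBS: ${ebs_cost:.2f}")          # S3: $7.00, EBS: $10.00
print(f"break-even: {break_even_used_gb(100):.1f} GB used")  # break-even: 71.4 GB used
```

With these figures, S3 stays cheaper until the volume is roughly 71% full; beyond that point EBS is cheaper on storage price alone.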
I'm currently using S3 for a project and it's working extremely well.
EBS means you need to manage a volume + machines to attach it to. You need to add space as it's filling up and perform backups (not saying you shouldn't back up your S3 data, just that it's not as critical).
It also makes it harder to scale: when you want to add additional machines, you either need to pull off the images to a separate machine or clone the images across all. This also means you're adding a bottleneck: you'll have to manage your own upload process that will either upload to all machines or have a single machine managing it.
I recommend S3: it's set and forget. Any number of machines can be performing uploads in parallel and you don't really need to notify other machines about the upload.
In addition, you can use Amazon Cloudfront as a cheap CDN in front of the images instead of directly downloading from S3.
I have architected solutions on AWS for stock photography sites that store millions of images spanning terabytes of data. I would like to share some AWS best practices for your requirement:
P1) Store the Original Image file in S3 Standard option
P2) Store the reproducible images like thumbs etc in the S3 Reduced Redundancy option (RRS) to save costs
P3) Metadata about the images, including the S3 URL, can be stored in Amazon RDS or Amazon DynamoDB, depending on the query complexity. Query the entries from Amazon RDS. If your queries are complex, it is also common practice to store the metadata in Amazon CloudSearch or Apache Solr.
P4) Deliver your thumbs to users with low latency using Amazon CloudFront.
P5) Queue your image conversions through either SQS or RabbitMQ on Amazon EC2.
P6) If you are planning to use EBS, note that EBS volumes do not scale with your EC2 instances. Ideally, you can use GlusterFS as a common storage pool for all your images. Multiple Amazon EC2 instances in Auto Scaled mode can still connect to it and read/write images.
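Points P1 and P2 can be sketched as a helper that builds the arguments for an S3 upload. The function name `upload_params` and the key layout are hypothetical; the `StorageClass` values `STANDARD` and `REDUCED_REDUNDANCY` are the ones the S3 API accepts, and the resulting dict is meant to be passed to a boto3-style `put_object(**params)` call.

```python
def upload_params(bucket, image_id, variant, body):
    """Build keyword arguments for an S3 put_object call.

    Originals go to the Standard storage class. Derived variants such as
    thumbnails can be regenerated, so they go to Reduced Redundancy.
    """
    if variant == "original":
        key = f"originals/{image_id}.jpg"          # hypothetical key layout
        storage_class = "STANDARD"
    else:
        key = f"derived/{variant}/{image_id}.jpg"  # hypothetical key layout
        storage_class = "REDUCED_REDUNDANCY"
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "StorageClass": storage_class,
    }


params = upload_params("stock-photos", "img-001", "thumb", b"...")
print(params["Key"], params["StorageClass"])
# derived/thumb/img-001.jpg REDUCED_REDUNDANCY
```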
You already outlined the advantages and disadvantages of both.
If you are planning to store terabytes of images, with storage requirements increasing day after day, S3 will probably be your best bet as it is built especially for these kinds of situations. You get unlimited storage space, without having to worry about sharding your data over many EBS volumes.
The recurring cost of S3 is about 50% higher than that of EBS. You will also have to learn the API and implement it in your application, but that is a one-off expense which I think you should be able to absorb very quickly.
Do you expect the images to last indefinitely?
The Amazon EBS FAQ is pretty clear; the annual failure rate is not "essentially zero"; they quote 0.1% to 0.5%. It's better than the disk under your desk, but it would need some kind of backup.
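To see what that failure rate means over several years, here is a rough calculation. It assumes each year is an independent trial with a constant failure rate, which is a simplification, and the function name is hypothetical.

```python
def failure_probability(annual_failure_rate, years):
    """Chance of at least one volume failure over the given number of years."""
    return 1 - (1 - annual_failure_rate) ** years


# The two ends of the range quoted from the EBS FAQ, over five years.
for rate in (0.001, 0.005):
    p = failure_probability(rate, years=5)
    print(f"AFR {rate:.1%}: {p:.2%} chance of failure within 5 years")
```

Under these assumptions a single volume has roughly a 0.5% to 2.5% chance of failing within five years, which is why a backup is still needed.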