What are some use cases for object storage? - amazon-s3

What are some use cases for object storage, as opposed to file systems or block storage (database) systems?
From what I understand, object storage is mostly used as persistent storage for applications running on cloud systems. It seems to overlap a lot with file systems, except that the details of how the objects are stored are abstracted away so that apps can access them with simple web queries.
However, I'd love it if someone could give examples of applications where this is actually used instead of or alongside the other two storage systems.

Some example use cases for object storage:
Off-site backups
Storing and serving user content (e.g. profile pictures)
Storing artifacts (e.g. JAR files, startup scripts) to be deployed to VMs
Distributing static content (e.g. video content for your users)
Caching intermediate data (e.g. individual frames from a render farm before assembly into output video)
Accepting input to or providing output from a web service (as accepting data by POST can be difficult/inefficient for large input files)
Archiving data for regulatory purposes
All these cases might be accompanied by a database to store metadata (i.e. to find the objects). Actually storing the data in the database would, however, exceed size limits or significantly harm database performance.
These use cases can be achieved with a file system, so long as your total usage can be handled by a single machine. If you have more traffic than that, you will need replicated storage, load balancing, etc., at which point you are effectively implementing an object storage system yourself.
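To make the "objects plus a metadata database" pattern concrete, here is a minimal sketch. The InMemoryObjectStore class, the uploads table and the content-hash key scheme are all assumptions made for illustration; a real deployment would use S3 (or similar) and whatever database you already run.

```python
import hashlib
import sqlite3


class InMemoryObjectStore:
    """Hypothetical stand-in for an object store: opaque bytes addressed by key."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


def save_upload(db, store, owner, filename, data):
    """Put the bytes in the object store and only a small row in the database."""
    key = hashlib.sha256(data).hexdigest()  # content-addressed key (an assumption)
    store.put(key, data)
    db.execute(
        "INSERT INTO uploads (owner, filename, size, object_key) VALUES (?, ?, ?, ?)",
        (owner, filename, len(data), key),
    )
    return key


def find_uploads(db, owner):
    """Searching happens in the database; the object store is never scanned."""
    rows = db.execute(
        "SELECT filename, size, object_key FROM uploads WHERE owner = ? ORDER BY filename",
        (owner,),
    )
    return rows.fetchall()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE uploads (owner TEXT, filename TEXT, size INTEGER, object_key TEXT)")
store = InMemoryObjectStore()
save_upload(db, store, "alice", "avatar.png", b"\x89PNG fake image bytes")
save_upload(db, store, "alice", "backup.tar", b"fake archive bytes")
```

The database rows stay tiny no matter how large the uploads get, which is the point of keeping the payload out of the database.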

Related

Need for metadata store while storing an object

While checking out the design of a service like pastebin, I noticed the usage of two different storage systems:
An object store (such as Amazon S3) for storing the actual "paste" data
A metadata store to store other things pertaining to that "paste" data, such as the URL hash (to access that paste data), a reference to the actual paste data, etc.
I am trying to understand the need for this metadata store.
Is this generally the recommended way? Any specific advantage we get from using the metadata store?
Do object storage systems NOT allow metadata to be stored along with the actual object in the same storage server?
Object storage systems generally do allow quite a lot of metadata to be attached to the object.
But then your metadata is at the mercy of the object store.
Your metadata search is limited to what the object store allows.
Analysis, notification (a la inotify), etc. are limited to what the object store allows.
If you wanted to move from S3 to Google Cloud Storage, or to do both, you'd have to normalize your metadata.
Your metadata size is capped by the object store's limits.
You can't do cross-object-store metadata (e.g. a link that refers to multiple paste data).
You might not be able to have binary metadata.
Typically, metadata is both very important and very heavily used by the business, so it has different usage characteristics from the data, and it makes sense to put it on storage with different characteristics.
I can't find anywhere how pastebin.com makes money, so I don't know how heavily they use metadata, but merely the lookup, the translation between URL and paste data, is not something you can do securely with object storage alone.
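As a rough sketch of that lookup step (not how pastebin.com actually works, which I don't know): the metadata store maps a public short id to a private object key plus an expiry time, so the object key is never exposed to users. The two dicts, the function names and the "pastes/" key prefix are hypothetical stand-ins for a real database and object store.

```python
import secrets
import time

object_store = {}    # hypothetical stand-in for S3/GCS: object key -> bytes
metadata_store = {}  # hypothetical stand-in for a database: short id -> record


def create_paste(text, ttl_seconds=3600, now=None):
    """Store the paste body as an object and record how to find it."""
    now = time.time() if now is None else now
    short_id = secrets.token_urlsafe(6)              # public part of the URL
    object_key = "pastes/" + secrets.token_hex(16)   # never shown to users
    object_store[object_key] = text.encode("utf-8")
    metadata_store[short_id] = {
        "object_key": object_key,
        "expires_at": now + ttl_seconds,
    }
    return short_id


def read_paste(short_id, now=None):
    """Resolve the short id via metadata first; touch the object store only if valid."""
    now = time.time() if now is None else now
    record = metadata_store.get(short_id)
    if record is None or record["expires_at"] <= now:
        return None
    return object_store[record["object_key"]].decode("utf-8")
```

Expiry, access checks and the id-to-key translation all live in the metadata layer here, so they can change without rewriting any stored objects.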
Great answer above, just to add on - two more advantages are caching and scaling up both storage systems individually.
If you use only an object store and a paste is, say, 5 MB, would you cache all of it? A metadata store also lets you improve UX by caching, say, the first 10 or 100 KB of a paste for the user to preview while the complete object is fetched in the background. This upper bound on entry size also makes the cache's memory footprint predictable.
You can also scale the object store and the metadata store independently of each other as performance or capacity needs dictate. Lookups in the metadata store will also be quicker since it is less bulky.
Your concern is legitimate: separating the storage into two tables (or media) does add some latency. But system design is always a compromise, and there is hardly ever a win-win situation.
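Here is a small sketch of the preview-caching idea, assuming an LRU eviction policy (my choice, the answer does not name one). PREVIEW_BYTES and MAX_CACHED_PREVIEWS are made-up values. Because every entry is truncated to a fixed size, the worst-case memory use is simply entries times preview size.

```python
from collections import OrderedDict

PREVIEW_BYTES = 10 * 1024    # assumed preview size (10 KB)
MAX_CACHED_PREVIEWS = 1000   # assumed cache capacity


class PreviewCache:
    """LRU cache holding at most max_entries previews of at most preview_bytes each."""

    def __init__(self, max_entries=MAX_CACHED_PREVIEWS, preview_bytes=PREVIEW_BYTES):
        self.max_entries = max_entries
        self.preview_bytes = preview_bytes
        self._entries = OrderedDict()

    def put(self, paste_id, full_data):
        # Keep only the head of the paste; the full object stays in the object store.
        self._entries[paste_id] = full_data[: self.preview_bytes]
        self._entries.move_to_end(paste_id)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry

    def get(self, paste_id):
        preview = self._entries.get(paste_id)
        if preview is not None:
            self._entries.move_to_end(paste_id)  # mark as recently used
        return preview

    def worst_case_bytes(self):
        # Upper bound on cached payload size, known before any traffic arrives.
        return self.max_entries * self.preview_bytes
```

With the defaults above the payload bound is about 10 MB, regardless of how large individual pastes are.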

Splitting Sensenet content repository into multiple databases

Is there a way of splitting a Content repository into multiple databases? There is a great chance I'll have TBs of data, maybe even tens of TBs. Maintaining a database bigger than 1 TB becomes an issue, so I can't imagine dealing with an even bigger one. I've considered using Filestream, but having multiple databases would be a much more viable solution.
If not, is there at least a way of having several repositories contained in a single web site?
Currently (as of version 7.2) sensenet requires a central database to connect to, you cannot split that into multiple parts.
There is the blob storage feature however that lets you store binaries outside of the main metadata database. You choose a blob storage implementation (e.g. the MongoDb blob provider), install it and you can start uploading files to sensenet. Binaries above a certain (configured) size will go to the external provider.
You'll have to take care of the backup of the blob storage though, because that is different for every db provider. At least the size of the metadata db will be significantly lower.
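To illustrate the size-threshold routing described above, here is a generic sketch. This is not sensenet's API; the function names, the record layout and the 1 MB threshold are assumptions, and the two dicts stand in for the metadata database and the external blob provider.

```python
SIZE_THRESHOLD = 1024 * 1024  # hypothetical configured threshold (1 MB)


def store_binary(name, data, metadata_db, blob_store, threshold=SIZE_THRESHOLD):
    """Keep small binaries inline; send larger ones to the external blob store."""
    if len(data) > threshold:
        blob_store[name] = data
        metadata_db[name] = {"size": len(data), "location": "external", "data": None}
    else:
        metadata_db[name] = {"size": len(data), "location": "inline", "data": data}


def load_binary(name, metadata_db, blob_store):
    """Read the metadata record first, then follow it to wherever the bytes live."""
    record = metadata_db[name]
    if record["location"] == "external":
        return blob_store[name]
    return record["data"]
```

Note that the external store now holds data the main database backup does not cover, which is why the two need separate backup procedures.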

Storing Uploaded Files in Azure Web Sites: File System or Azure Storage

When using Azure Web Sites (WAWS), the general opinion seems to be that uploaded content such as photos or files should be stored in Azure Storage Blobs and not in the WAWS file system.
Clearly, using Azure Storage is a great idea if you have a lot of data and need scale and redundancy. For small or simple sites, however, it seems to add another layer of complexity, and it also means you can't easily use things like ImageResizer without purchasing the Azure-compatible licence, etc.
So, given that products like WordPress from the Azure Gallery use "/site/wwwroot/wp-content/uploads/" to store all uploaded files on WAWS, is there anything wrong with using the WAWS file system for storage, or are there other considerations to take into account when using Azure WAWS?
The major drawback to using the WAWS storage is that your data is now intermingled with the application. By saving all of your plugins/images/blobs externally in a database or blob storage, you retain the flexibility to redeploy your application to a new region/datacenter by just pushing your code to the new website and changing connection strings.
If your plugins/images are stored on disk in WAWS, then you need to make sure that you are backing it up appropriately. If anything happens, you need to restore the site along with all of the data that had been uploaded.
Azure Web Sites uses Azure Storage for its file storage, so essentially the level of complexity you're talking about is abstracted away.
Another great benefit of this approach is that if you scale your web site to multiple instances, all of them will work with the exact same file content.
Of course, if you want to use pure Azure Storage features like snapshots or sharing specific content with specific users, these are not available as is. But for web site purposes it is quite good.
Hope that helps

Storing files (videos/images/music) in CouchDB/Cloudant vs CDN (CloudFront)?

I am new to CouchDB/Cloudant and CDN (CloudFront).
I am about to build an application using CouchDB as database.
This web application will handle a lot of files.
I know that CouchDB can store files in the database as attachments. But then I have heard about leveraging CDN to store and distribute the files all over the world.
My questions:
How is storing files in CouchDB compared to CDN (CloudFront)?
How is Cloudant's service compared to CDN (CloudFront)?
Is Google storage also a CDN?
What is the difference between Amazon CloudFront and S3?
Do I have to choose to store files either in CouchDB/Cloudant or CDN, or could/should I actually combine them?
What are best practices for storing files when using CouchDB?
Some of these questions are based on your specific implementation, but here's a generalization (not in any particular order):
Unless they have Cloudant mirrored on numerous servers around the world (effectively a CDN in its own right, just sans static files), a true CDN would probably have better response time, depending mostly on how you used Cloudant (e.g., you might get good response times, but if you load the entire file into memory before outputting it, you're losing the CDN battle).
CouchDB has to process more data server-side before it can output an attachment.
CloudFront (and CDNs in general) are optimized for the fastest possible response time with the closest server.
S3 is only storage; CloudFront uses that storage and distributes it across many servers that serve the content based upon which one is closer to the user requesting that content.
Yes, you have to choose between Cloudant and the CDN; one stores them in the filesystem verbatim, the other stores them in the filesystem within the database.
I don't know the answer to some of these (e.g., how CouchDB actually handles attachment storage at a low level, or its best practices). However, this should give you enough of an idea to at least start thinking about which suits your needs best.

Planning the development of a scalable web application

We have created a product that potentially will generate tons of requests for a data file that resides on our server. Currently we have a shared hosting server that runs a PHP script to query the DB and generate the data file for each user request. This is not efficient. It has not been a problem so far, but we want to move to a more scalable system, so we're looking into EC2. Our main concerns are being able to handle high amounts of traffic when they occur, and to provide low latency to users downloading the data files.
I'm not 100% sure on how this is all going to work yet but this is the idea:
We use an EC2 instance to host our admin panel and to generate the files that are being served to app users. When any admin makes a change that affects these data files (which are downloaded by users), we make a copy over to S3 using CloudFront. The idea here is to get data cached and waiting on S3 so we can keep our compute times low, and to use CloudFront to get low latency for all users requesting the files.
I am still learning the system and wanted to know if anyone had any feedback on this idea or insight into how it all might work. I'm also curious about the purpose of projects like Cassandra. My understanding is that simply putting our application on EC2 servers makes it scalable by the nature of the servers. Is Cassandra just about keeping resource usage low, or is there a reason to use a system like this even when on EC2?
CloudFront: http://aws.amazon.com/cloudfront/
EC2: http://aws.amazon.com/ec2/
Cassandra: http://cassandra.apache.org/
Cassandra is a non-relational database engine. If that is what you need, you should first evaluate Amazon's SimpleDB, a hosted non-relational data store offered by AWS.
If the file only needs to be updated based on time (daily, hourly, ...) then this seems like a reasonable solution. But you may consider placing a load balancer in front of 2 EC2 images, each running a copy of your application. This would make it easier to scale later and safer if one instance fails.
Some other services you should read up on:
http://aws.amazon.com/elasticloadbalancing/ -- Amazon's load balancer solution.
http://aws.amazon.com/sqs/ -- Used to pass messages between systems, in your DA (distributed architecture). For example if you wanted the systems that create the data file to be different than the ones hosting the site.
http://aws.amazon.com/autoscaling/ -- Allows you to adjust the number of instances online based on traffic
Make sure to have a good backup process with EC2: snapshot your OS drive often and place any volatile data (e.g. database files) on an EBS volume. EC2 doesn't fail often, but when it does you don't have access to the hardware; if you have an up-to-date snapshot you can just kick a new instance online.
Depending on the datasets, Cassandra can also significantly improve response times for queries.
There is an excellent explanation of the data structure used in NoSQL solutions that may help you see whether this is an appropriate solution for you:
WTF is a Super Column