While checking out the design of a service like pastebin, I noticed the usage of two different storage systems:
An object store (such as Amazon S3) for storing the actual "paste" data
A metadata store for everything else pertaining to that "paste" data, such as the URL hash (used to access the paste) and a reference to the actual paste data
I am trying to understand the need for this metadata store.
Is this generally the recommended way? Any specific advantage we get from using the metadata store?
Do object storage systems NOT allow metadata to be stored along with the actual object in the same storage server?
Object storage systems generally do allow quite a lot of metadata to be attached to the object.
But then your metadata is at the mercy of the object store.
Your metadata search is limited to what the object store allows.
Analysis, notification (a la inotify), etc. are limited to what the object store allows.
If you wanted to move from S3 to Google Cloud Storage, or to do both, you'd have to normalize your metadata.
Your metadata size limits are whatever the object store imposes.
You can't do cross-object-store metadata (e.g. a link that refers to multiple paste data).
You might not be able to have binary metadata.
Typically, metadata is both very important and very heavily used by the business, so it has different usage characteristics than the data itself; it therefore makes sense to put it on storage with different characteristics.
I can't find anywhere how pastebin.com makes money, so I don't know how heavily they use metadata, but even the basic lookup, the translation between a URL and its paste data, is not something you can do securely with object storage alone.
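To make that lookup concrete, here is a minimal sketch of the split, assuming a DynamoDB table for the metadata and S3 for the paste body; the table name, field names, and bucket layout are illustrative assumptions, not how pastebin actually works.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

// Hypothetical metadata record kept in a small, queryable store.
interface PasteMetadata {
  urlHash: string;   // public identifier that appears in the paste URL
  s3Bucket: string;  // where the blob actually lives
  s3Key: string;     // reference to the actual paste data
  createdAt: string;
  expiresAt?: string;
  sizeBytes: number;
}

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const s3 = new S3Client({});

// Resolve a short URL hash to the paste body: one metadata lookup, one object fetch.
async function loadPaste(urlHash: string): Promise<string | undefined> {
  const meta = await ddb.send(
    new GetCommand({ TableName: "pastes-metadata", Key: { urlHash } })
  );
  const record = meta.Item as PasteMetadata | undefined;
  if (!record) return undefined; // unknown or expired hash

  const obj = await s3.send(
    new GetObjectCommand({ Bucket: record.s3Bucket, Key: record.s3Key })
  );
  return obj.Body?.transformToString();
}
```

Because the record is tiny and lives in a queryable store, you can index, expire, or migrate it independently of the blobs themselves.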
Great answer above; just to add on, two more advantages are caching and the ability to scale both storage systems individually.
If you just use an object store and, say, a paste is 5 MB, would you cache all of it? A metadata store also lets you improve UX by caching, say, the first 10 or 100 KB of a paste for the user to preview while the complete object is fetched in the background. This upper bound also helps you design the cache deterministically.
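A rough sketch of that preview idea, assuming S3 with a ranged read and a simple in-memory map standing in for a real cache; the 100 KB bound and key names are made up.

```typescript
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const PREVIEW_BYTES = 100 * 1024;                 // deterministic upper bound for the cache
const previewCache = new Map<string, string>();   // stand-in for Redis/memcached

// Fetch only the first PREVIEW_BYTES of a paste for a quick preview,
// caching it so repeat views never touch the object store.
async function getPreview(bucket: string, key: string): Promise<string> {
  const cached = previewCache.get(key);
  if (cached !== undefined) return cached;

  const res = await s3.send(
    new GetObjectCommand({
      Bucket: bucket,
      Key: key,
      Range: `bytes=0-${PREVIEW_BYTES - 1}`, // partial read; the full object is fetched later
    })
  );
  const preview = (await res.Body?.transformToString()) ?? "";
  previewCache.set(key, preview);
  return preview;
}
```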
You can also scale the object store and the metadata store independently of each other as per performance/capacity needs. Lookups in the metadata store will also be quicker since it's less bulky.
Your concern is legitimate that separating the storage into two stores (or media) does add some latency, but system design is always a compromise; there is hardly ever a win-win situation.
I'm putting together a datastore technology that relies on a backing key-value store, and has Typescript generator code to populate an index made of tiny JSON objects.
With AWS S3 in mind as an example store, I am alarmed at the possibility of a bug in my Typescript index generation simply continuing to write endless entries with unlimited cost. AWS has no mechanism that I know of to defend me from bankruptcy, so I don't want to use their services.
However, the volume of data might build for certain use cases to Megabytes or Gigabytes and access might be only occasional so cheap long term storage would be great.
What simple key-value cloud stores equivalent to S3 are there that will allow me to define limits on storage and retrieval numbers or cost? In such a service, a bug would mean I eventually start seeing e.g. 403, 429 or 507 errors as feedback that I have hit a quota, limiting my financial exposure. Preferably this would be on a per-bucket basis, rather than the whole account being frozen.
I am using S3 as a reference for the bare minimum needed to fulfil the tool's requirements, but any similar blob or object storage API (put, get, delete, list in UTF-8 order) that eventually starts rejecting requests when a quota is exceeded would be fine too.
Learning the names of qualifying systems and the terminology for their quota limit feature would give me important insights and allow me to review possible options.
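To illustrate the behaviour I'm after, here is a rough sketch of how my index writer would treat those rejections; the client interface and error shape below are hypothetical placeholders, not any particular vendor's SDK.

```typescript
// Hypothetical shape of whatever S3-compatible client ends up being used.
interface BlobStoreClient {
  put(bucket: string, key: string, body: string | Uint8Array): Promise<void>;
}

// Statuses treated as "quota exceeded" feedback, per the question.
const QUOTA_STATUSES = new Set([403, 429, 507]);

class QuotaExceededError extends Error {}

// Wrap writes so a runaway index generator fails fast instead of
// silently racking up cost once the provider starts rejecting requests.
async function guardedPut(
  client: BlobStoreClient,
  bucket: string,
  key: string,
  body: string
): Promise<void> {
  try {
    await client.put(bucket, key, body);
  } catch (err: unknown) {
    const status = (err as { statusCode?: number }).statusCode;
    if (status !== undefined && QUOTA_STATUSES.has(status)) {
      throw new QuotaExceededError(`bucket ${bucket} hit its quota (HTTP ${status})`);
    }
    throw err;
  }
}
```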
What are some use cases for object storage, as opposed to file systems or block storage (database) systems?
From what I understand, object storage is mostly used for persistent storage for applications running on cloud systems. It seems to have a lot of overlap with file systems, except that the details of how the objects are stored are abstracted away so that apps can access them with simple web queries.
However, I'd love if someone could give examples of applications where this is actually used instead of or alongside the other two storage systems.
Some example use cases for object storage:
Off-site backups
Storing and serving user content (e.g. profile pictures)
Storing artifacts (e.g. JAR files, startup scripts) to be deployed to VMs
Distributing static content (e.g. video content for your users)
Caching intermediate data (e.g. individual frames from a render farm before assembly into output video)
Accepting input or providing output to a web service (as accepting data by POST can be difficult/inefficient for large input files); see the sketch after this answer.
Archiving data for regulatory purposes
All these cases might be accompanied by a database to store metadata (i.e. to find the objects). Actually storing the data in the database would, however, exceed size limits or significantly harm database performance.
These use cases can be achieved with a file system, as long as your total usage can be handled by a single machine. If you have more traffic than that, you will need replicated storage, load balancing, etc., at which point you are effectively implementing an object storage system yourself.
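To make the web-service input/output case above concrete, here is a minimal sketch of handing out a presigned S3 upload URL so a large file goes straight to the object store instead of through the application; the bucket name, key layout, and expiry are illustrative assumptions.

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({});

// Hand the caller a short-lived URL so a large upload bypasses the
// web service entirely and lands directly in the object store.
async function createUploadUrl(jobId: string): Promise<string> {
  return getSignedUrl(
    s3,
    new PutObjectCommand({ Bucket: "service-input", Key: `uploads/${jobId}` }),
    { expiresIn: 15 * 60 } // seconds
  );
}
```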
I have a Web Application (Java backend) that processes a large amount of raw data that is uploaded from a hardware platform containing a number of sensors.
Currently the raw data is uploaded, decompressed, and stored as a 'text' field in a PostgreSQL database so that users can log in and generate various graphs/charts of the data (using a JS charting library client-side).
Example string...
[45,23,45,32,56,75,34....]
The arrays will typically contain ~300,000 values, but this could be up to 1,000,000 depending on how long the sensors are recording, so the size of the string being stored could be a few hundred kilobytes.
This currently seems to work fine for now, as there are only ~200 uploads per day, but as I look at the scalability of the application and the ability to back up the data, I am considering alternatives for storing it.
DynamoDB looked like a great option for me, as I can carry on storing the upload details in my SQL table and just save a URL endpoint to be called to retrieve the arrays... but then I noticed the item size is limited to 64 KB.
As I am sure there are a million and one ways to do this, I would like to put this out to the SO community to hear what others would recommend, either web services or locally stored... considering performance, scalability, maintainability, etc.
Thanks in advance!
UPDATE:
Just to clarify, the data shown above is just the 'Y' values; since it is time-sampled, the X values are taken as the position in the array... so I don't think storing as a tuple would have any benefits.
If you are looking to store such strings, you probably want to use S3 (one object containing the array string); in this case you will have "backup" out of the box by enabling bucket versioning.
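A minimal sketch of that approach, assuming the AWS SDK v3 for JavaScript and a bucket that already has versioning enabled; the bucket and key names are illustrative.

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});

// Store one upload as one object. With bucket versioning enabled, every
// overwrite keeps the previous version, which is the "backup out of the box"
// referred to above.
async function storeSensorArray(uploadId: string, values: number[]): Promise<void> {
  await s3.send(
    new PutObjectCommand({
      Bucket: "sensor-uploads",
      Key: `arrays/${uploadId}.json`,
      Body: JSON.stringify(values),
      ContentType: "application/json",
    })
  );
}
```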
You can try a combination of Couchbase and Elasticsearch. Couchbase is a very fast document-oriented NoSQL database; several thousand insert operations per second is normal for it, item size is limited to 20 MB, and "get" performance is in the tens of thousands of operations per second. There is one disadvantage: you can query data only by id (there are "views", but I think it would be too difficult to adapt them to the plotting). Elasticsearch can compensate for this deficiency, as it can perform arbitrary queries very quickly. Both Couchbase and Elasticsearch store data as JSON documents.
I have just come across Google Cloud Datastore, which allows me to store single String items of up to 1 MB (unindexed); it seems like a good alternative to DynamoDB.
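A small sketch of that option, assuming the @google-cloud/datastore Node client; the kind and property names are made up, and the large string is stored unindexed so only the overall entity size limit applies.

```typescript
import { Datastore } from "@google-cloud/datastore";

const datastore = new Datastore();

// Store the raw array string as a single unindexed property; the upload id
// becomes the entity name so it can be fetched directly later.
async function saveUpload(uploadId: string, arrayJson: string): Promise<void> {
  await datastore.save({
    key: datastore.key(["SensorUpload", uploadId]),
    excludeFromIndexes: ["data"], // keep the big string out of the indexes
    data: {
      createdAt: new Date(),
      data: arrayJson,
    },
  });
}
```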
Maybe you should use Redis or SSDB; both are designed to store large lists (arrays) of data. The difference between these two databases is that Redis is memory-only (with disk for backup), while SSDB is disk-based and uses memory as a cache.
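A sketch of the Redis route, assuming the ioredis client and storing each upload's Y values as a Redis list; the key naming is illustrative, and SSDB is suggested above as the disk-based alternative.

```typescript
import Redis from "ioredis";

const redis = new Redis(); // assumes a locally reachable Redis instance

// Append a batch of sampled Y values to a per-upload list.
async function appendSamples(uploadId: string, values: number[]): Promise<void> {
  if (values.length === 0) return;
  await redis.rpush(`upload:${uploadId}:y`, ...values.map(String));
}

// Read back a window of values for charting (defaults to the whole list).
async function readSamples(uploadId: string, start = 0, stop = -1): Promise<number[]> {
  const raw = await redis.lrange(`upload:${uploadId}:y`, start, stop);
  return raw.map(Number);
}
```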
Users of our platform will have large amounts of stored data on our system. Through an application, once connected, that data will be transferred to them and will no longer need to remain on our servers. There could potentially be hundreds or thousands of users connected at any given time, performing their downloads.
Here's the proposed architecture:
User management, configuration, and data download statistics will be maintained in a SQL Server database, while using either Redis or DynamoDB for the large data sets.
The reason for choosing either Redis or DynamoDB is based on cost (cheaper than running another SQL Server instance) and performance. The data format will be similar to a data mart: a flat table with no joins.
Initially the queries would be simple: get all data for user X between a date range, and optionally delete it.
Since we may want to add free-text searching on certain fields of that data, using Elasticsearch may be a better option from the get-go.
I want this to be auto-scaling, but I'm not sure which database would be best for this scenario.
Here's some great discussion on Database + Search tier from AWS ReInvent:
https://youtu.be/K7o5OlRLtvU?t=1574
I would not use Elasticsearch alone because it does not provide auto-scaling for write capacity; in fact, it is not trivial to increase the number of shards of an index. Secondly, it can only handle the JSON format, which could be an issue for you.
Redis could be a good idea because it is really fast (everything is done in RAM), and it provides keys with a limited time-to-live, which could be interesting for you. Unfortunately, if your data size exceeds the RAM capacity of your Amazon instance you will have to shard your Redis database, and Redis does not handle that for you; you will have to deal with it in your application code. Moreover, as far as I know Redis does not handle complex queries. You will also need to map your data into a Redis data structure, which could be an issue for you.
DynamoDB handles auto-scaling really well, but on the other hand it is a key/value database, so it does not let you make queries like "get all data for user X between a date range". DynamoDB also allows you to save your data in any format.
The solution would be to use either DynamoDB or Redis depending on the size of your data, and to use Elasticsearch to index only the key plus the metadata (user and dates). That way your index stays small, and if you lose the ability to index because Elasticsearch gets too busy, you keep the ability to save the users' data.
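A rough sketch of that split, assuming the @elastic/elasticsearch v8 client and that the bulk payload lives in DynamoDB or Redis keyed by recordKey; the index name, field names, and the userId keyword mapping are assumptions.

```typescript
import { Client } from "@elastic/elasticsearch";

const es = new Client({ node: "http://localhost:9200" });

// Index only the small metadata (user + date) alongside the key of the record
// that lives in DynamoDB/Redis; the bulk payload never enters Elasticsearch.
async function indexRecordMetadata(
  recordKey: string,
  userId: string,
  createdAt: string
): Promise<void> {
  await es.index({
    index: "records-meta",
    id: recordKey,
    document: { userId, createdAt },
  });
}

// "Get all data for user X between a date range" becomes: search the small
// index for keys, then fetch the payloads from the primary store by key.
async function findRecordKeys(userId: string, from: string, to: string): Promise<string[]> {
  const res = await es.search({
    index: "records-meta",
    query: {
      bool: {
        filter: [
          { term: { userId } }, // assumes userId is mapped as a keyword
          { range: { createdAt: { gte: from, lte: to } } },
        ],
      },
    },
  });
  return res.hits.hits.map((h) => String(h._id));
}
```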
I am hoping someone can explain how to use BLOBs. I see that BLOBs can be used to store video files. My question is why would a person store a video file in a BLOB in a SQL database? What are the advantages and disadvantages compared to storing pointers to the location of the video file?
A few different reasons.
If you store a pointer to a file on disk (presumably using the BFILE data type), you have to ensure that your database is updated whenever files are moved, renamed, or deleted on disk. It's relatively common when you store data like this that over time your database gets out of sync with the file system and you end up with broken links and orphaned content.
If you store a pointer to a file on disk, you cannot use transactional semantics when you're dealing with multimedia. Since you can't issue a rollback against a file system, you either have to accept that there will be situations where the data on the file system doesn't match the data in the database (i.e. someone uploaded a video to the file system but the transaction that created the author and title in the database failed, or vice versa), or you have to add additional steps to the file upload to simulate transactional semantics (e.g. uploading a second <>_done.txt file that just contains the number of bytes in the actual file that was uploaded). That's cumbersome and error-prone and may create usability issues.
For many applications, having the database serve up the data is the easiest way to provide it to a user. If you want to avoid giving a user a direct FTP URL to your files, because they could use that to bypass application-level security, the easiest option is a database-backed application: to retrieve the data, the database reads it from the file system and returns it to the middle tier, which then sends it to the client. If you're going to have to read the data into the database every time it is retrieved, it often makes more sense to just store the data directly in the database and let the database read it from its own data files when the user asks for it.
Finally, databases like Oracle provide additional utilities for working with multimedia data in the database. Oracle interMedia, for example, provides a rich set of objects to interact with video data stored in the database: you can easily tag where scenes begin or end, tag where various subjects are discussed, record when the video was recorded and who recorded it, etc. And you can integrate that search functionality with searches against all your relational data. Of course, you could write an application on top of the database that did all those things as well, but then you're either writing a lot of code or using another framework in your app. It's often much easier to leverage the database functionality.
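To illustrate the transactional point above with runnable code, here is a minimal sketch in which the video bytes and their metadata commit or roll back together. It uses PostgreSQL's bytea via node-postgres purely as a generic stand-in (the answer itself discusses Oracle BLOBs), and the table and column names are made up.

```typescript
import { readFile } from "node:fs/promises";
import { Pool } from "pg";

const pool = new Pool();

// The video content and its metadata are written in one transaction, so a
// failure leaves no orphaned bytes and no row without content, which is not
// possible when the bytes live on a separate file system.
async function saveVideo(title: string, author: string, path: string): Promise<void> {
  const bytes = await readFile(path); // Buffer maps to a bytea column
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query(
      "INSERT INTO videos (title, author, content) VALUES ($1, $2, $3)",
      [title, author, bytes]
    );
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```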
Take a read of this: http://www.oracle.com/us/products/database/options/spatial/039950.pdf
(Obviously a biased view, but it does note a few cons, which have since been fixed by the advent of 11g.)