YAML as data storage instead of SQL

Can I use a YAML flat-file system for data storage instead of SQL? I need to store sensitive data, such as
monetary values.

In general, YAML can work as storage for a single-process system. If you are considering secure storage for a web server, which is multithreaded, YAML would not be appropriate unless you allocated a new YAML file with a guaranteed unique name for each write, perhaps using a GUID.
You would then have many plaintext files containing sensitive data sitting on a web server, which is inherently insecure. I do not recommend storing sensitive data in plaintext.
Databases have been used to store monetary values for decades; banks, for example, use databases for nearly everything. It is unclear why your use case would be different.
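If the underlying question is how to store monetary values safely, the usual answer is a relational table with exact numeric types or integer cents. A minimal sketch with Python's built-in sqlite3 and a hypothetical payments table, just to contrast with a hand-rolled YAML file:

```python
import sqlite3

# Hypothetical payments table; amounts are stored as integer cents to avoid
# floating-point rounding issues with monetary values.
conn = sqlite3.connect("payments.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS payments (
        id INTEGER PRIMARY KEY,
        account TEXT NOT NULL,
        amount_cents INTEGER NOT NULL
    )
    """
)
conn.execute(
    "INSERT INTO payments (account, amount_cents) VALUES (?, ?)",
    ("acct-42", 1999),  # $19.99 stored as 1999 cents
)
conn.commit()
for row in conn.execute("SELECT account, amount_cents FROM payments"):
    print(row)
conn.close()
```

The database also gives you transactions and locking for concurrent writers, which a YAML flat file does not.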

Related

Need for metadata store while storing an object

While checking out the design of a service like pastebin, I noticed the usage of two different storage systems:
An object store (such as Amazon S3) for storing the actual "paste" data
A metadata store to store other things pertaining to that "paste" data, such as the URL hash (to access that paste data), a reference to the actual paste data, etc.
I am trying to understand the need for this metadata store.
Is this generally the recommended way? Any specific advantage we get from using the metadata store?
Do object storage systems NOT allow metadata to be stored along with the actual object in the same storage server?
Object storage systems generally do allow quite a lot of metadata to be attached to the object.
But then your metadata is at the mercy of the object store.
Your metadata search is limited to what the object store allows.
Analysis, notification (à la inotify), etc. are limited to what the object store allows.
If you wanted to move from S3 to Google Cloud Storage, or to do both, you'd have to normalize your metadata.
Your metadata size is limited by whatever the object store allows.
You can't do cross-object-store metadata (e.g. a link that refers to multiple paste data).
You might not be able to have binary metadata.
Typically, metadata is both very important and very heavily used by the business, so it has different usage characteristics than the data itself, and it makes sense to put it on storage with different characteristics.
I can't find anywhere how pastebin.com makes money, so I don't know how heavily they use metadata, but merely the lookup, the translation between URL and paste data, is not something you can do securely with object storage alone.
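For illustration, a rough sketch of that split using boto3 and sqlite3; the bucket name, table layout, and save_paste helper are all hypothetical:

```python
import hashlib
import sqlite3

import boto3

# Hypothetical split: the blob lives in S3, the lookup data lives in a
# small relational metadata store (sqlite3 here just for illustration).
s3 = boto3.client("s3")
db = sqlite3.connect("metadata.db")
db.execute(
    """
    CREATE TABLE IF NOT EXISTS pastes (
        url_hash   TEXT PRIMARY KEY,
        s3_bucket  TEXT NOT NULL,
        s3_key     TEXT NOT NULL,
        size_bytes INTEGER NOT NULL
    )
    """
)

def save_paste(content: bytes) -> str:
    """Store the paste body in the object store and its lookup metadata separately."""
    url_hash = hashlib.sha256(content).hexdigest()[:10]
    key = f"pastes/{url_hash}"
    s3.put_object(Bucket="my-paste-bucket", Key=key, Body=content)  # bucket name is made up
    db.execute(
        "INSERT OR REPLACE INTO pastes VALUES (?, ?, ?, ?)",
        (url_hash, "my-paste-bucket", key, len(content)),
    )
    db.commit()
    return url_hash
```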
Great answer above; just to add on, two more advantages are caching and the ability to scale the two storage systems independently.
If you just use object storage and, say, a paste is 5 MB, would you cache all of it? A metadata store also lets you improve UX by caching, say, the first 10 or 100 KB of a paste for the user to preview while the complete object is fetched in the background. This upper bound also makes the cache size deterministic.
You can also scale the object store and the metadata store independently of each other according to performance/capacity needs. Lookups in the metadata store will also be quicker since it is less bulky.
Your concern that separating the storage into two tables (or mediums) adds some latency is legitimate, but system design is always a compromise; there is hardly ever a win-win situation.
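The preview idea maps directly onto a ranged read from the object store. A sketch with boto3, assuming hypothetical bucket/key names:

```python
import boto3

s3 = boto3.client("s3")

def fetch_preview(bucket: str, key: str, max_bytes: int = 10 * 1024) -> bytes:
    """Fetch only the first ~10 KB of a paste so the UI can render a preview quickly."""
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes=0-{max_bytes - 1}")
    return resp["Body"].read()

preview = fetch_preview("my-paste-bucket", "pastes/abc123")  # hypothetical bucket/key
```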

How to save millions of files in S3 so that arbitrary future searches on key/path values are fast

My company has millions of files in an S3 bucket, and every so often I have to search for files whose keys/paths contain some text. This is an extremely slow process because I have to iterate through all files.
I can't use prefix because the text of interest is not always at the beginning. I see other posts (here and here) that say this is a known limitation in S3's API. These posts are from over 3 years ago, so my first question is: does this limitation still exist?
Assuming the answer is yes, my next question is, given that I anticipate arbitrary regex-like searches over millions of S3 files, are there established best practices for workarounds? I've seen some people say that you can store the key names in a relational database, Elasticsearch, or a flat file. Are any of these approaches more commonplace than others?
Also, out of curiosity, why hasn't S3 supported such a basic use case in a service (S3) that is such an established core product of the overall AWS platform? I've noticed that GCS on Google Cloud has a similar limitation. Is it just really hard to do searches on key name strings well at scale?
S3 is an object store, conceptually similar to a file system. I'd never try to make a database-like environment based on file names in a file system nor would I in S3.
Nevertheless, if this is what you have then I would start by running code to get all of the current file names into a database of some sort. DynamoDB cannot query by regular expression but any of PostgreSQL, MySQL, Aurora, and ElasticSearch can. So start with listing every file and put the file name and S3 location into a database-like structure. Then, create a Lambda that is notified of any changes (see this link for more info) that will do the appropriate thing with your backing store when a file is added or deleted.
Depending on your needs, ElasticSearch is super flexible and possibly better suited for these types of queries, but a traditional relational database can be made to work too.
Lastly, you'll need an interface to the backing store to query. That will likely require some sort of server; it could be as simple as API Gateway in front of a Lambda, or something far more complex.
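A rough sketch of such a Lambda handler (not a drop-in implementation), assuming S3 event notifications are wired to it and the keys are mirrored into a PostgreSQL table named s3_keys; the table, its columns, and the DATABASE_URL variable are hypothetical:

```python
import os

import psycopg2  # assumes the psycopg2 package/layer is bundled with the Lambda

def handler(event, context):
    """Keep a searchable index of S3 keys in sync as objects are created or removed."""
    conn = psycopg2.connect(os.environ["DATABASE_URL"])  # hypothetical connection string
    with conn, conn.cursor() as cur:
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            if record["eventName"].startswith("ObjectCreated"):
                cur.execute(
                    "INSERT INTO s3_keys (bucket, key) VALUES (%s, %s) "
                    "ON CONFLICT DO NOTHING",
                    (bucket, key),
                )
            elif record["eventName"].startswith("ObjectRemoved"):
                cur.execute(
                    "DELETE FROM s3_keys WHERE bucket = %s AND key = %s",
                    (bucket, key),
                )
    conn.close()
```

A search then becomes an ordinary SQL query against that table, e.g. SELECT key FROM s3_keys WHERE key ~ 'invoice_[0-9]+' in PostgreSQL.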
You might consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file containing a list of all objects in the bucket.
You could then load this file into a database, or even write a script to parse it. Or possibly even just play with it in Excel.
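For example, a throwaway script along these lines could scan an inventory report for keys matching a pattern (the file name and pattern are made up; check your own inventory configuration for the exact column layout):

```python
import csv
import gzip
import re
from urllib.parse import unquote

PATTERN = re.compile(r"invoice_\d+")  # hypothetical search pattern

# S3 Inventory CSV reports are delivered gzip-compressed; bucket and key are
# normally the first two columns, and key names are URL-encoded.
with gzip.open("inventory-2024-01-01.csv.gz", "rt", newline="") as f:
    for row in csv.reader(f):
        bucket, key = row[0], unquote(row[1])
        if PATTERN.search(key):
            print(f"s3://{bucket}/{key}")
```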

Data exchange format to use with Apache Kafka that provides schema validation

What is the best message format to use with Apache Kafka so that producers and consumers can define a contract, validate data, and serialize/deserialize it? For example, XML has XSD, but JSON has no universal schema. I read about using Apache Avro, but I'm not sure how fast it will be, as I can't afford more than 5 to 6 ms for schema validation and deserialisation. Any inputs please?
We will be processing thousands of transactions per second, and the SLA for each transaction is 150 ms, so I am looking for something that's very fast.
Avro is often quoted as being slow(er) and adding overhead compared to other binary formats, but I believe that applies to the use case of not using a Schema Registry; with a registry, the schema is excluded from the actual payload.
Alternatively, you can use Protobuf or Thrift if you absolutely want a schema; however, I don't think Kafka serializers for these formats are as readily available, from what I've seen. Plus, the schemas need to be passed between your clients if they are not committed to a central location.
I can confidently say that Avro should be fine for starting out, though, and the Schema Registry is definitely useful, and not just for Kafka use cases.
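For reference, a minimal producer sketch assuming the confluent-kafka Python client and a Schema Registry at a placeholder localhost address; the topic, schema, and record are made up for illustration:

```python
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

# Made-up Avro schema and topic, purely for illustration.
schema_str = """
{
  "type": "record",
  "name": "Transaction",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "amount_cents", "type": "long"}
  ]
}
"""

sr_client = SchemaRegistryClient({"url": "http://localhost:8081"})  # placeholder registry URL
serializer = AvroSerializer(sr_client, schema_str)
producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder broker

record = {"id": "tx-1", "amount_cents": 1999}
producer.produce(
    topic="transactions",
    value=serializer(record, SerializationContext("transactions", MessageField.VALUE)),
)
producer.flush()
```

With the registry in place, each message carries only a small schema ID rather than the full schema, which is what keeps the per-message overhead and (de)serialization cost low.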

Splitting Sensenet content repository into multiple databases

Is there a way of splitting a Content repository into multiple databases? There is a good chance I'll have terabytes of data, maybe even tens of terabytes. Maintaining a database bigger than 1 TB becomes an issue, so I can't imagine dealing with anything bigger. I've considered using Filestream, but having multiple databases would be a much more viable solution.
If not, is there at least a way of having several repositories contained in a single web site?
Currently (as of version 7.2) sensenet requires a central database to connect to; you cannot split it into multiple parts.
There is the blob storage feature, however, which lets you store binaries outside of the main metadata database. You choose a blob storage implementation (e.g. the MongoDB blob provider), install it, and you can start uploading files to sensenet. Binaries above a certain (configured) size will go to the external provider.
You'll have to take care of the backup of the blob storage yourself though, because that is different for every provider. At least the size of the metadata db will be significantly smaller.

What is a recommended scalable DB platform to use in AWS for large amounts of volatile data sets - elasticsearch, Redis or DynamoDB?

Users of our platform will have large amounts of stored data on our system. Through an application, once connected, that data will be transferred to them and will no longer need to remain on our servers. There could potentially be hundreds or thousands of users connected at any given time, performing their downloads.
Here's the proposed architecture:
User management, configuration, and data download statistics will be maintained in a SQL Server database, while using either Redis or DynamoDB for the large data sets.
The reason for choosing either Redis or DynamoDB comes down to cost (cheaper than running another SQL Server instance) and performance. The data format will be similar to a data mart: a flat table with no joins.
Initially the queries would be simple - get all data for user X between a date range, and optionally delete.
Since we may want to add free-text searching on certain fields of that data, Elasticsearch may be a better option to use from the get-go.
I want this to be auto-scaling, but I'm not sure which database would be best for this scenario.
Here's some great discussion on Database + Search tier from AWS ReInvent:
https://youtu.be/K7o5OlRLtvU?t=1574
I would not pick Elasticsearch alone because it does not provide auto-scaling for write capacity; in fact, it's not trivial to increase the number of shards of an index. Secondly, it can only handle the JSON format, which could be an issue for you.
Redis could be a good idea because it is really fast (everything is done in RAM), and it provides keys with a limited time-to-live, which could be interesting for you. Unfortunately, if your data size exceeds the RAM capacity of your Amazon instance, you will have to shard your Redis database, and Redis does not do that for you; you will have to handle it in your application code. Moreover, as far as I know, Redis does not handle complex queries, and you will also need to map your data onto Redis data structures, which could be an issue for you.
DynamoDB handles auto-scaling really well, but on the other hand it is a key/value database, so queries like "get all data for user X between a date range" only work if you design the partition and sort keys around them. DynamoDB also allows you to save your data in any format.
The solution would be to use either DynamoDB or Redis, depending on the size of your data, and to use Elasticsearch to index your keys with only the metadata (user and dates). That way your index stays small, and if indexing falls behind because Elasticsearch gets too busy, you keep the ability to save the users' data.
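To make the DynamoDB side concrete, a sketch with boto3, assuming a hypothetical table whose partition key is user_id and sort key is created_at:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user_datasets")  # hypothetical table name

# This date-range query works only because the table is keyed for it:
# user_id as the partition key, created_at (ISO date string) as the sort key.
response = table.query(
    KeyConditionExpression=Key("user_id").eq("user-123")
    & Key("created_at").between("2024-01-01", "2024-01-31")
)
items = response["Items"]
```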