DynamoDB + S3 + CloudSearch + Redis

I'm currently creating a scheme for my application and I'm wondering if my thinking is right.
Example: e-commerce site
In DynamoDB, I would put products (product_id, metadata link to S3).
S3 I would use for storing the Search Data Format (SDF/JSON)
(product name, product description, price, etc.).
Amazon CloudSearch would be used to index the documents in S3 and to be able to search them.
Redis would be used to cache results.
Is my scheme right? Can S3 be a good "database"?
Is DynamoDB even needed here?

If you are thinking that S3 would just be the source of record for your products and you are not expecting heavy reads/writes, then it COULD work, but you have to recognize that it will be far, far slower than using a real database. Not just 1-2x slower, but orders of magnitude slower. We use S3 for storing audit data for realtime data stored in Postgres - works a charm, but this is data that is written once and read rarely. Retrieval times, when it does have to fetch audit records, are > 50ms. This kind of latency is usually not acceptable when you need to manipulate multiple records at one time.
If you are going to be using DynamoDB anyway, why not just use that to store what you'd be storing on S3? Trying to adhere to the concept of "keep it simple", I would use the following stack:
DynamoDB to be the system of record and to handle some searches
CloudSearch for more flexible searching than what DynamoDB can provide
S3 for static files (product images, etc.)
And again, to keep things simple, skip Redis for caching if you are already using DynamoDB and don't plan on using any of Redis' specialized datatypes - i.e. your caching will be nothing more than keys to strings, etc. Use Redis if you plan on taking advantage of its other datatypes or if you want to have a cache closer to your app - i.e. you plan on running Redis on the webserver.
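For illustration, here is a minimal sketch of that stack with boto3; the table name, bucket name, and CloudSearch endpoint are hypothetical and would come from your own setup:

    import boto3

    dynamodb = boto3.resource('dynamodb')
    products = dynamodb.Table('Products')   # hypothetical table: the system of record
    s3 = boto3.client('s3')
    search = boto3.client(
        'cloudsearchdomain',
        endpoint_url='https://search-products-example.us-east-1.cloudsearch.amazonaws.com')  # hypothetical

    # DynamoDB holds the full product record, including a pointer to the image in S3.
    products.put_item(Item={
        'product_id': 'p-1001',
        'name': 'Red running shoes',
        'description': 'Lightweight road shoes',
        'price': '59.99',
        'image_key': 'images/p-1001.jpg',
    })

    # S3 only stores static assets such as product images.
    s3.upload_file('p-1001.jpg', 'my-product-images', 'images/p-1001.jpg')  # hypothetical bucket

    # CloudSearch handles the flexible full-text queries.
    results = search.search(query='running shoes', queryParser='simple')
    print(results['hits']['found'])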

Dynamo is used for storing write-intensive data. If your application does not require intensive writes over product_id and metadata, I think RDS/MySQL is a better fit.

When designing an application, you really should keep things as simple as possible from the beginning. It will always get worse with time :)
S3 is not a good DB. It has not been designed for this and is too slow. It's for file storage only. If you want to stick with DynamoDB, you should put all your product info in it, including the metadata.
CloudSearch may be a good option. You can also build your own "indexes" on top of DynamoDB. It requires more design and programming but might be worth considering. Here is a link to an excellent blog post on this matter: http://blog.coredumped.org/2012/01/amazon-dynamodb.html.
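As a rough illustration of the "build your own indexes" idea (the table and attribute names here are hypothetical), you can maintain a second DynamoDB table that maps each search term to the set of product ids containing it:

    import boto3

    dynamodb = boto3.resource('dynamodb')
    products = dynamodb.Table('Products')        # hypothetical main table
    term_index = dynamodb.Table('ProductTerms')  # hypothetical index table: term -> product_ids

    def index_product(product_id, name):
        # Store the product, then add its id to the inverted index for every word in the name.
        products.put_item(Item={'product_id': product_id, 'name': name})
        for term in set(name.lower().split()):
            term_index.update_item(
                Key={'term': term},
                UpdateExpression='ADD product_ids :p',
                ExpressionAttributeValues={':p': {product_id}},  # DynamoDB string set
            )

    def search(term):
        resp = term_index.get_item(Key={'term': term.lower()})
        return resp.get('Item', {}).get('product_ids', set())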
So,
Is DynamoDB even needed: Yes - or RDS, Mongo, ... any real DB, depending on your needs.
Is S3 a good DB: I don't think so.

Related

How to save millions of files in S3 so that arbitrary future searches on key/path values are fast

My company has millions of files in an S3 bucket, and every so often I have to search for files whose keys/paths contain some text. This is an extremely slow process because I have to iterate through all files.
I can't use prefix because the text of interest is not always at the beginning. I see other posts (here and here) that say this is a known limitation in S3's API. These posts are from over 3 years ago, so my first question is: does this limitation still exist?
Assuming the answer is yes, my next question is: given that I anticipate arbitrary regex-like searches over millions of S3 files, are there established best practices for workarounds? I've seen some people say that you can store the key names in a relational database, Elasticsearch, or a flat file. Are any of these approaches more commonplace than others?
Also, out of curiosity, why hasn't S3 ever supported such a basic use case, given that it is such an established core product of the overall AWS platform? I've noticed that GCS on Google Cloud has a similar limitation. Is it just really hard to do searches on key-name strings well at scale?
S3 is an object store, conceptually similar to a file system. I'd never try to make a database-like environment based on file names in a file system nor would I in S3.
Nevertheless, if this is what you have, then I would start by running code to get all of the current file names into a database of some sort. DynamoDB cannot query by regular expression, but PostgreSQL, MySQL, Aurora, and Elasticsearch all can. So start with listing every file and putting the file name and S3 location into a database-like structure. Then, create a Lambda function that is notified of any changes (see this link for more info) and that will do the appropriate thing with your backing store when a file is added or deleted.
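A minimal sketch of that Lambda, assuming the bucket is configured to send ObjectCreated/ObjectRemoved notifications to it and that the key names live in a PostgreSQL table (the table name and connection string are hypothetical):

    import os
    import urllib.parse
    import psycopg2

    conn = psycopg2.connect(os.environ['PG_DSN'])  # hypothetical connection string

    def handler(event, context):
        with conn, conn.cursor() as cur:
            for record in event.get('Records', []):
                bucket = record['s3']['bucket']['name']
                key = urllib.parse.unquote_plus(record['s3']['object']['key'])
                if record['eventName'].startswith('ObjectCreated'):
                    cur.execute(
                        'INSERT INTO s3_keys (bucket, key) VALUES (%s, %s) ON CONFLICT DO NOTHING',
                        (bucket, key))
                elif record['eventName'].startswith('ObjectRemoved'):
                    cur.execute(
                        'DELETE FROM s3_keys WHERE bucket = %s AND key = %s',
                        (bucket, key))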
Depending on your needs, Elasticsearch is super flexible with queries and possibly better suited for these types of searches. But a traditional relational database can be made to work too.
Lastly, you'll need an interface to the backing store to query. That will likely require some sort of server. It could be as simple as API Gateway in front of a Lambda, or something far more complex.
You might consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file containing a list of all objects in the bucket.
You could then load this file into a database, or even write a script to parse it. Or possibly even just play with it in Excel.
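For example, a quick sketch of scanning an inventory report for keys that match a regex; the local file name and pattern are made up, and the column order assumes the default bucket/key layout of the CSV:

    import csv
    import gzip
    import re

    pattern = re.compile(r'invoice-2023')  # hypothetical search pattern

    # Inventory reports are delivered as gzipped CSV files; bucket and key
    # are the first two columns in the default configuration.
    with gzip.open('inventory-0001.csv.gz', 'rt', newline='') as f:
        for row in csv.reader(f):
            bucket, key = row[0], row[1]
            if pattern.search(key):
                print('s3://%s/%s' % (bucket, key))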

What is a recommended scalable DB platform to use in AWS for large amounts of volatile data sets - elasticsearch, Redis or DynamoDB?

Users of our platform will have large amounts of data stored on our system. Through an application, once connected, that data will be transferred to them and will no longer need to remain on our servers. There could potentially be hundreds or thousands of users connected at any given time, performing their downloads.
Here's the proposed architecture:
User management, configuration, and data download statistics will be maintained in a SQL Server database, while using either Redis or DynamoDB for the large data sets.
The reason for choosing either Redis or DynamoDB is based on cost - cheaper than running another SQL Server instance, and performance. The data format will be similar to a datamart - flat table with no joins.
Initially the queries would be simple - get all data for user X between a date range, and optionally delete.
Since we may want to add free-text searching over certain fields of that data, Elasticsearch may be a better option to use from the get-go.
I want this to be auto-scaling but not sure which database would be best to use for this scenario.
Here's some great discussion on Database + Search tier from AWS ReInvent:
https://youtu.be/K7o5OlRLtvU?t=1574
I would not pick Elasticsearch alone because it does not provide auto-scaling for write capacity. In fact, it's not trivial to increase the number of shards of an index. Secondly, it can only handle the JSON format, which could be an issue for you.
Redis could be a good idea because it is really fast, everything is done in RAM, and it provides keys with a limited time-to-live, which could be interesting for you. Unfortunately, if your data size exceeds the RAM capacity of your Amazon instance you will have to shard your Redis database, and Redis does not support that itself; you will have to deal with it in your application code. Moreover, as far as I know Redis does not handle complex queries. You will also need to map your data onto Redis data structures, which could be an issue for you.
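For example, the time-to-live behaviour is a single call with redis-py (the key name and TTL are made up):

    import redis

    r = redis.Redis(host='localhost', port=6379)
    payload = b'...serialized user data set...'

    # Keep the data set for one hour; Redis evicts it automatically afterwards.
    r.setex('user:123:dataset', 3600, payload)

    value = r.get('user:123:dataset')  # returns None once the TTL has expired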
DynamoDB handles auto-scaling really well, but on the other hand it is a key/value database, so queries like "get all data for user X between a date range" only work if you model the user as the hash key and the date as the range key; arbitrary ad-hoc queries are not possible. DynamoDB also lets you save your data in almost any format.
The solution would be to use either DynamoDB or Redis, depending on the size of your data, and to use Elasticsearch to index only the metadata for each key (user and dates). That way your index stays small, and if you lose the ability to index because Elasticsearch gets too busy, you keep the ability to save your users' data.
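If you do go with DynamoDB, the "all data for user X between a date range" access pattern maps onto a hash key of the user id plus a range key of the timestamp; a sketch with boto3 (the table and attribute names are hypothetical):

    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource('dynamodb').Table('UserData')  # hypothetical table

    # Assumes the table was created with user_id as the hash key
    # and created_at as the range key.
    resp = table.query(
        KeyConditionExpression=Key('user_id').eq('user-123') &
                               Key('created_at').between('2015-01-01', '2015-01-31'))

    for item in resp['Items']:
        # ... hand the item to the download, then optionally delete it
        table.delete_item(Key={'user_id': item['user_id'],
                               'created_at': item['created_at']})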

S3 or EBS for storing data in flat files

I have flat files in which I store data and retrieve it instead of storing it in a database. This is temporary and may last for a couple of months. I was wondering if I should be using EBS or S3. EBS is mainly used for I/O and S3 for content delivery, but S3 is a pay-as-you-go model while with EBS you have to pay for the volume you provision?
Please advise: which one is better?
S3 sounds like it's more appropriate for your use case.
S3 is object storage. Think of it as an Amazon-run file server. (Objects are not exactly equal to files, but it's close enough here.) You tell S3 to put a file, it'll store it. You tell S3 to get a file, it'll return it. You tell S3 to delete it, it's gone. This is easy to work with and very scalable.
EBS is block storage. Think of it as an Amazon-run external hard drive. You can plug an EBS volume into an EC2 virtual machine, or you can access it over the Internet via AWS Storage Gateway. Like an external hard drive, you can only plug it into one computer at a time. The size is set up front, and while there are ways to grow and shrink it, you're paying for all the bits all the time. It's also much more complex than S3, since it has to provide strong consistency guarantees for the entire volume, instead of just on a file-by-file basis.
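In code, the whole S3 interaction model boils down to roughly this (bucket and key names are made up):

    import boto3

    s3 = boto3.client('s3')

    # put: store an object under a key
    s3.put_object(Bucket='my-data-bucket', Key='reports/2015-01.csv',
                  Body=b'col1,col2\n1,2\n')

    # get: retrieve it again
    body = s3.get_object(Bucket='my-data-bucket', Key='reports/2015-01.csv')['Body'].read()

    # delete: and it's gone
    s3.delete_object(Bucket='my-data-bucket', Key='reports/2015-01.csv')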
To build on the good answer from willglynn. If you are interacting with the data regularly, or need more file-system-like access you might consider EBS more strongly.
If the amount of data is relatively small and you read and write to the data store regularly, you might consider something like ElastiCache for in-memory storage, which would likely be superior performance-wise to using S3 or EBS.
Similarly, you might look at DynamoDB for document-type storage, especially if you need to be able to search/filter across your data objects.
Point 1) You can use both S3 and EBS for this option. If you want reduced latency and your file sizes are bigger, then EBS is the better option.
Point 2) If you want lower costs, then S3 is a better option.
From what you describe, S3 will be the most cost-effective and likely easiest solution.
Pros to S3:
1. You can access the data from anywhere. You don't need to spin up an EC2 instance.
2. Crazy data durability numbers.
3. Nice versioning story around buckets.
4. Cheaper than EBS
Pros to EBS
1. Handy to have the data on a file system in EC2. That lets you do normal processing with the Unix pipeline.
2. Random Access patterns work as you would expect.
3. It's a drive. Everyone knows how to deal with files on drives.
If you want to get away from a flat file, DynamoDB provides a nice set of interfaces for putting lots and lots of rows into a table, then running operations against those rows.
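For example, loading rows from a flat file into DynamoDB is only a few lines with boto3's batch writer (the file format and table name are assumptions):

    import csv
    import boto3

    table = boto3.resource('dynamodb').Table('FlatFileRows')  # hypothetical table

    # Assumes a CSV whose 'id' column serves as the partition key.
    with open('data.csv', newline='') as f, table.batch_writer() as batch:
        for row in csv.DictReader(f):
            batch.put_item(Item=row)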

Riak backup solution for a single bucket

What are your recommendations for solutions that allow backing up [either by streaming or snapshot] a single riak bucket to a file?
Backing up just a single bucket is going to be a difficult operation in Riak.
All of the solutions will boil down to the following two steps:
List all of the objects in the bucket. This is the tricky part, since there is no "manifest" or a list of contents of any bucket, anywhere in the Riak cluster.
Issue a GET to each one of those objects from the list above, and write it to a backup file. This part is generally easy, though for maximum performance you want to make sure you're issuing those GETs in parallel, in a multithreaded fashion, and using some sort of connection pooling.
As far as listing all of the objects, you have one of three choices.
One is to do a Streaming List Keys operation on the bucket via HTTP (e.g. /buckets/bucket/keys?keys=stream) or Protocol Buffers -- see http://docs.basho.com/riak/latest/dev/references/http/list-keys/ and http://docs.basho.com/riak/latest/dev/references/protocol-buffers/list-keys/ for details. Under no circumstances should you do a non-streaming regular List Keys operation. (It will hang your whole cluster, and will eventually either time out or crash once the number of keys grows large enough).
Two is to issue a Secondary Index (2i) query to get that object list. See http://docs.basho.com/riak/latest/dev/using/2i/ for discussion and caveats.
And three would be if you're using Riak Search and can retrieve all of the objects via a single paginated search query. (However, Riak Search has a query result limit of 10,000 results, so, this approach is far from ideal).
For an example of a standalone app that can backup a single bucket, take a look at Riak Data Migrator, an experimental Java app that uses the Streaming List Keys approach combined with efficient parallel GETs.
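A rough sketch of those two steps in Python, assuming the official riak client package and its stream_keys()/get() calls, writing one JSON document per line to a backup file:

    import json
    from concurrent.futures import ThreadPoolExecutor

    import riak  # official Basho Python client (assumed available)

    client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
    bucket = client.bucket('my_bucket')  # hypothetical bucket name

    def fetch(key):
        obj = bucket.get(key)
        return {'key': key, 'value': obj.data}

    with open('my_bucket_backup.jsonl', 'w') as out, \
         ThreadPoolExecutor(max_workers=16) as pool:
        # Step 1: stream the key list (never run a non-streaming list keys on a big bucket).
        for batch in bucket.stream_keys():
            # Step 2: GET the objects for this batch in parallel and append them to the file.
            for record in pool.map(fetch, batch):
                out.write(json.dumps(record) + '\n')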
The Basho contrib repository has an Erlang solution for backing up a single bucket. It is a custom function, but it should do the trick.
http://contrib.basho.com/bucket_exporter.html
As far as I know, there's no automated solution for backing up a single bucket in Riak. You'd have to use the riak-admin command-line tool to back up a single physical node. You could write something to retrieve all keys in a single bucket, using low R values if you want it to be fast but less safe (R = 1).
Buckets are a logical namespace; all of the keys are stored in the same Bitcask structure. That's why the only way to back up just a single bucket is to write a tool to stream the keys yourself.

S3 to EC2 Performance for fetching large numbers of small files

I have a large collection of data chunks sized 1kB (on the order of several hundred million), and need a way to store and query these data chunks. The data chunks are added, but never deleted or updated. Our service is deployed on the S3/EC2 platform.
I know Amazon SimpleDB exists, but I want a solution that is platform agnostic (in case we need to move out of AWS for example).
So my question is: what are the pros and cons of these two options for storing and retrieving data chunks? How would the performance compare?
Store the data chunks as files on S3 and GET them when needed
Store the data chunks on a MySQL Server cluster
Would there be that much of a performance difference?
I tried using S3 as a sort of "database" using tiny XML files to hold my structured data objects, and relying on the S3 "keys" to look up these objects.
The performance was unacceptable, even from EC2 - the latency to S3 is just too high.
Running MySQL on an EBS device will be an order of magnitude faster, even with so many records.
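If you want numbers for your own setup before committing, the per-object latency from an EC2 instance is easy to measure; a small sketch with boto3 (the bucket and keys are hypothetical):

    import time
    import boto3

    s3 = boto3.client('s3')
    keys = ['chunks/%06d.xml' % i for i in range(100)]  # hypothetical ~1kB objects

    start = time.perf_counter()
    for key in keys:
        s3.get_object(Bucket='my-chunk-bucket', Key=key)['Body'].read()
    elapsed = time.perf_counter() - start

    print('average latency per GET: %.1f ms' % (1000 * elapsed / len(keys)))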
Do you need to provide access to these data chunks directly to the users of your application? If not, then S3 and HTTP GET requests are overkill. Bear in mind also that S3 is a secured service, so the overhead of every GET request (for just 1KB of data) will be considerable.
A MySQL server cluster would be a better idea, but to run it in EC2 you need to employ Elastic Block Storage. Finally, do not rule out SimpleDB. It is perhaps the best solution for your problem. Design your system carefully and you will be able to easily migrate to other database systems (distributed or relational) in the future.