riak backup solution for a single bucket - backup

What are your recommendations for solutions that allow backing up [either by streaming or snapshot] a single riak bucket to a file?

Backing up just a single bucket is going to be a difficult operation in Riak.
All of the solutions boil down to the following two steps (a rough sketch combining both is shown at the end of this answer):
1. List all of the objects in the bucket. This is the tricky part, since there is no "manifest" or list of contents for any bucket anywhere in the Riak cluster.
2. Issue a GET for each one of those objects and write it to a backup file. This part is generally easy, though for maximum performance you want to issue those GETs in parallel, in a multithreaded fashion, and use some sort of connection pooling.
As far as listing all of the objects, you have one of three choices.
One is to do a Streaming List Keys operation on the bucket via HTTP (e.g. /buckets/bucket/keys?keys=stream) or Protocol Buffers -- see http://docs.basho.com/riak/latest/dev/references/http/list-keys/ and http://docs.basho.com/riak/latest/dev/references/protocol-buffers/list-keys/ for details. Under no circumstances should you do a non-streaming regular List Keys operation. (It will hang your whole cluster, and will eventually either time out or crash once the number of keys grows large enough).
Two is to issue a Secondary Index (2i) query to get that object list. See http://docs.basho.com/riak/latest/dev/using/2i/ for discussion and caveats.
And three would be if you're using Riak Search and can retrieve all of the objects via a single paginated search query. (However, Riak Search has a query result limit of 10,000 results, so this approach is far from ideal.)
For an example of a standalone app that can backup a single bucket, take a look at Riak Data Migrator, an experimental Java app that uses the Streaming List Keys approach combined with efficient parallel GETs.
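As an illustration only (not the Data Migrator's code), here is a minimal Python sketch of the same approach, assuming the official Riak Python client, a local node, and a placeholder bucket name; error handling and retries are omitted:

import json
from concurrent.futures import ThreadPoolExecutor
from riak import RiakClient

client = RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
bucket = client.bucket('my_bucket')   # placeholder bucket name

def fetch(key):
    obj = bucket.get(key)              # one GET per key
    return key, obj.encoded_data       # the raw stored bytes

with open('my_bucket.backup.jsonl', 'w') as out, \
     ThreadPoolExecutor(max_workers=16) as pool:      # parallel GETs
    for keys in bucket.stream_keys():                 # streaming list-keys
        for key, data in pool.map(fetch, keys):
            record = {'key': key, 'value': data.decode('utf-8', 'replace')}
            out.write(json.dumps(record) + '\n')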

The Basho function contrib has an Erlang solution for backing up a single bucket. It is a custom function, but it should do the trick.
http://contrib.basho.com/bucket_exporter.html

As far as I know, there's no automated solution to back up a single bucket in Riak. You'd have to use the riak-admin command line tool to back up a single physical node. You could write something that retrieves all keys in a single bucket, using a low R value (r = 1) if you want it to be fast rather than safe.
Buckets are a logical namespace; all of the keys are stored in the same bitcask structure. That's why the only way to get just a single bucket is to write a tool to stream the objects yourself.

Related

How to save millions of files in S3 so that arbitrary future searches on key/path values are fast

My company has millions of files in an S3 bucket, and every so often I have to search for files whose keys/paths contain some text. This is an extremely slow process because I have to iterate through all files.
I can't use prefix because the text of interest is not always at the beginning. I see other posts (here and here) that say this is a known limitation in S3's API. These posts are from over 3 years ago, so my first question is: does this limitation still exist?
Assuming the answer is yes, my next question is: given that I anticipate arbitrary regex-like searches over millions of S3 files, are there established best practices for workarounds? I've seen some people say that you can store the key names in a relational database, Elasticsearch, or a flat file. Are any of these approaches more commonplace than the others?
Also, out of curiosity, why hasn't S3 supported such a basic use case in a service (S3) that is such an established core product of the overall AWS platform? I've noticed that GCS on Google Cloud has a similar limitation. Is it just really hard to do searches on key name strings well at scale?
S3 is an object store, conceptually similar to a file system. I'd never try to make a database-like environment based on file names in a file system nor would I in S3.
Nevertheless, if this is what you have, then I would start by running code to get all of the current file names into a database of some sort. DynamoDB cannot query by regular expression, but PostgreSQL, MySQL, Aurora, and Elasticsearch all can. So start by listing every file and putting the file name and S3 location into a database-like structure. Then create a Lambda that is notified of any changes (via S3 event notifications) and does the appropriate thing with your backing store when a file is added or deleted.
Depending on your needs, Elasticsearch is super flexible with queries and possibly better suited for these types of queries, but a traditional relational database can be made to work too.
Lastly, you'll need an interface to the backing store to query. That will likely require some sort of server. It could be as simple as API Gateway in front of a Lambda, or something far more complex.
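As a minimal sketch of the "list every key into a queryable store" step (not the answer author's code; the bucket name and regex are placeholders), using boto3 and SQLite:

import re
import sqlite3
import boto3

s3 = boto3.client('s3')
db = sqlite3.connect('s3_keys.db')
db.execute('CREATE TABLE IF NOT EXISTS objects (key TEXT PRIMARY KEY)')

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-company-bucket'):
    db.executemany('INSERT OR REPLACE INTO objects VALUES (?)',
                   [(obj['Key'],) for obj in page.get('Contents', [])])
db.commit()

# An arbitrary "contains / regex" search that S3's own API cannot do directly.
pattern = re.compile(r'invoice-\d{4}')
matches = [row[0] for row in db.execute('SELECT key FROM objects')
           if pattern.search(row[0])]
print(len(matches), 'matching keys')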
You might consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file containing a list of all objects in the bucket.
You could then load this file into a database, or even write a script to parse it. Or possibly even just play with it in Excel.
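If you take the "write a script to parse it" route, here is a hedged sketch that assumes the default inventory fields (the data files are gzip-compressed CSV with no header row, where the first two columns are bucket and key; the file name and search text are placeholders):

import csv
import gzip

with gzip.open('inventory-data-file.csv.gz', 'rt', newline='') as f:
    for row in csv.reader(f):
        bucket, key = row[0], row[1]
        if 'text-of-interest' in key:
            print(key)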

When would you want to make S3 object keys similar

So S3 uses the object key when partitioning data, and you should put some randomness in your keys to distribute workloads across multiple partitions. My question is: are there any scenarios in which you would want to have similar keys? And if not, why would AWS use the key to partition your data instead of randomly partitioning the data itself?
I ask because this seems like an odd design: it makes it easy for developers to make partitioning mistakes if they generate keys that follow a pattern, and it also prevents developers from naming keys in a logical manner, since that would inevitably produce a pattern and cause the data to be partitioned poorly.
You appear to be referring to Request Rate and Performance Considerations - Amazon Simple Storage Service, which states:
The Amazon S3 best practice guidelines in this topic apply only if you are routinely processing 100 or more requests per second. If your typical workload involves only occasional bursts of 100 requests per second and fewer than 800 requests per second, you don't need to follow these guidelines.
This is unlikely to affect most applications, but if applications do have such high traffic, then spreading requests across the keyname space can improve performance.
AWS has not explained why they have designed Amazon S3 in this manner.
So S3 uses the object key in partitioning data
Wait. Your question seems premised on this assumption, but it isn't correct.
S3 does not use the object key to partition the data. That would indeed, as you suggest, be a very "odd design" (or worse).
S3 uses the object key to partition the index of objects in the bucket -- otherwise the index would be stored in an order that would not support enumerating the object keys in sorted order, which would also eliminate the ability to list objects by prefix or identify common prefixes using delimiters -- or there would need to be a secondary index, which would just compound the potential scaling issue and move the same problem down one level.
The case for similar keys is when you want to find objects with a common prefix (in the same "folder") on demand. Storing log files is an easy example, yyyy/mm/dd/.... Note that when the various services store log files in buckets for you (S3 logs, CloudFront, ELB), the object keys are sequential like this, because the date and time are in the object key.
When S3 does a partition split, only the index is split. The data is already durably stored and doesn't move. The potential performance considerations are related to the performance of the index, not that of the actual storage of the object data.
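To make the common-prefix case concrete, here is a hedged boto3 sketch that lists one day's log objects by their shared prefix (the bucket name and prefix are placeholders):

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-log-bucket', Prefix='2017/06/15/'):
    for obj in page.get('Contents', []):
        print(obj['Key'], obj['Size'])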

What is a recommended scalable DB platform to use in AWS for large amounts of volatile data sets - elasticsearch, Redis or DynamoDB?

Users of our platform will have large amounts of stored data on our system. Through an application, once connected, that data will be transferred to them and no longer need to remain on our servers. There could potentially be hundreds or thousands of users connected at any given time, performing their downloads.
Here's the proposed architecture:
User management, configuration, and data download statistics will be maintained in a SQL Server database, while using either Redis or DynamoDB for the large data sets.
The reason for choosing either Redis or DynamoDB is based on cost - cheaper than running another SQL Server instance, and performance. The data format will be similar to a datamart - flat table with no joins.
Initially the queries would be simple - get all data for user X between a date range, and optionally delete.
Since we may want to add free-text searching for certain fields of that data, using Elasticsearch may be a better option from the get-go.
I want this to be auto-scaling but not sure which database would be best to use for this scenario.
Here's some great discussion on the database + search tier from AWS re:Invent:
https://youtu.be/K7o5OlRLtvU?t=1574
I would not use Elasticsearch alone, because it does not provide auto-scaling for write capacity; in fact, it's not trivial to increase the number of shards of an index. Secondly, it only handles JSON documents, which could be an issue for you.
Redis could be a good idea because it is really fast (everything is done in RAM), and it provides keys with a limited time-to-live, which could be interesting for you. Unfortunately, if your data size exceeds the RAM capacity of your Amazon instance, you will have to shard your Redis database, and Redis does not support that itself, so you will have to deal with it in your application code. Moreover, as far as I know, Redis does not handle complex queries. You will also need to map your data onto Redis data structures, which could be an issue for you.
DynamoDB handles auto-scaling really well, but it is a key/value database, so a query like "get all data for user X between a date range" only works if you design the table's keys for it (for example, the user as the partition key and a timestamp as the sort key); arbitrary ad-hoc queries are not possible. DynamoDB also allows you to save your data in any format.
One solution would be to use either DynamoDB or Redis, depending on the size of your data, and to use Elasticsearch to index each key with only its metadata (user and dates). That way your index stays small, and if you temporarily lose the ability to index because Elasticsearch gets too busy, you keep the ability to save users' data.
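For reference, a hedged sketch of that date-range query under such a key design, using boto3 (the table and attribute names are assumptions, not from the question):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('user_data')
resp = table.query(
    KeyConditionExpression=Key('user_id').eq('user-123') &
                           Key('created_at').between('2017-01-01', '2017-01-31')
)
items = resp['Items']
print(len(items), 'items in range')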

DynamoDb + S3 + CloudSearch + Redis

I'm currently creating a scheme for my application, and I'm wondering if my thinking is right.
Example : Ecommerce site
In DynamoDB, I would put products (product_id, meta-data link to S3).
S3, I would use for storing the Search Data Format (SDF/JSON) documents
(product name, product description, price, etc.).
Amazon CloudSearch would be used to index documents in S3, and to be able to search them.
Redis would be used to cache results
Is my scheme right? Can S3 be a good "database"?
Is DynamoDb here even needed?
If you are thinking that S3 would just be the source of record for your products and you are not expecting heavy reads/writes, then it COULD work, but you have to recognize that it will be far, far slower than using a real database. Not just 1-2x slower but MANY magnitudes slower. We use S3 for storing audit data for realtime data stored in Postgres - it works a charm, but this is data that is written once and read rarely. Retrieval times, when it does have to retrieve audit records, are > 50 ms. This type of speed is usually not acceptable when you need to manipulate multiple records at one time.
If you are going to be using dynamoDB anyway, why not just use that to store what you'd be storing on s3? Trying to adhere to the concept of keep it simple, I would use the following stack:
dynamoDB to be the system of record and to do some searches
CloudSearch for more flexible searching than what DynamoDB can provide
S3 for static files (product images, etc.)
And again, to keep things simple, skip Redis for caching if you are already using DynamoDB and don't plan on using any of Redis' specialized datatypes - i.e. if your caching will be nothing more than keys to strings. Use Redis if you plan on taking advantage of its other datatypes or if you want to have a cache closer to your app - i.e. you plan on running Redis on the web server.
Dynamo is suited to storing write-intensive data. If your application does not require intensive writes over product_id and meta-data, I think RDS/MySQL is better.
When designing an application, you really should keep things as simple as possible from the beginning. It will always get worse with time :)
S3 is not a good DB. It has not been designed for this and is too slow. It's for file storage only. If you want to stick with DynamoDB, you should put all your products info in it, including the metadata.
CloudSearch may be a good option. You can also build your own "indexes" on top of DynamoDB. It requires more design and programming but might be worth considering. Here is a link to an excellent blog post on this matter: http://blog.coredumped.org/2012/01/amazon-dynamodb.html.
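As a rough illustration of the hand-rolled index idea (this is not from the linked post; the table and attribute names are made up), using boto3: alongside each product item, an index item keyed by category keeps the product ids in a string set, and is maintained at write time.

import boto3

dynamodb = boto3.resource('dynamodb')
products = dynamodb.Table('products')
index = dynamodb.Table('products_by_category')

def put_product(product_id, category, name):
    products.put_item(Item={'product_id': product_id,
                            'category': category,
                            'name': name})
    # Maintain the index on write so the category lookup is a single read.
    index.update_item(
        Key={'category': category},
        UpdateExpression='ADD product_ids :ids',
        ExpressionAttributeValues={':ids': {product_id}})

def products_in_category(category):
    item = index.get_item(Key={'category': category}).get('Item', {})
    return item.get('product_ids', set())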
So,
Is DynamoDB even needed: yes, or RDS, Mongo, ... any real DB, depending on your needs.
Is S3 a good DB: I don't think so.

Is this a good use-case for Redis on a ServiceStack REST API?

I'm creating a mobile app and it requires an API service backend to get/put information for each user. I'll be developing the web service on ServiceStack, but was wondering about the storage. I love the idea of a fast in-memory caching system like Redis, but I have a few questions:
I created a sample schema of what my data store should look like. Does this seems like it's a good case for using Redis as opposed to a MySQL DB or something like that?
schema http://www.miles3.com/uploads/redis.png
How difficult is the setup for persisting the Redis store to disk or is it kind of built-in when you do writes to the store? (I'm a newbie on this NoSQL stuff)
I currently have my setup on AWS using a Linux micro instance (because it's free for a year). I know many factors go into this answer, but in general will this be enough for my web service and Redis? Since Redis is in-memory will that be enough? I guess if my mobile app skyrockets (hey, we can dream right?) then I'll start hitting the ceiling of the instance.
What to think about when designing a NoSQL Redis application
1) To develop correctly in Redis you should be thinking more about how you would structure the relationships in your C# program, i.e. with the C# collection classes, rather than a relational model meant for an RDBMS. The better mindset is to think about data storage as you would in a document database rather than in RDBMS tables. Essentially everything gets blobbed in Redis via a key (index), so you just need to work out which of your entities are primary (i.e. aggregate roots) and get kept in their own 'key namespace', and which are non-primary entities, i.e. simply metadata that should just get persisted with its parent entity.
Examples of Redis as a primary Data Store
Here is a good article that walks through creating a simple blogging application using Redis:
http://www.servicestack.net/docs/redis-client/designing-nosql-database
You can also look at the source code of RedisStackOverflow for another real world example using Redis.
Basically you would need to store and fetch the items of each type separately.
var redisUsers = redis.As<User>();  // typed client for the User entity
var user = redisUsers.GetById(1);   // fetch the User whose Id is 1
var userIsWatching = redisUsers.GetRelatedEntities<Watching>(user.Id);  // fetch the Watching entities related to this user
The way you store relationships between entities is by making use of Redis sets; e.g. you can store the Users/Watchers relationship conceptually with:
SET["ids:User>Watcher:{UserId}"] = [{watcherId1},{watcherId2},...]
Redis is schema-less and idempotent
Storing ids in Redis sets is idempotent, i.e. you can add watcherId1 to the same set multiple times and it will only ever have one occurrence of it. This is nice because it means you never need to check for the existence of the relationship and can freely keep adding related ids as if they had never been added before.
Related: writing to or reading from a Redis collection (e.g. a List) that does not exist is the same as writing to an empty collection, i.e. a list gets created on the fly when you add an item to it, whilst accessing a non-existent list simply returns 0 results. This is a friction-free productivity win, since you don't have to define your schemas up front in order to use them. Should you need to, though, Redis provides the EXISTS operation to determine whether a key exists, and a TYPE operation so you can determine its type.
Create your relationships/indexes on your writes
One thing to remember is that because there are no implicit indexes in Redis, you will generally need to set up the indexes/relationships needed for reading yourself during your writes. Basically, you need to think about all your query requirements up front and ensure you set up the necessary relationships at write time. The RedisStackOverflow source code above is a good example of this.
Note: the ServiceStack.Redis C# provider assumes you have a unique field called Id that is its primary key. You can configure it to use a different field with the ModelConfig.Id() config mapping.
Redis Persistence
2) Redis supports two persistence modes out of the box: RDB and Append Only File (AOF). RDB writes routine snapshots, whilst the Append Only File acts like a transaction journal recording all the changes in between snapshots - I recommend enabling both until you're comfortable with what each does and what your application needs. You can read all about Redis persistence at http://redis.io/topics/persistence.
Note: Redis also supports trivial replication, which you can read more about at: http://redis.io/topics/replication
Redis loves RAM
3) Since Redis operates predominantly in memory, the most important thing is having enough RAM to hold your entire dataset, plus a buffer for when it snapshots to disk. Redis is very efficient, so even a small AWS instance will be able to handle a lot of load - what you want to look for is having enough RAM.
Visualizing your data with the Redis Admin UI
Finally if you're using the ServiceStack C# Redis Client I recommend installing the Redis Admin UI which provides a nice visual view of your entities. You can see a live demo of it at:
http://servicestack.net/RedisAdminUI/AjaxClient/