I am using DynamoDB tables with keys and throughput optimized for application use cases. To support other ad hoc administrative and reporting use cases I want to keep a complete backup in S3 (a day-old backup is OK). I cannot afford to scan the entire DynamoDB tables to do the backup, and the keys I have are not sufficient to find out what is "new". How do I do incremental backups? Do I have to modify my DynamoDB schema, or add extra tables just to do this? Any best practices?
Update:
DynamoDB Streams solves this problem.
DynamoDB Streams captures a time-ordered sequence of item-level
modifications in any DynamoDB table, and stores this information in a
log for up to 24 hours. Applications can access this log and view the
data items as they appeared before and after they were modified, in
near real time.
I see two options:
Generate the current snapshot. You'll have to read from the table to do this, which you can do at a very slow rate to stay under your capacity limits (Scan operation). Then, keep an in-memory list of updates performed for some period of time. You could put these in another table, but you'll have to read those, too, which would probably cost just as much. This time interval could be a minute, 10 minutes, an hour, whatever you're comfortable losing if your application exits. Then, periodically grab your snapshot from S3, replay these changes on the snapshot, and upload your new snapshot. I don't know how large your data set is, so this may not be practical, but I've seen this done with great success for data sets up to 1-2GB.
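The replay step of this first option can be sketched as follows. This is only an illustration under assumptions: the snapshot is modelled as a dict keyed by a string primary key, and each recorded update is a hypothetical `(key, new_item)` pair where `None` marks a delete. Neither shape comes from DynamoDB itself.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

# Hypothetical change record: (primary key, new item); None means "deleted".
Change = Tuple[str, Optional[Dict[str, Any]]]


def replay_changes(
    snapshot: Dict[str, Dict[str, Any]],
    changes: Iterable[Change],
) -> Dict[str, Dict[str, Any]]:
    """Return a new snapshot with the recorded changes applied in order."""
    updated = dict(snapshot)  # leave the downloaded snapshot untouched
    for key, new_item in changes:
        if new_item is None:
            updated.pop(key, None)  # delete; ignore keys already absent
        else:
            updated[key] = new_item  # insert or overwrite
    return updated
```

The result would then be uploaded to S3 as the new snapshot. The order of the change list matters, since later updates to the same key must win.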
Add read throughput and back up your data using a full scan every day. You say you can't afford it, but it isn't clear if you mean paying for capacity, or that the scan would use up all the capacity and the application would begin failing. The only way to pull data out of DynamoDB is to read it, either strongly or eventually consistent. If the backup is part of your business requirements, then I think you have to determine if it's worth it. You can self-throttle your read by examining the ConsumedCapacityUnits property on your results. The Scan operation has a Limit property that you can use to limit the number of items read in each operation. Scan also uses eventually consistent reads by default, which are half the price of strongly consistent reads.
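A minimal sketch of such a self-throttled scan, assuming a boto3-style low-level DynamoDB client is passed in as `client`. In the current API the consumed capacity comes back under `ConsumedCapacity` when `ReturnConsumedCapacity` is requested. The function name, `max_rcu_per_sec` and `page_limit` are illustrative choices, not values from the answer.

```python
import time
from typing import Any, Callable, Dict, Iterator


def throttled_scan(
    client: Any,
    table_name: str,
    max_rcu_per_sec: float,
    page_limit: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Dict[str, Any]]:
    """Yield every item of the table while staying near a read-capacity budget."""
    kwargs: Dict[str, Any] = {
        "TableName": table_name,
        "Limit": page_limit,               # cap items evaluated per request
        "ReturnConsumedCapacity": "TOTAL",  # ask DynamoDB to report the cost
    }
    while True:
        page = client.scan(**kwargs)
        yield from page.get("Items", [])
        consumed = page.get("ConsumedCapacity", {}).get("CapacityUnits", 0.0)
        if consumed > 0:
            # Pause long enough that the average rate stays under the budget.
            sleep(consumed / max_rcu_per_sec)
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break  # no more pages
        kwargs["ExclusiveStartKey"] = last_key
```

With a real client this might be called as `throttled_scan(boto3.client("dynamodb"), "MyTable", max_rcu_per_sec=50)`, where the table name and budget are placeholders.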
You can now use DynamoDB Streams to have data persisted into another table or to maintain a copy of the data in another datastore.
https://aws.amazon.com/blogs/aws/dynamodb-streams-preview/
For incremental backups, you can associate your DynamoDB Stream with a Lambda function to automatically trigger code for every data update (e.g., to copy the data to another store like S3).
A Lambda function you can hook up to DynamoDB for incremental backups:
https://github.com/PageUpPeopleOrg/dynamodb-replicator
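As a rough illustration of the idea (not the code of the linked project), a stream-triggered function could mirror each change into a versioned S3 bucket. The sketch assumes the stream view type includes new images, that `s3` is a boto3-style S3 client, and that deriving the object key from the item's primary key is acceptable. The function name and key scheme are hypothetical.

```python
import json
from typing import Any, Dict


def backup_stream_records(event: Dict[str, Any], s3: Any, bucket: str) -> int:
    """Mirror each DynamoDB stream record into S3; return the records handled."""
    handled = 0
    for record in event.get("Records", []):
        change = record["dynamodb"]
        # Deterministic object key built from the item's primary key
        # (an assumption of this sketch, chosen so updates overwrite in place).
        object_key = json.dumps(change["Keys"], sort_keys=True)
        if record["eventName"] == "REMOVE":
            # In a versioned bucket this only adds a delete marker.
            s3.delete_object(Bucket=bucket, Key=object_key)
        else:  # INSERT or MODIFY; requires NEW_IMAGE or NEW_AND_OLD_IMAGES
            body = json.dumps(change["NewImage"], sort_keys=True)
            s3.put_object(Bucket=bucket, Key=object_key, Body=body.encode("utf-8"))
        handled += 1
    return handled
```

Inside Lambda, the handler would call this with `boto3.client("s3")` and the bucket name. Because the bucket is versioned, every overwrite keeps the previous version, which is what makes the backup incremental.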
I've provided a detailed walkthrough on how you can use DynamoDB Streams, Lambda and S3 versioned buckets to create incremental backups for your data in DynamoDB on my blog:
https://www.abhayachauhan.com/category/aws/dynamodb/dynamodb-backups
Alternatively, DynamoDB just released On-Demand backups and restores. They aren't incremental, but full backup snapshots.
Check out https://www.abhayachauhan.com/2017/12/dynamodb-scheduling-on-demand-backups/ for more information.
HTH
On November 29th, 2017, On-Demand Backup was introduced. It lets you create backups directly in DynamoDB essentially instantly without consuming any capacity. Here are a few snippets from the blog post:
This feature is designed to help you to comply with regulatory requirements for long-term archival and data retention. You can create a backup with a click (or an API call) without consuming your provisioned throughput capacity or impacting the responsiveness of your application. Backups are stored in a highly durable fashion and can be used to create fresh tables.
...
The backup is available right away! It is encrypted with an Amazon-managed key and includes all of the table data, provisioned capacity settings, Local and Global Secondary Index settings, and Streams. It does not include Auto Scaling or TTL settings, tags, IAM policies, CloudWatch metrics, or CloudWatch Alarms.
You may be wondering how this operation can be instant, given that some of our customers have tables approaching half of a petabyte. Behind the scenes, DynamoDB takes full snapshots and saves all change logs. Taking a backup is as simple as saving a timestamp along with the current metadata for the table.
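The "API call" mentioned in the quote can be sketched like this, assuming a boto3-style DynamoDB client is passed in. The date-based backup name is an illustrative convention, not something the blog post prescribes.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def create_daily_backup(
    client: Any,
    table_name: str,
    now: Optional[datetime] = None,
) -> str:
    """Request an on-demand backup and return its ARN."""
    now = now or datetime.now(timezone.utc)
    # Hypothetical naming scheme: <table>-<YYYY-MM-DD>
    backup_name = f"{table_name}-{now:%Y-%m-%d}"
    response = client.create_backup(TableName=table_name, BackupName=backup_name)
    return response["BackupDetails"]["BackupArn"]
```

Running this once a day from a scheduled job would give a series of full snapshots. Old backups are not removed automatically, so a retention step would be needed as well.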
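A scan operation in DynamoDB walks the table in a fixed order, and each response includes a "LastEvaluatedKey" marking where it stopped. So the idea is to store the key of the last record saved during the previous backup and pass it as the "ExclusiveStartKey" parameter (the request-side counterpart of "LastEvaluatedKey") of the scan request when doing the next backup, so that the scan resumes from that point. This returns only records created since the last backup if the scan order matches creation order. Be careful here: DynamoDB orders scan results by an internal hash of the hash key, not by the key's value, so an auto-incremented integer hash key does not by itself guarantee that.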
Our analytical ETL stores model results in a Snowflake table (2 columns: user-id, score).
We need to use that data in our low-latency service, and Snowflake is not suitable for that kind of latency.
I thought about storing that table in a Redis collection.
I would like some ideas on how to keep Redis in sync with the table.
Any other solution to the latency problem is also welcome.
Well, it depends on how frequently your Snowflake data is updated, what process is updating the data (snowplow or some external tool that you can hook into), and what latency you are prepared to accept between the Snowflake data changing and Redis having the new values.
You could add a task to export the changes to S3 and then have a Lambda watching the bucket/folder that pushes the changes into Redis.
You could have the tool that loads the changes also pull the changes out and push them into Redis. (We did a form of this.)
You could have something poll the Snowflake data (seems the worst idea) and push changes into Redis. If you are polling the main table, this sounds bad, but you could also use a multi-table insert/merge command, so that when you update the main table you also insert into a changes table or stream, and then read from that in your Redis sync.
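Whichever of these routes delivers the changes, the last step (pushing them into Redis) could look roughly like the sketch below. It assumes a redis-py style client, changes arriving as `(user_id, score)` pairs with `None` meaning the row was removed, and a hypothetical `score:` key prefix.

```python
from typing import Any, Iterable, Optional, Tuple


def apply_score_changes(
    redis_client: Any,
    changes: Iterable[Tuple[str, Optional[float]]],
    prefix: str = "score:",
) -> int:
    """Apply a batch of (user_id, score) changes to Redis in one round trip."""
    pipe = redis_client.pipeline()  # buffer commands, send them together
    count = 0
    for user_id, score in changes:
        key = f"{prefix}{user_id}"
        if score is None:
            pipe.delete(key)      # row removed upstream
        else:
            pipe.set(key, score)  # insert or update
        count += 1
    pipe.execute()
    return count
```

The low-latency service would then read a score with a single `GET score:<user-id>`.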
I'm planning to use Data Pipeline as a backup and recovery tool for our DynamoDB. We will be using Amazon's prebuilt pipelines to back up to S3, and the prebuilt recovery pipeline to recover to a new table in case of a disaster.
This will also serve a dual purpose of data archival for legal and compliance reasons. We have explored snapshots, but this can get quite expensive compared to S3. Does anyone have an estimate of how long it takes to back up a 1TB database? And how long it takes to recover a 1TB database?
I've read the Amazon docs, which say it can take up to 20 minutes to restore from a snapshot, but there is no mention of how long it takes with a data pipeline. Does anyone have any clues?
Does the newly released feature of exporting from DynamoDB to S3 do what you want for your use case? To use this feature, you must have continuous backups enabled though. Perhaps that will give you the short term backup you need?
It would be interesting to know why you're not planning to use the built-in backup mechanism. It offers point in time recovery and it is highly predictable in terms of cost and performance.
The Data Pipelines backup is unpredictable, will very likely cost more, and operationally it is much less reliable. Plus, getting a consistent snapshot (i.e., point in time) requires stopping the world. Speaking from experience, I don't recommend using Data Pipelines for backing up DynamoDB tables!
Regarding how long it takes to take a backup, that depends on a number of factors but mostly on the size of the table and the provisioned capacity you're willing to throw at it, as well as the size of the EMR cluster you're willing to work with. So, it could take anywhere from a minute to several hours.
Restoring time also depends on pretty much the same variables: provisioned capacity and total size. And it can also take anywhere from a minute to many hours.
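To put rough numbers on "depends on provisioned capacity and total size", here is a back-of-the-envelope lower bound. It assumes the documented unit sizes (one read capacity unit covers two eventually consistent 4 KB reads per second, one write capacity unit covers one 1 KB write per second) and ignores EMR cluster size, S3 transfer, item-size rounding and throttling, so real runs will take longer. The capacity figures are made up for the example.

```python
def min_scan_hours(table_bytes: int, read_capacity_units: float) -> float:
    """Lower bound for reading the whole table with eventually consistent reads."""
    bytes_per_second = read_capacity_units * 2 * 4 * 1024  # 2 x 4 KB per RCU
    return table_bytes / bytes_per_second / 3600


def min_restore_hours(table_bytes: int, write_capacity_units: float) -> float:
    """Lower bound for writing the whole table back."""
    bytes_per_second = write_capacity_units * 1024  # 1 KB per WCU
    return table_bytes / bytes_per_second / 3600


one_tb = 1024 ** 4
print(round(min_scan_hours(one_tb, 1_000), 1))      # 37.3
print(round(min_scan_hours(one_tb, 10_000), 1))     # 3.7
print(round(min_restore_hours(one_tb, 10_000), 1))  # 29.8
```

Even with these optimistic assumptions, a 1 TB table needs hours of dedicated capacity in each direction.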
Point in time backups offer consistent, predictable and most importantly reliable performance regardless of the size of the table: use that!
And if you're just interested in dumping the data from the table (i.e., not necessarily the restore part), use the new export to S3.
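A sketch of starting such an export, assuming a boto3-style DynamoDB client and that point-in-time recovery (continuous backups) is already enabled on the table. The function name and the optional prefix handling are illustrative.

```python
from typing import Any, Dict


def export_table(client: Any, table_arn: str, bucket: str, prefix: str = "") -> str:
    """Start an export of the table to S3 and return the export's ARN."""
    kwargs: Dict[str, Any] = {
        "TableArn": table_arn,
        "S3Bucket": bucket,
        "ExportFormat": "DYNAMODB_JSON",
    }
    if prefix:
        kwargs["S3Prefix"] = prefix  # optional folder inside the bucket
    response = client.export_table_to_point_in_time(**kwargs)
    return response["ExportDescription"]["ExportArn"]
```

The export runs asynchronously and does not consume the table's read capacity. The returned ARN can be used to check on its status later.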
Users of our platform will have large amounts of stored data on our system. Through an application, once connected, that data will be transferred to them and no longer need to remain on our servers. There could potentially be hundreds or thousands of users connected at any given time, performing their downloads.
Here's the proposed architecture:
User management, configuration, and data download statistics will be maintained in a SQL Server database, while using either Redis or DynamoDB for the large data sets.
The reason for choosing either Redis or DynamoDB is based on cost - cheaper than running another SQL Server instance, and performance. The data format will be similar to a datamart - flat table with no joins.
Initially the queries would be simple - get all data for user X between a date range, and optionally delete.
Since we may want to add free text searching for certain fields of that data, using Elasticsearch may be a better option from the get-go.
I want this to be auto-scaling, but I am not sure which database would be best for this scenario.
Here's some great discussion on Database + Search tier from AWS ReInvent:
https://youtu.be/K7o5OlRLtvU?t=1574
I would not use Elasticsearch alone because it does not provide auto-scaling for write capacity. In fact, it's not trivial to increase the number of shards of an index. Secondly, it can only handle the JSON format, which could be an issue for you.
Redis could be a good idea because it is really fast, everything is done in RAM, and it provides keys with a limited time-to-live, which could be interesting for you. Unfortunately, if your data size exceeds the RAM capacity of your Amazon instance, you will have to shard your Redis database. Redis does not do this for you, so you will have to handle it in your application code. Moreover, as far as I know, Redis does not handle complex queries. You will also need to save your data in a Redis data structure, which could be an issue for you.
DynamoDB handles auto-scaling really well, but on the other hand it is a key/value database, so it does not allow you to make queries like "get all data for user X between a date range". DynamoDB also allows you to save your data in any format.
The solution would be to use either DynamoDB or Redis, depending on the size of your data, and to use Elasticsearch to index your keys with only the metadata (user and dates). That way your index will be small, and if you lose the ability to index because Elasticsearch gets too busy, you keep the ability to save users' data.
I have been using Redis a lot lately, and I am really loving it. I am mostly familiar with persistence (RDB and AOF). I do have one concern. I would like to be able to selectively "archive" some of my data to disk (or cheaper storage) once it is no longer important. I don't really want to delete it because it might be valuable at some point.
All of my keys are named id_<id>_<someattribute>. So when I am done with id 4, I want to "archive" all keys that match id_4_*. I can view them quite easily with the command line, but I can't do anything with them, per se. I have quite a bit of data (very large bitmaps) associated with this data set, and frankly I can't afford the space once the id is no longer relevant or important.
If this were MySQL, I would have my different tables and would very easily just dump one to a .sql file and then drop the table. The actual .sql file isn't directly useful to me, but I could reimport the data if/when I need it. Or maybe I have two MySQL databases and I want to move one table to another database. Are there Redis corollaries to these processes? Is there some way to make an RDB or AOF file that is a subset of the data?
Any help or input on this matter would be appreciated! Thanks!
@Hoseong Hwang recently asked what I did, so I'm posting what I ended up doing.
It was really quite simple, actually. I benefited from the fact that my key space is segmented by user. All of my keys were of the structure user_<USERID>_<OTHERVALUES>. My archival needs were on a per-user basis: some users' data no longer needed to be kept in Redis.
So, I started up another instance of redis-server on another port locally (6380?) or on another machine; it makes no difference. Then I wrote a short script that basically just called KEYS user_<USERID>_* (I understand the blocking nature of KEYS; my key space is so small that it didn't matter, and you can use SCAN if that is an issue for you). Then, for each key, I ran MIGRATE to move it to that new redis-server instance. After they were all done, I did a SAVE to ensure that the RDB file for that instance was up to date. And now I have that RDB file, which contains just the content that I wanted to archive. I then terminated that temporary redis-server and the memory was reclaimed.
Now, keep that rdb file somewhere for cheap, safe keeping. And if you ever needed it again, doing the reverse of my process above to get those keys back into your main redis-server would be fairly straightforward.
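For anyone wanting to reproduce this, the script could look roughly like the following sketch. It is not the original script: it assumes a redis-py style client for the main instance, uses SCAN instead of KEYS, and moves keys in batches. The function name, batch size and timeout are arbitrary.

```python
from typing import Any, List


def archive_keys(
    source: Any,
    pattern: str,
    dest_host: str,
    dest_port: int,
    timeout_ms: int = 5000,
    batch_size: int = 100,
) -> int:
    """MIGRATE every key matching pattern to another redis-server; return the count."""
    moved = 0
    batch: List[Any] = []
    for key in source.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            # MIGRATE copies the keys to database 0 of the target and
            # deletes them from the source once the copy succeeds.
            source.migrate(dest_host, dest_port, batch, 0, timeout_ms)
            moved += len(batch)
            batch = []
    if batch:  # remaining partial batch
        source.migrate(dest_host, dest_port, batch, 0, timeout_ms)
        moved += len(batch)
    return moved
```

After a call such as `archive_keys(r, "user_4_*", "localhost", 6380)`, issuing SAVE on the temporary instance writes the RDB file that holds only the archived keys.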
Instead of trying to extract data from a live Redis instance for archiving purpose, my suggestion would be to extract the data from a dump file.
Run a bgsave command to generate a dump, and then use redis-rdb-tools to extract the keys you are interested in - you can easily get the result as a json file.
See https://github.com/sripathikrishnan/redis-rdb-tools
You can keep the json data in flat files, or try to store them into a relational database or a document store if you need them to be indexed for retrieval purpose.
A few suggestions for you...
I would like to be able to selectively "archive" some of my data to
disk (or cheaper storage) once it is no longer important. I don't
really want to delete it because it might be valuable at some point.
If such data is that valuable, use a traditional database for storage. Despite redis supporting snap-shotting to disk and AOF logs, you should view it as mostly volatile storage. The primary use case for redis is reducing latency, not persistence of valuable data.
So when I am done with id 4, I want to "archive" all keys that
match id_4_*
What constitutes done? You need to ask yourself this question; does it mean after 1 day the data can fall out of redis? If so, just use TTL and expiration to let redis remove the object from memory. If you need it again, fall back to the database and pull the object back into redis. That first client will take the hit of pulling from the db, but subsequent requests will be cached. If done means something not associated with a specific duration, then you'll have to remove items from redis manually to conserve memory space.
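The TTL-plus-fallback pattern described here can be sketched as below, assuming a redis-py style client and a caller-supplied `load_from_db` function (hypothetical) that fetches the value from the database. The one-day default TTL mirrors the "after 1 day" example.

```python
from typing import Any, Callable, Optional


def get_with_fallback(
    redis_client: Any,
    key: str,
    load_from_db: Callable[[str], Optional[bytes]],
    ttl_seconds: int = 86400,
) -> Optional[bytes]:
    """Read from Redis; on a miss, load from the database and cache with a TTL."""
    cached = redis_client.get(key)  # bytes unless decode_responses=True
    if cached is not None:
        return cached
    value = load_from_db(key)  # the first client pays the database hit
    if value is not None:
        # SETEX stores the value and its expiry in one command.
        redis_client.setex(key, ttl_seconds, value)
    return value
```

Once the TTL passes, Redis drops the key on its own and the next request repopulates it from the database.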
If this were mysql, I would have my different tables and would very
easily just dump it to a .sql file and then drop the table. The actual
.sql file isn't directly useful to me, but I could reimport the data
if/when I need it.
We do the same at my firm. Important data is imported into Redis from the RDBMS by an on-demand job. We don't drop tables; we just selectively import data from the database into Redis. Nothing wrong with that.
Is there someway to make an rdb or aof file that is a subset of the
data?
I don't believe there is a way to do selective archiving; it's either all or none.
IMO, spend more time playing with redis. I highly recommend leveraging out-of-box features instead of reinventing and/or over-engineering solutions to suit your needs.
Hope that helps!...
Every day we have an SSIS package running to import data into a database.
Sometimes people are querying the database at the same time.
The load (data import) times out because there's a table lock on the specific table.
What is the standard protocol for inserting data and querying data at the same time?
First you need to figure out where those locks are coming from. Use the link to see if there are any locks.
How to: Determine Which Queries Are Holding Locks
If you have another process that holds a table lock then not much you can do.
Are you sure the error is "not able to OBTAIN a table lock"? If so, look at changing your SSIS package to not use table locks.
There are several strategies.
One approach is to design your ETL pipeline so as to minimize lock time. All the data is prepared in staging tables and then, when complete, is switched in using fast partition switch operations; see Transferring Data Efficiently by Using Partition Switching. This way the ETL blocks reads only for a very short duration. It also has the advantage that the reads see all the ETL data at once, not intermediate stages. The drawback is that it is difficult to implement.
Another approach is to enable snapshot isolation and/or read committed snapshot in the database; see Row Versioning-based Isolation Levels in the Database Engine. This way reads no longer block behind the locks held by the ETL. The drawback is resource consumption: the hardware must be able to drive the additional load of row versioning.
Yet another approach is to move the data querying to a read-only standby server, e.g., using log shipping.
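For the row-versioning approach, enabling the two database options comes down to two ALTER DATABASE statements. The sketch below issues them through a DB-API style cursor (for example one obtained from pyodbc, which is an assumption here, as is the function name). The connection should be in autocommit mode, and switching READ_COMMITTED_SNAPSHOT on needs the database to have no other active connections at that moment.

```python
from typing import Any


def enable_row_versioning(cursor: Any, database: str) -> None:
    """Turn on snapshot isolation and read committed snapshot for a database."""
    # Bracket-quote the name, escaping any closing bracket inside it.
    quoted = "[" + database.replace("]", "]]") + "]"
    # Lets sessions opt in with SET TRANSACTION ISOLATION LEVEL SNAPSHOT.
    cursor.execute(f"ALTER DATABASE {quoted} SET ALLOW_SNAPSHOT_ISOLATION ON")
    # Makes the default READ COMMITTED level read row versions instead of
    # waiting behind writers' locks.
    cursor.execute(f"ALTER DATABASE {quoted} SET READ_COMMITTED_SNAPSHOT ON")
```

With the second option on, existing report queries stop blocking behind the SSIS load without any change to the queries themselves.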