Jackrabbit repository incremental backup

I'm using Jackrabbit v2.2.x. I want to know whether there is a way to take an incremental backup of a Jackrabbit repository, i.e. just the delta (difference) based on date or something else. The problem is that the repository size is in terabytes, and every time we have to take production data it takes a long time to copy the full repository.

If the storage backend supports incremental backups, an incremental low-level backup might be the easiest solution.
If not, you could possibly use the EventJournal to iterate over the changes since the last backup and back up just those changes. Most likely this will require more work, however.
Another solution is to do an incremental backup of the data store (if this is what uses most of the disk space), and do a full backup of the node data (persistence managers).
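For the data store part, here is a minimal sketch of such an incremental low-level copy in Python, assuming a FileDataStore whose immutable, content-addressed binaries live under repository/datastore; the paths and the timestamp file are illustrative, not something Jackrabbit provides.

```python
import os
import shutil
import time

# Hypothetical paths; adjust to your repository layout.
DATASTORE_DIR = "/srv/jackrabbit/repository/datastore"
BACKUP_DIR = "/backup/jackrabbit/datastore"
STAMP_FILE = "/backup/jackrabbit/last_backup_ts"


def incremental_datastore_backup():
    """Copy only data store files added since the previous run.

    FileDataStore entries are immutable and content-addressed, so a file
    that was copied on an earlier run never needs to be copied again.
    """
    last_run = 0.0
    if os.path.exists(STAMP_FILE):
        with open(STAMP_FILE) as f:
            last_run = float(f.read().strip())

    started = time.time()
    for root, _dirs, files in os.walk(DATASTORE_DIR):
        for name in files:
            src = os.path.join(root, name)
            if os.path.getmtime(src) < last_run:
                continue  # already captured by a previous backup
            dst = os.path.join(BACKUP_DIR, os.path.relpath(src, DATASTORE_DIR))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(src, dst)

    with open(STAMP_FILE, "w") as f:
        f.write(str(started))


if __name__ == "__main__":
    incremental_datastore_backup()
```

The node data (persistence managers) would still need a periodic full backup, as the answer above suggests.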

Related

DynamoDB backup and restore using Data Pipelines. How long does it take to back up and recover?

I'm planning to use Data Pipelines as a backup and recovery tool for our DynamoDB tables. We will be using Amazon's prebuilt pipelines to back up to S3, and the prebuilt recovery pipeline to recover to a new table in case of a disaster.
This will also serve the dual purpose of data archival for legal and compliance reasons. We have explored snapshots, but they can get quite expensive compared to S3. Does anyone have an estimate of how long it takes to back up a 1 TB database, and how long it takes to recover one?
I've read the Amazon docs, which say it can take up to 20 minutes to restore from a snapshot, but there is no mention of how long a Data Pipeline takes. Does anyone have any clues?
Does the newly released feature for exporting from DynamoDB to S3 do what you want for your use case? To use it you must have continuous backups enabled, though. Perhaps that will give you the short-term backup you need?
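If that export feature fits, the call is roughly the following boto3 sketch; the table ARN, bucket and prefix are placeholders, and point-in-time recovery must already be enabled on the table.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Table ARN, bucket and prefix are placeholders.
response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/MyTable",
    S3Bucket="my-dynamodb-exports",
    S3Prefix="MyTable/2021-01-01/",
    ExportFormat="DYNAMODB_JSON",  # or "ION"
)
print(response["ExportDescription"]["ExportStatus"])  # e.g. IN_PROGRESS
```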
It would be interesting to know why you're not planning to use the built-in backup mechanism. It offers point in time recovery and it is highly predictable in terms of cost and performance.
The Data Pipelines backup is unpredictable, will very likely cost more, and is operationally much less reliable. Plus, getting a consistent (i.e. point-in-time) snapshot requires stopping the world. Speaking from experience, I don't recommend using Data Pipelines for backing up DynamoDB tables!
Regarding how long it takes to take a backup, that depends on a number of factors but mostly on the size of the table and the provisioned capacity you're willing to throw at it, as well as the size of the EMR cluster you're willing to work with. So, it could take anywhere from a minute to several hours.
Restoring time also depends on pretty much the same variables: provisioned capacity and total size. And it can also take anywhere from a minute to many hours.
Point in time backups offer consistent, predictable and most importantly reliable performance regardless of the size of the table: use that!
And if you're just interested in dumping the data from the table (i.e. not necessarily the restore part), use the new export to S3.
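As a rough boto3 sketch of that built-in route (table names are placeholders): enable continuous backups once, then restore into a new table whenever needed.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# One-off: turn on point-in-time recovery (continuous backups) for the table.
dynamodb.update_continuous_backups(
    TableName="MyTable",
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)

# Later: restore into a new table, either to the latest restorable time
# or to a specific RestoreDateTime within the recovery window.
dynamodb.restore_table_to_point_in_time(
    SourceTableName="MyTable",
    TargetTableName="MyTable-restored",
    UseLatestRestorableTime=True,
)
```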

Delete records from SQL Server backup file

It may sound like an insane idea to delete records from a backup, since the whole point of a backup is to serve in a disaster. But in our case, data deletion is a valid use case.
Requirement: in brief, we need a system that is capable of deleting a specific record from an active database instance and from all of its backups.
We have a fully functional internal system that can delete data from the active database. What we don't know is how to do the same against all of those database backups.
Question:
Is it possible to find a specific record from a backup?
Is there any predefined schema or data allocation style within SQL Server backup file, which allow us to isolate a specific record?
Can you share any thoughts or experience you have on such style of deletion?
Note: we take two full backups daily and store a week's worth (14 in total) at any point in time.
I do understand the business concept of "deleted everywhere".
I do not know of any way to do this. I do not believe the format of the backup is even published. That doesn't mean that someone hasn't hacked it, but it certainly isn't a broadly known capability.
I think that, in order to do this, you will need to securely wipe all copies of backups and take new backups. You then lose the point in time recovery capability.
Solution: The way that I would address this business requirement is to restore each backup, delete the desired record(s), securely wipe the backup media (or destroy the old media and use new media), and then take a new backup of THAT restored copy. That will give you a point-in-time recovery of that data without the specific record(s).
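A sketch of that workflow, driving the T-SQL from Python with pyodbc; the connection string, database names, logical file names, paths and the delete predicate are all placeholders.

```python
import pyodbc

# Connection string, database names, file paths and the delete predicate
# below are placeholders for illustration only.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

# 1. Restore the old backup under a scratch name.
cur.execute("""
    RESTORE DATABASE SalesDb_Scrub
    FROM DISK = N'D:\\backups\\SalesDb_20240101.bak'
    WITH MOVE 'SalesDb'     TO N'D:\\scrub\\SalesDb_Scrub.mdf',
         MOVE 'SalesDb_log' TO N'D:\\scrub\\SalesDb_Scrub.ldf',
         REPLACE
""")

# 2. Delete the record(s) that must disappear everywhere.
cur.execute("DELETE FROM SalesDb_Scrub.dbo.Customers WHERE CustomerId = ?", 42)

# 3. Take a fresh backup of the scrubbed copy; the old .bak file and its
#    media are then securely wiped or destroyed outside SQL Server.
cur.execute("""
    BACKUP DATABASE SalesDb_Scrub
    TO DISK = N'D:\\backups\\SalesDb_20240101_scrubbed.bak'
    WITH INIT
""")
```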
You can't modify the contents of a .bak file, and you shouldn't want to either. If you want to restore to a specific point in time, you should use the full recovery model and take differential and log backups instead of just full backups.
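For the backup cadence that answer suggests, the statements look roughly like this, again issued through pyodbc; names and paths are placeholders.

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

# Switch to the FULL recovery model so transaction log backups are possible.
cur.execute("ALTER DATABASE SalesDb SET RECOVERY FULL")

# Periodic full backup: the base that every later backup builds on.
cur.execute("BACKUP DATABASE SalesDb TO DISK = N'D:\\backups\\SalesDb_full.bak' WITH INIT")

# Differential backup: only the pages changed since the last full backup.
cur.execute("BACKUP DATABASE SalesDb TO DISK = N'D:\\backups\\SalesDb_diff.bak' WITH DIFFERENTIAL, INIT")

# Frequent log backups make point-in-time restores (STOPAT) possible.
cur.execute("BACKUP LOG SalesDb TO DISK = N'D:\\backups\\SalesDb_log.trn' WITH INIT")
```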

How to do a quick differential backup and restore in SQL Server

I'm using SpecFlow with Selenium for doing UI testing on my ASP.NET MVC websites. I want to be able to restore the database (SQL Server 2012) to its pre-test condition after I have finished my suite of tests, and I want to do it as quickly as possible. I can do a full backup and restore with replace (or with STOPAT), but that takes well over a minute, when the differential backup itself took only a few seconds. I want to basically set a restore point and then revert to it as quickly as possible, dropping any changes made since the backup. It seems to me that this should be able to be done very quickly, without needing to overwrite the whole database. Is this possible, and if so, how?
Not with a differential backup. A differential backup is an image of all the data pages that have changed since the last full backup. In order to restore a differential backup, you must first restore its base (i.e. full) backup and then the differential.
What you're asking for is some process that keeps track of the changes since the backup. That's where a database snapshot shines. It's a copy-on-write technology, which is to say that when you make a data modification, it writes the pre-change state of the data page to the snapshot file before writing the change itself. Reverting is quick because it only needs to pull back those changed pages from the snapshot. Check out the "creating a database snapshot" example in the CREATE DATABASE documentation.
Keep in mind that this isn't really a good guard against failure (which is one of the reasons to take a backup). But for your described use case, it sounds like a good fit.
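A sketch of the snapshot approach, issuing the T-SQL from Python; the database name, snapshot name and sparse-file path are placeholders, and the logical file name must match your database's data file.

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

# Create the "restore point" before the test suite runs. The snapshot must
# declare a sparse file for every data file in the source database; the
# logical name 'MyAppDb' and the path are placeholders.
cur.execute("""
    CREATE DATABASE MyAppDb_PreTest
    ON (NAME = MyAppDb, FILENAME = N'D:\\snapshots\\MyAppDb_PreTest.ss')
    AS SNAPSHOT OF MyAppDb
""")

# ... run the SpecFlow/Selenium suite ...

# Revert: only the pages changed since the snapshot are copied back, which
# is typically much faster than restoring a full backup. Reverting needs
# exclusive access, and this must be the only snapshot of the database.
cur.execute("USE master")
cur.execute("ALTER DATABASE MyAppDb SET SINGLE_USER WITH ROLLBACK IMMEDIATE")
cur.execute("RESTORE DATABASE MyAppDb FROM DATABASE_SNAPSHOT = 'MyAppDb_PreTest'")
cur.execute("ALTER DATABASE MyAppDb SET MULTI_USER")
cur.execute("DROP DATABASE MyAppDb_PreTest")
```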

archiving some redis data to disk

I have been using Redis a lot lately and am really loving it. I am mostly familiar with persistence (RDB and AOF). I do have one concern, though: I would like to be able to selectively "archive" some of my data to disk (or cheaper storage) once it is no longer important. I don't really want to delete it, because it might be valuable at some point.
All of my keys are named id_<id>_<someattribute>. So when I am done with id 4, I want to "archive" all keys that match id_4_*. I can view them quite easily with the command line, but I can't do anything with them, per se. I have quite a bit of data (very large bitmaps) associated with this data set, and frankly I can't afford the space once the id is no longer relevant or important.
If this were MySQL, I would have my different tables and could very easily just dump one to a .sql file and then drop the table. The actual .sql file isn't directly useful to me, but I could reimport the data if/when I need it. Or maybe I have two MySQL databases and want to move one table from one to the other. Are there Redis corollaries to these processes? Is there some way to make an RDB or AOF file that is a subset of the data?
Any help or input on this matter would be appreciated! Thanks!
@Hoseong Hwang recently asked what I did, so I'm posting what I ended up doing.
It was really quite simple, actually. I benefited from the fact that my key space is segmented by user. All of my keys had the structure user_<USERID>_<OTHERVALUES>, and my archival needs were on a per-user basis: some users' data no longer needed to be kept in Redis.
So, I started up another instance of redis-server, on another local port (6380, say) or another machine; it makes no difference. Then I wrote a short script that basically just called KEYS user_<USERID>_* (I understand the blocking nature of KEYS; my key space is small enough that it didn't matter, and you can use SCAN if that is an issue for you). Then, for each key, I ran MIGRATE to move it to that new redis-server instance. After they were all done, I did a SAVE to ensure that the RDB file for that instance was up to date. That RDB file now contains just the content I wanted to archive. I then terminated the temporary redis-server and the memory was reclaimed.
Now, keep that RDB file somewhere for cheap, safe keeping. If you ever need it again, reversing the process above to get those keys back into your main redis-server is fairly straightforward.
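A rough redis-py version of that script; host, port and the key pattern are placeholders, and multi-key MIGRATE assumes Redis 3.0.6 or later.

```python
import redis

# Source is the production instance; the target is a scratch instance
# started just for the archive (host/port/user id are placeholders).
src = redis.Redis(host="localhost", port=6379)
dst_host, dst_port = "localhost", 6380
user_id = 4

batch = []
for key in src.scan_iter(match=f"user_{user_id}_*", count=1000):
    batch.append(key)
    if len(batch) >= 100:
        # MIGRATE moves the keys, so they are deleted from the source.
        src.migrate(dst_host, dst_port, batch, destination_db=0, timeout=5000)
        batch = []
if batch:
    src.migrate(dst_host, dst_port, batch, destination_db=0, timeout=5000)

# Force the scratch instance to write its dump.rdb, then shut it down and
# keep that file as the archive.
redis.Redis(host=dst_host, port=dst_port).save()
```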
Instead of trying to extract data from a live Redis instance for archiving purposes, my suggestion would be to extract the data from a dump file.
Run a BGSAVE command to generate a dump, and then use redis-rdb-tools to extract the keys you are interested in; you can easily get the result as a JSON file.
See https://github.com/sripathikrishnan/redis-rdb-tools
You can keep the JSON data in flat files, or try to store it in a relational database or a document store if you need it to be indexed for retrieval purposes.
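As a sketch, assuming the rdb command-line tool from redis-rdb-tools is installed and its --command and --key options behave as described in the project README; the dump path and key regex are placeholders.

```python
import subprocess

# Filter an offline copy of the dump; nothing here touches the live instance.
# Paths and the key regex are placeholders.
with open("user_4_archive.json", "wb") as out:
    subprocess.run(
        [
            "rdb",
            "--command", "json",
            "--key", "user_4_.*",        # regex filter on key names
            "/var/redis/6379/dump.rdb",  # copy of the dump produced by BGSAVE
        ],
        stdout=out,
        check=True,
    )
```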
A few suggestions for you...
"I would like to be able to selectively 'archive' some of my data to disk (or cheaper storage) once it is no longer important. I don't really want to delete it, because it might be valuable at some point."
If such data is that valuable, use a traditional database for storage. Despite Redis supporting snapshotting to disk and AOF logs, you should view it as mostly volatile storage. The primary use case for Redis is reducing latency, not persistence of valuable data.
"So when I am done with id 4, I want to 'archive' all keys that match id_4_*"
What constitutes "done"? You need to ask yourself this question: does it mean that after one day the data can fall out of Redis? If so, just use TTLs and expiration to let Redis remove the object from memory. If you need it again, fall back to the database and pull the object back into Redis. That first client will take the hit of pulling from the DB, but subsequent requests will be cached. If "done" means something not associated with a specific duration, then you'll have to remove items from Redis manually to conserve memory.
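A minimal cache-aside sketch of the TTL approach with redis-py; fetch_from_database is a placeholder for whatever your system of record is.

```python
import json
import redis

r = redis.Redis()
TTL_SECONDS = 24 * 3600  # how long an id stays "hot" in Redis


def fetch_from_database(key):
    """Placeholder for the fallback lookup in your system of record."""
    raise NotImplementedError


def get_value(key):
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)

    value = fetch_from_database(key)              # cache miss: first caller pays
    r.setex(key, TTL_SECONDS, json.dumps(value))  # re-cache with a fresh TTL
    return value
```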
"If this were MySQL, I would have my different tables and could very easily just dump one to a .sql file and then drop the table. The actual .sql file isn't directly useful to me, but I could reimport the data if/when I need it."
We do the same at my firm. Important data is imported into Redis from the RDBMS, executed as an on-demand job. We don't drop tables; we just selectively import data from the database into Redis. Nothing wrong with that.
"Is there some way to make an RDB or AOF file that is a subset of the data?"
I don't believe there is a way to do selective archiving; it's either all or none.
IMO, spend more time playing with redis. I highly recommend leveraging out-of-box features instead of reinventing and/or over-engineering solutions to suit your needs.
Hope that helps!...

DynamoDB - How to do incremental backup?

I am using DynamoDB tables with keys and throughput optimized for application use cases. To support other ad hoc administrative and reporting use cases I want to keep a complete backup in S3 (a day old backup is OK). Again, I cannot afford to scan the entire DynamoDB tables to do the backup. The keys I have are not sufficient to find out what is "new". How do I do incremental backups? Do I have to modify my DynamoDB schema, or add extra tables just to do this? Any best practices?
Update:
DynamoDB Streams solves this problem:
"DynamoDB Streams captures a time-ordered sequence of item-level modifications in any DynamoDB table, and stores this information in a log for up to 24 hours. Applications can access this log and view the data items as they appeared before and after they were modified, in near real time."
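Enabling the stream on an existing table is a one-off call; here is a boto3 sketch with a placeholder table name. NEW_AND_OLD_IMAGES matches the "before and after" view described above.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# NEW_AND_OLD_IMAGES records each item as it looked before and after the
# modification, matching the description above. Table name is a placeholder.
dynamodb.update_table(
    TableName="MyTable",
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
)
```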
I see two options:
Generate the current snapshot. You'll have to read from the table to do this, which you can do at a very slow rate to stay under your capacity limits (Scan operation). Then, keep an in-memory list of updates performed for some period of time. You could put these in another table, but you'll have to read those, too, which would probably cost just as much. This time interval could be a minute, 10 minutes, an hour, whatever you're comfortable losing if your application exits. Then, periodically grab your snapshot from S3, replay these changes on the snapshot, and upload your new snapshot. I don't know how large your data set is, so this may not be practical, but I've seen this done with great success for data sets up to 1-2GB.
Add read throughput and backup your data using a full scan every day. You say you can't afford it, but it isn't clear if you mean paying for capacity, or that the scan would use up all the capacity and the application would begin failing. The only way to pull data out of DynamoDB is to read it, either strongly or eventually consistent. If the backup is part of your business requirements, then I think you have to determine if it's worth it. You can self-throttle your read by examining the ConsumedCapacityUnits property on your results. The Scan operation has a Limit property that you can use to limit the amount of data read in each operation. Scan also uses eventually consistent reads, which are half the price of strongly consistent reads.
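A boto3 sketch of the second option: a paginated Scan that limits the page size, asks for consumed capacity, and sleeps to stay under a read budget. The table name and budget are placeholders.

```python
import time
import boto3

dynamodb = boto3.client("dynamodb")


def slow_full_scan(table_name, page_size=100, max_rcu_per_sec=25):
    """Full-table scan that throttles itself against a read-capacity budget."""
    start_key = None
    while True:
        kwargs = {
            "TableName": table_name,
            "Limit": page_size,
            # Eventually consistent reads (the default) cost half as much
            # as strongly consistent ones.
            "ReturnConsumedCapacity": "TOTAL",
        }
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        page = dynamodb.scan(**kwargs)

        for item in page["Items"]:
            yield item

        consumed = page["ConsumedCapacity"]["CapacityUnits"]
        time.sleep(consumed / max_rcu_per_sec)  # crude self-throttling

        start_key = page.get("LastEvaluatedKey")
        if not start_key:
            break
```

Each yielded item can then be appended to a local file or streamed to S3 as the daily snapshot.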
You can now use DynamoDB Streams to have data persisted into another table, or to maintain another copy of the data in another datastore.
https://aws.amazon.com/blogs/aws/dynamodb-streams-preview/
For incremental backups, you can associate your DynamoDB Stream with a Lambda function to automatically trigger code for every data update (i.e. copy the data to another store such as S3).
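A minimal sketch of such a Lambda handler in Python, writing every change record to an S3 bucket; the bucket name and object-key scheme are illustrative only.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-dynamodb-incremental-backups"  # placeholder bucket name


def handler(event, context):
    """Triggered by the DynamoDB stream; writes every change to S3."""
    for record in event["Records"]:
        table = record["eventSourceARN"].split("/")[1]
        # One object per change, ordered by the stream sequence number.
        object_key = "{}/{}.json".format(table, record["dynamodb"]["SequenceNumber"])
        body = {
            "eventName": record["eventName"],  # INSERT / MODIFY / REMOVE
            "keys": record["dynamodb"]["Keys"],
            "newImage": record["dynamodb"].get("NewImage"),
            "oldImage": record["dynamodb"].get("OldImage"),
        }
        s3.put_object(Bucket=BUCKET, Key=object_key, Body=json.dumps(body))
    return {"processed": len(event["Records"])}
```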
A Lambda function you can use to tie into DynamoDB for incremental backups:
https://github.com/PageUpPeopleOrg/dynamodb-replicator
I've provided a detailed walk through on how you can use DynamoDB Streams, Lambda and S3 versioned buckets to create incremental backups for your data in DynamoDb on my blog:
https://www.abhayachauhan.com/category/aws/dynamodb/dynamodb-backups
Alternatively, DynamoDB just released On-Demand Backup and Restore. These aren't incremental, but full backup snapshots.
Check out https://www.abhayachauhan.com/2017/12/dynamodb-scheduling-on-demand-backups/ for more information.
HTH
On November 29th, 2017, On-Demand Backup was introduced. It allows you to create backups directly in DynamoDB essentially instantly, without consuming any capacity. Here are a few snippets from the blog post:
This feature is designed to help you to comply with regulatory requirements for long-term archival and data retention. You can create a backup with a click (or an API call) without consuming your provisioned throughput capacity or impacting the responsiveness of your application. Backups are stored in a highly durable fashion and can be used to create fresh tables.
...
The backup is available right away! It is encrypted with an Amazon-managed key and includes all of the table data, provisioned capacity settings, Local and Global Secondary Index settings, and Streams. It does not include Auto Scaling or TTL settings, tags, IAM policies, CloudWatch metrics, or CloudWatch Alarms.
You may be wondering how this operation can be instant, given that some of our customers have tables approaching half of a petabyte. Behind the scenes, DynamoDB takes full snapshots and saves all change logs. Taking a backup is as simple as saving a timestamp along with the current metadata for the table.
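Taking and restoring such an on-demand backup is a couple of API calls; here is a boto3 sketch with placeholder table names.

```python
import boto3
from datetime import datetime, timezone

dynamodb = boto3.client("dynamodb")

# Take an on-demand backup; it consumes no read capacity.
backup = dynamodb.create_backup(
    TableName="MyTable",
    BackupName="MyTable-" + datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S"),
)
backup_arn = backup["BackupDetails"]["BackupArn"]

# Restoring always creates a new table from the snapshot.
dynamodb.restore_table_from_backup(
    TargetTableName="MyTable-restored",
    BackupArn=backup_arn,
)
```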
A Scan operation in DynamoDB returns rows sorted by primary key (hash key). So if a table's hash key is an auto-incremented integer, set the hash key of the last record saved during the previous backup as the ExclusiveStartKey parameter of the Scan request when doing the next backup, and the scan will return only records that have been created since the last backup.