How to back up a huge file with Bacula?

It is currently 700MB, but it's conceivable that it will grow beyond 1GB. Normally I just copy this file to another location (for the curious, it's the database of a Zope instance, a ZODB file).
This file changes little from day to day, but I understand Bacula can't do inside-the-file subdivision for incremental backups. Anyway, it doesn't matter. What I want is a daily full backup, keeping two of them, and a weekly full backup, also keeping two of them. That way, at any given time I can get back yesterday, the day before yesterday, a week ago and two weeks ago. Do you think that's a good idea?
I suppose I should make two schedules, daily and weekly, but what settings should I use for the volumes and the pools to achieve this? Two volumes of 1.5GB each? Any hints or guidance are welcome; I'm not a sysadmin and my experience with Bacula is very limited.

Online backup of a large database file is risky business, as the file might change while you are reading it, rendering the backup inconsistent and possibly useless. I believe you should not be making backups of the ZODB file itself, but rather of diffs created daily by the repozo tool. This way, you also outsource the job of handling the inside-the-file subdivisions that you say Bacula is incapable of dealing with.
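For example, a minimal repozo run might look like this (the paths are made up for illustration; the script usually lives in the Zope instance's bin/ directory):

    # -B = backup mode, -r = directory holding the backup files,
    # -f = the live Data.fs, -z = gzip the output.
    # The first run produces a full copy; later runs produce small incrementals.
    bin/repozo -B -z -r /var/backups/zodb -f /var/zope/instance/var/Data.fs

    # To rebuild the file, repozo replays the full copy plus the increments:
    bin/repozo -R -r /var/backups/zodb -o /tmp/Data.fs.restored

Bacula then only has to pick up the (mostly small) files in the repository directory.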

In my experience with Bacula and backup to disk, it is best to keep one volume per backup job. That way there is no dead space in the files as jobs expire, Bacula can reuse the whole volume, and it cuts down on disk utilization. Use the "Maximum Volume Jobs = 1" directive in the Pool resource.
I would set up two pools, a daily and a weekly. Set the volume retention to two days in the daily pool and two weeks in the weekly pool. Schedule the daily job on, say, Mon-Sat and the weekly job on Sunday.
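A rough sketch of what those two pools could look like in bacula-dir.conf (names, label formats and exact retention values are only illustrative):

    Pool {
      Name = Daily
      Pool Type = Backup
      Maximum Volume Jobs = 1     # one job per volume, so whole volumes get recycled
      Volume Retention = 2 days
      AutoPrune = yes
      Recycle = yes
      Label Format = "Daily-"
    }

    Pool {
      Name = Weekly
      Pool Type = Backup
      Maximum Volume Jobs = 1
      Volume Retention = 14 days
      AutoPrune = yes
      Recycle = yes
      Label Format = "Weekly-"
    }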

Depending on your infrastructure, I would recommend taking a snapshot of the volume you are backing up, to "freeze" it, and making the backup from there.
For some of our backups we use LVM snapshots (http://tldp.org/HOWTO/LVM-HOWTO/snapshots_backup.html) to avoid locking any of our databases (we have terabytes of data to back up, and a lock would have a huge impact on the service).
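As a rough illustration (the volume group, LV name and snapshot size are placeholders; size the snapshot for the writes you expect during the backup window):

    lvcreate --snapshot --name zodb-snap --size 2G /dev/vg0/zodb   # freeze a point-in-time view
    mkdir -p /mnt/zodb-snap
    mount -o ro /dev/vg0/zodb-snap /mnt/zodb-snap                  # expose it read-only
    # ... point the backup job (Bacula FileSet, rsync, etc.) at /mnt/zodb-snap ...
    umount /mnt/zodb-snap
    lvremove -f /dev/vg0/zodb-snap                                 # drop the snapshot when done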
Then, since you said the database does not change much, I would go with a 6-day retention period, 6 volumes for dailies and 2 volumes for weeklies. Your dailies should go to the incremental backup pool, and the weeklies should be the fulls.
For example, make the weeklies (fulls) run on Monday and then an incremental every day (Tue-Sun). This will let you go back to any day of the week if you realize your data is corrupted, without taking too much space or time during the backup.
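In Bacula terms, a Schedule resource along these lines would express that (assuming Daily and Weekly pools like the ones described in the other answer; the run time is arbitrary):

    Schedule {
      Name = "WeeklyCycle"
      Run = Level=Full Pool=Weekly mon at 23:05
      Run = Level=Incremental Pool=Daily tue-sat at 23:05
      Run = Level=Incremental Pool=Daily sun at 23:05
    }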
EDIT: And... I should check the post dates before answering. Haha. 3 Years late.

For the open source Bacula (bacula.org), using the "Maximum Volume Jobs = 1" directive is indeed the best idea.
If you want the "inside-the-file subdivision for incremental backups", please consider the Delta Plugin from Bacula Systems - https://www.baculasystems.com/products/bacula-enterprise-plugins/delta.

Related

DynamoDB backup and restore using Data Pipelines. How long does it take to back up and recover?

I'm planning to use Data Pipelines as a backup and recovery tool for our DynamoDB tables. We will be using Amazon's prebuilt pipelines to back up to S3, and use the prebuilt recovery pipeline to recover to a new table in case of a disaster.
This will also serve the dual purpose of data archival for legal and compliance reasons. We have explored snapshots, but this can get quite expensive compared to S3. Does anyone have an estimate of how long it takes to back up a 1TB database? And how long it takes to recover a 1TB database?
I've read the Amazon docs and they say it can take up to 20 minutes to restore from a snapshot, but there is no mention of how long a data pipeline takes. Does anyone have any clues?
Does the newly released feature of exporting from DynamoDB to S3 do what you want for your use case? To use this feature, you must have continuous backups enabled though. Perhaps that will give you the short term backup you need?
It would be interesting to know why you're not planning to use the built-in backup mechanism. It offers point in time recovery and it is highly predictable in terms of cost and performance.
The Data Pipelines backup is unpredictable, will very likely cost more, and operationally it is much less reliable. Plus, getting a consistent (i.e. point-in-time) snapshot requires stopping the world. Speaking from experience, I don't recommend using Data Pipelines for backing up DynamoDB tables!
Regarding how long it takes to take a backup, that depends on a number of factors but mostly on the size of the table and the provisioned capacity you're willing to throw at it, as well as the size of the EMR cluster you're willing to work with. So, it could take anywhere from a minute to several hours.
Restoring time also depends on pretty much the same variables: provisioned capacity and total size. And it can also take anywhere from a minute to many hours.
Point in time backups offer consistent, predictable and most importantly reliable performance regardless of the size of the table: use that!
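For reference, a minimal sketch with the AWS CLI (the table names are placeholders, and the timestamp is only an example):

    # Enable continuous backups / point-in-time recovery on the table
    aws dynamodb update-continuous-backups \
        --table-name MyTable \
        --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true

    # Restore to a new table at a chosen timestamp within the PITR window
    aws dynamodb restore-table-to-point-in-time \
        --source-table-name MyTable \
        --target-table-name MyTable-restored \
        --restore-date-time 2021-06-01T12:00:00Z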
And if you're just interested in dumping the data from the table (i.e. not necessarily the restore part), use the new export to S3.
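Something like this, assuming point-in-time recovery is already enabled on the table (the ARN and bucket are placeholders):

    aws dynamodb export-table-to-point-in-time \
        --table-arn arn:aws:dynamodb:us-east-1:123456789012:table/MyTable \
        --s3-bucket my-backup-bucket \
        --s3-prefix dynamodb-exports/ \
        --export-format DYNAMODB_JSON

The export runs against the PITR data, so it doesn't consume read capacity from the live table.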

Might transaction log backups cause deadlocks in ongoing operations?

In SQL Server, I take full and transaction log backups (full: once a day, transaction log: hourly during working hours). As far as I can see, there are some advantages of transaction log backups over differential backups. Regarding these issues, could you clarify the following points for me?
1. When taking transaction log backups hourly while employees continue working with the data, might there be problems such as deadlocks or data corruption? I use a job script in SQL Server Management Studio to take the backups, but I have no idea how SQL Server treats records that are currently being edited.
2. Generally speaking, what do you suggest for backups in addition to full backups: transaction log or differential backups?
No :)
Backups using the backup command do not require locks on any user tables.
Transaction log backups are usually more frequent than hourly - would your company really be okay with losing an hour's worth of data if something bad happened to your database disks?
Your schedule needs to depend on your requirements for RPO (recovery point objective) and RTO (recovery time objective). If you can only sustain 5 minutes' worth of lost data, then a transaction log backup every 5 minutes is required. If you can only cope with 1 hour of downtime, then you need to make sure your backups can be restored and recovered in that amount of time. The first part depends on how optimized your restore is, i.e. how long it takes to read the backups from your backup drives and write the data files back to your data drives (https://www.mssqltips.com/sqlservertip/4935/optimize-sql-server-database-restore-performance/ has some ideas). The second part depends on how much transaction log data needs to be read and applied to the database to recover it to the desired point.
You might find that you simply can't take full database backups fast enough; in those cases differential (incremental) backups could work, as there's less data to write, but SQL Server will then have to put the pieces back together at restore time.
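As a rough sketch of the pieces involved (the database name and paths are made up; in practice these run as SQL Server Agent jobs, and each backup goes to its own time-stamped file):

    BACKUP DATABASE [SalesDb] TO DISK = N'E:\Backups\SalesDb_full.bak' WITH INIT;         -- e.g. nightly full
    BACKUP DATABASE [SalesDb] TO DISK = N'E:\Backups\SalesDb_diff.bak' WITH DIFFERENTIAL; -- changes since the last full
    BACKUP LOG      [SalesDb] TO DISK = N'E:\Backups\SalesDb_1205.trn';                   -- every 5-15 minutes; this sets your RPO

None of these take locks on user tables, which is why they can run while people are working.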
Of course, if the restore is happening manually then you also need to account for human time in there!
It's a good idea to try out your backup and recovery process (before PROD!), this way you can tell if you're going to need to optimize the process further.

Delete records from SQL Server backup file

It may sound like an insane idea to delete records from a backup, since the whole point of a backup is to serve you in a disaster. But in our case, data deletion is a valid use case.
Requirement: in brief, we need a system capable of deleting a specific record from an active database instance and from all of its backups.
We have a fully functional internal system capable of deleting data from the active database. What we don't know is how to do the same against all of these database backups.
Question:
Is it possible to find a specific record from a backup?
Is there any predefined schema or data allocation style within SQL Server backup file, which allow us to isolate a specific record?
Can you share any thoughts or experience you have on such style of deletion?
Note: we take 2 full backups daily and store a week's worth (14 in total) at any point in time.
I do understand the business concept of "deleted everywhere".
I do not know of any way to do this. I do not believe the format of the backup is even published. That doesn't mean that someone hasn't hacked it, but it certainly isn't a broadly known capability.
I think that, in order to do this, you will need to securely wipe all copies of backups and take new backups. You then lose the point in time recovery capability.
Solution: the way I would address this business requirement is to restore each backup, delete the desired record(s), securely wipe the old backup media (or destroy it and use new media), and then take a new backup of THAT restored version. That gives you point-in-time recovery of the data without the specific record(s).
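A rough T-SQL sketch of that workflow for a single full backup (all database names, logical file names and paths here are hypothetical):

    -- 1. Restore the old backup to a working copy
    RESTORE DATABASE [CleansedCopy]
        FROM DISK = N'E:\Backups\Prod_full.bak'
        WITH MOVE 'Prod'     TO N'E:\Data\CleansedCopy.mdf',
             MOVE 'Prod_log' TO N'E:\Data\CleansedCopy_log.ldf',
             REPLACE;

    -- 2. Delete the record(s) that must disappear
    DELETE FROM [CleansedCopy].dbo.Customers WHERE CustomerId = 12345;

    -- 3. Take a fresh backup of the cleansed copy
    BACKUP DATABASE [CleansedCopy] TO DISK = N'E:\Backups\Prod_full_cleansed.bak' WITH INIT;

    -- 4. Securely wipe or destroy the original backup media.

Repeat for each of the 14 backups you keep.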
You can't modify the contents of a .bak file. You shouldn't want to do that either. If you want to restore to a specific point in time you should use the Full recovery model and take differential and log backups instead of just full backups.

Query about EBS Backed Instances + Backup on S3 + Snapshots

I've spent a number of days looking into putting up two Windows servers on Amazon, a domain controller and a Remote Desktop Services server, but there are a few questions I can't find detailed answers for (or any answers at all):
1) When you have an EBS-backed instance, I assume this means that all files (OS/applications/pagefile etc.) are stored on EBS? Physically in the datacentre, let's assume I have 50 gig of OS files/application data etc., are these all stored on just one SAN-type device? What happens if that device blows up, or say that particular datacentre gets destroyed? Is the data elsewhere? What is the probability that your entire EBS volume can just disappear?
2) As I understand it, you can back up your EBS instance to S3 with snapshotting. I assume you can choose how often to snapshot (say daily?). In my above scenario, if I have 50 gig of files and snapshot once a day, over 7 days will my S3 storage be 350 gig, or will it be 50 gig plus the incremental changes I have made over the week?
3) I remember reading somewhere that the instance has to go offline to snapshot. If that is the case, does it do this by shutting down the guest OS, snapshotting, then booting up, or does it just detach the data, prevent you from connecting while it snapshots, then bring it back to the exact moment before it went for a snapshot?
4) I understand the concept of paying per month per gig of space, but I am concerned about the $0.11 per 1 million I/O requests. How does that work when I am running a Windows server? I have no idea how many I/O requests a server makes to its disks - I am assuming a lot, since the entire VM is being stored on an EBS volume. Is running a server on standard EBS going to slow it down radically?
5) Are people using the snapshot to S3 as their main backup, or are people running other types of backup for data?
Sorry for the noob questions - I'd appreciate any partial answers, answers or advice anyone could offer me. Thanks in advance!
1) Amazon is fuzzy on this. They say that data is replicated within the AZ it belongs to, and that if you have less than 20GB of data changed since the last snapshot your annual failure rate is ~0.1-0.4%.
2) Snapshots are triggered manually, and are stored incrementally.
3) It depends on your filesystem. For example, on a Linux box with an XFS volume you can freeze I/O to the volume, take your snapshot (it only takes a second or so) and then unfreeze. If you take a snapshot without doing something similar, you run the risk of the data being in an inconsistent state.
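For example (the device, mount point and volume id are placeholders; the old ec2-api-tools syntax differs, this uses the current aws CLI):

    xfs_freeze -f /data                      # flush and block writes to the XFS volume
    aws ec2 create-snapshot \
        --volume-id vol-0123456789abcdef0 \
        --description "data volume $(date +%F)"
    xfs_freeze -u /data                      # resume writes

The freeze only needs to cover the create-snapshot call; once the snapshot has been initiated its point in time is fixed, even though it finishes copying in the background.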
4) I run all my instances on EBS. You probably wouldn't want your pagefile on EBS, though; it would make more sense to use instance storage for that. The I/O count depends heavily on your workload - an application server does far fewer IOPS than a database server, for example. You're unlikely to spend more than a few dollars a month per volume unless you're running particularly I/O-heavy operations.
5) Personally I don't care about the installed software/configuration (I have AMIs with all of that set up, so I can restore it in minutes); I only care about the data. I back that data up separately (S3 & Glacier). Partly that's because I was bitten by a bug EBS had about a year ago where they lost some snapshots.
You can also use multiple strategies, as Fantius commented. For example, on the mongodb servers I run, the boot volume is small (and never snapshotted or backed up, since it can be restored automatically from an AMI), with a separate data volume containing the actual mongodb data. The data volume is snapshotted, and dumps are also stored on S3. Snapshots are an efficient way of creating backups (since you're only storing incremental changes), but you can't transfer them out of your EC2 region, whereas a tarball on S3 can easily be copied anywhere.
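The dump side of that can be as simple as something like the following (the paths and bucket name are placeholders):

    mongodump --out /backup/mongodump-$(date +%F)                        # logical dump of the databases
    tar czf /backup/mongodump-$(date +%F).tar.gz -C /backup mongodump-$(date +%F)
    aws s3 cp /backup/mongodump-$(date +%F).tar.gz s3://my-backup-bucket/mongodb/

Unlike the snapshot, that tarball can be pulled down or copied to another region or provider.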

Offsite backups - possible with large amounts of code/source images etc?

The biggest hurdle I have in developing an effective backup strategy is being able to do some sort of offsite backup. Unfortunately, this can only be done by uploading data to the offsite location, and my cable internet connection has upload speeds that make this prohibitive.
Has anyone here managed to do offsite backups of large libraries of source code?
This is only relevant to home users and not in the workplace where budgets may open up doors.
EDIT: I am using Windows Vista (So 'nix solutions aren't relevant).
Thanks
I don't think your connection's upload speed will be as prohibitive as you think. Just make sure you look for a solution where your changes can be sent as diffs. Even if your initial sync takes days, the daily changes would likely be much more manageable.
Knowing a few more specifics about how much data you are talking about, and exactly how slow your connection is, I think would allow the community to make more specific suggestions.
Services like Mozy allow you to back up large amounts of data offsite.
They upload slowly in the background, and getting the initial sync to the servers can take a while depending on your speed and amount of data, but after that they use efficient diffs to keep the stored data in sync.
The pricing is very home-friendly too.
I think you have to define "backup" and what's acceptable to you.
At my house, I have a hot backup of our repositories: I poll svn once an hour over the VPN and it pulls down any check-ins. This is just to catch any check-ins that are not captured every 24 hours via the normal backup. I also send a full backup every 2 days through the pipe, to be outside of the normal 3-tier backup we do at the office. Our current repository is 2GB zipped at max compression; that takes 34 hrs at 12 k/s and 17 hrs at 24 k/s. You did not say the speed of your connection, so it's hard to judge whether that's workable.
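If you want to replicate that kind of hourly pull, svnsync is one way to do it (the URLs and paths here are placeholders, and the mirror repository needs a pre-revprop-change hook that lets svnsync set revision properties):

    # One-time setup of a local mirror repository
    svnadmin create /backup/repo-mirror
    svnsync initialize file:///backup/repo-mirror https://svn.example.com/repo

    # Run hourly (cron or a scheduled task): pulls down only the new revisions
    svnsync synchronize file:///backup/repo-mirror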
If uploading over your connection isn't viable, you might want to invest in a couple of 2.5" USB drives and rotate them offsite to a safe deposit box at the bank. This used to be my responsibility, but I lacked the discipline to do it consistently each week to ensure some safety net. In the end it was just easier to live with uploading the data to an FTP site at my house.