System drive incremental clone - backup

I have a question that will probably get a "-1" rating.
Once a day I back up my data drive to a clone drive with SyncBack, which compares both disks and makes the copy mirror the original, updating it only by adding or deleting files. It is easy and fast, and the backup disk is a bit-to-bit clone of the original disk.
Although it's not exactly that, one can call this an "incremental" backup.
I would like to know if it's possible to do the same with my system disk, i.e. to update a copy of the system drive once a day in order to maintain a bit-to-bit clone that is immediately bootable. Not by re-copying the whole system drive each time but, as with my data drive, just by adding or deleting the small amount of data that has changed that day.
Apart from building a RAID1 including my system disk, which implies letting the RAID running permanently, is there another way?
I didn't find any application that can bit-to-bit clone a system disk in such an "incremental" way.

I'm finally answering my own question, in case someone else wonders the same thing.
There is no app that can incrementally update a bit-to-bit clone drive. Such apps make disk images, which are not immediately bootable and must first be restored to a drive (which is nevertheless quicker than installing Windows).
The only way to update a ready-to-use backup system drive is to re-clone it completely at regular intervals, for instance with a standalone cloning dock station.
I generously give myself a "+1" rating.

Related

Syncing large amount of files across multiple machines in a scalable way

I'm looking for a way to sync a large number of machines (hundreds) with a remote repository.
The repository consists of small files (around 20KB each), but the total comes to a few GB and continues to grow over time.
The goal is to have changes at the remote repository propagate as fast as possible (no more than 2 seconds) to all the machines. (sync)
There are tools that provide exactly this functionality such as S3 sync or Rclone but carry a major disadvantage:
The sync command will need to enumerate all of the files in the bucket to determine whether a local file already exists in the bucket and if it is the same as the local file. The more documents you have in the bucket, the longer it's going to take. This means that once the bucket gets big even a small change will cost a lot of time.
I wonder if there is a way (a tool or a method) to sync only modified files, without having to go through all of the files. You can imagine a comparison of metadata at source and remote, determining what the diffs are and acting accordingly.
How would you go about it?
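One way to picture the metadata comparison described above is a manifest diff. This is only a minimal sketch under assumptions: it presumes each side keeps a manifest (path mapped to size and a content tag) that is updated whenever a file is written, and the names FileMeta, Diff and diff_manifests are made up for illustration, not part of S3 sync or Rclone.

```python
from typing import Dict, List, NamedTuple


class FileMeta(NamedTuple):
    size: int
    etag: str  # content hash or version tag (hypothetical field)


class Diff(NamedTuple):
    upload: List[str]  # new or changed on the source
    delete: List[str]  # present on the replica only


def diff_manifests(source: Dict[str, FileMeta],
                   replica: Dict[str, FileMeta]) -> Diff:
    """Compare two manifests and report what the replica must change."""
    # A path needs uploading if the replica lacks it or holds different metadata.
    upload = sorted(p for p, meta in source.items() if replica.get(p) != meta)
    # A path needs deleting if it no longer exists on the source.
    delete = sorted(p for p in replica if p not in source)
    return Diff(upload, delete)
```

The comparison still touches every manifest entry, but it is two in-memory dictionaries rather than one remote request per file, so only the paths in the resulting diff need any transfer. Something on the repository side would have to keep the manifest current for this to work.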

GCE snapshot - no system state saved?

Being relatively new to GCE, but not to other virtualization tools like VMware or VirtualBox, I'm not able to find in GCE a concrete way to get a full snapshot of a live machine.
I'm guessing it's my fault or poor knowledge, but does GCE really not save the "system state", or dump memory to the snapshot?
I've found many scripts and examples on how to flush buffers to disk before creating the snapshot, but no way to capture the complete state of the machine, including what the machine itself is running at THAT point.
Let me say that, if this is correct, the GCE snapshot IS NOT a snapshot.
Thanks in advance for your help.
That's a VM image, not a snapshot, and it does not include the contents of RAM or the processor state. A snapshot is a point-in-time copy of a persistent disk.
[link] (http://vcloud.vmware.com/uk/using-vcloud-air/tutorials/working-with-snapshots)
Here's an example of a cloud platform saving true snapshots, portraits of a specific second of a working machine.
Let me add a thought:
I don't know if VCloud is considering a particular state, gains privileged access to disks for a limited time, avoiding contingency, or else does a temporary duplication of the working disk in another volume.
I'm still reading around, trying to get INTO the problem.
BUT... it dumps memory to snapshot.
This is the point, and I'm wondering why this does not seem to be possible in GCE.

Jackrabbit repository incremental backup

I'm using Jackrabbit v2.2.x. I want to know if there is a way to take an incremental backup of a Jackrabbit repository. I mean just the delta (difference), based on date or something else. The problem is that the repository size is in terabytes, and every time we have to take production data it takes a lot of time to copy the full repository.
If the storage backend supports incremental backups, an incremental low-level backup might be the easiest solution.
If not, you could possibly use the EventJournal to iterate over the changes since the last backup, and back up just those changes. Most likely this will require more work, however.
Another solution is to do an incremental backup of the data store (if this is what uses most of the disk space), and do a full backup of the node data (persistence managers).

Moving 1 million image files to Amazon S3

I run an image sharing website that has over 1 million images (~150GB). I'm currently storing these on a hard drive in my dedicated server, but I'm quickly running out of space, so I'd like to move them to Amazon S3.
I've tried doing an RSYNC and it took RSYNC over a day just to scan and create the list of image files. After another day of transferring, it was only 7% complete and had slowed my server down to a crawl, so I had to cancel.
Is there a better way to do this, such as GZIP them to another local hard drive and then transfer / unzip that single file?
I'm also wondering whether it makes sense to store these files in multiple subdirectories or is it fine to have all million+ files in the same directory?
One option might be to perform the migration in a lazy fashion.
All new images go to Amazon S3.
Any requests for images not yet on Amazon trigger a migration of that one image to Amazon S3. (queue it up)
This should fairly quickly get all recent or commonly fetched images moved over to Amazon and will thus reduce the load on your server. You can then add another task that migrates the others over slowly whenever the server is least busy.
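The lazy migration described above can be sketched roughly as follows. This is an illustration under assumptions, not a real S3 client: the two dictionaries stand in for the old hard drive and the S3 bucket, and LazyMigrator with its method names is hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Set


class LazyMigrator:
    """Serve images from the old server until they have been copied across."""

    def __init__(self, local: Dict[str, bytes]) -> None:
        self.local = local                   # stand-in for the old hard drive
        self.remote: Dict[str, bytes] = {}   # stand-in for the S3 bucket
        self.pending: Deque[str] = deque()   # images waiting to be migrated
        self.queued: Set[str] = set()        # avoids queuing an image twice

    def store_new(self, name: str, data: bytes) -> None:
        """All new images go straight to the remote store."""
        self.remote[name] = data

    def fetch(self, name: str) -> bytes:
        """Return an image, queuing it for migration if it is still local."""
        if name in self.remote:
            return self.remote[name]
        data = self.local[name]              # KeyError for unknown images
        if name not in self.queued:
            self.queued.add(name)
            self.pending.append(name)
        return data

    def migrate_some(self, limit: int) -> int:
        """Move at most `limit` queued images; call this when the server is idle."""
        moved = 0
        while self.pending and moved < limit:
            name = self.pending.popleft()
            self.remote[name] = self.local.pop(name)
            self.queued.discard(name)
            moved += 1
        return moved
```

Requests are never blocked by the migration: fetch only records that an image is wanted, and the actual copy happens in migrate_some, which can be throttled through its limit argument.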
Given that the files do not exist (yet) on S3, sending them as an archive file should be quicker than using a synchronization protocol.
However, compressing the archive won't help much (if at all) for image files, assuming that the image files are already stored in a compressed format such as JPEG.
Transmitting ~150 Gbytes of data is going to consume a lot of network bandwidth for a long time. This will be the same if you try to use HTTP or FTP instead of RSYNC to do the transfer. An offline transfer would be better if possible; e.g. sending a hard disc, or a set of tapes or DVDs.
Putting a million files into one flat directory is a bad idea from a performance perspective. While some file systems cope with this fairly well, with O(logN) filename lookup times, others do not and need O(N) per lookup. Multiply that by N to access all files in a directory. An additional problem is that utilities that need to access files in order of file names may slow down significantly if they need to sort a million file names. (This may partly explain why rsync took 1 day to do the indexing.)
Putting all of your image files in one directory is a bad idea from a management perspective; e.g. for doing backups, archiving stuff, moving stuff around, expanding to multiple discs or file systems, etc.
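A common way to avoid one flat directory is to spread files over subdirectories derived from a hash of the file name. The sketch below is one possible scheme, not something the answer prescribes: the function name sharded_path, the use of SHA-1, and the two-level, two-character layout are all assumptions chosen for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(filename: str, levels: int = 2, width: int = 2) -> PurePosixPath:
    """Map a file name to a nested path such as 'ab/cd/filename'.

    The directory names are taken from the hex digest of the file name,
    so the same name always lands in the same place.
    """
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(*parts, filename)
```

With the defaults there are 256 x 256 = 65,536 possible leaf directories, so a million files average out to roughly 15 per directory. The same string also works as an object key prefix if the files later move to S3.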
One option you could use instead of transferring the files over the network is to put them on a hard drive and ship it to Amazon's Import/Export service. That way you don't have to worry about saturating your server's network connection.

How do you back up your development machine? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
How do you back up your development machine so that in the event of a catastrophic hardware malfunction, you are up and running in the least amount of time possible?
There's an important distinction between backing up your development machine and backing up your work.
For a development machine your best bet is an imaging solution that offers as near a "one-click-restore" process as possible. Time Machine (Mac) and Windows Home Server (Windows) are both excellent for this purpose. Not only can you have your entire machine restored in 1-2 hours (depending on HDD size), but both run automatically and store deltas, so you can have months of backups in relatively little space. There are also numerous "ghosting" packages, though they usually do not offer incremental/delta backups and so take more time and space to back up your machine.
Less good are products such as Carbonite/Mozy/JungleDisk/rsync. These products WILL allow you to retrieve your data, but you will still have to reinstall the OS and programs. Some also keep limited or no history.
For backing up your code and data, I would recommend a source control product like SVN. While a general backup solution will protect your data, it does not offer the labeling/branching/history functionality that source control packages do. These functions are invaluable for any type of project with a shelf-life.
You can easily run an SVN server on your local machine. If your machine is backed up, then your SVN database will be too. This IMO is the best solution for a home developer and is how I keep things.
All important files are in version control (Subversion)
My subversion layout generally matches the file layout on my web server so I can just do a checkout and all of my library files and things are in the correct places.
Twice-daily backups to an external hard drive
Nightly rsync backups to a remote server.
This means that I send stuff on my home server over to my webhost and all files & databases on my webhost back home so I'm not screwed if I lose either my house or my webhost.
I use Mozy, and rarely think about it. That's one weight off my shoulders that I won't ever miss.
Virtual machines and CVS.
Desktops are rolled out with ghost and are completely vanilla.
Except they have VirtualBox.
Then developers pull the configured baseline development environment down from CVS.
They log into the development VM image as themselves, refresh the source and libraries from CVS and they're up and working again.
This also makes doing development and maintenance at the same time a lot easier.
(I know some people won't like CVS or VirtualBox, so feel free to substitute your tools of choice)
Oh, and you check your work into a private branch off Trunk daily.
There you go.
Total time to recover : 1 hour (tops)
Time to "adopt" a shiny new laptop for a customer visit : 1 hour (tops)
And a step towards CMMI Configuration Management.
BTW your development machine should not contain anything of value. All your work (and your company's work) should be in central repositories (SVN).
I use TimeMachine.
For my home and development machines I use Acronis True Image.
In my opinion, with hard drives as cheap as they are, nothing replaces a full incremental daily HD backup.
A little preparation helps:
All my code is kept organized in one single directory (with categorized sub-directories).
All email is kept in various PSTs.
All code is also checked into source control at the end of every day.
All documents are kept in one place as well.
Backup:
Backup your code, email, documents as often as it suits you (daily).
Keep an image of your development environment always ready.
Failure and Recovery
If everything fails, format and install the image.
Copy back everything from backup and you are up and running.
Of course there are tweaks here and there (incremental backup, archiving, etc.) which you have to do to make this process real.
If you are talking absolute least amount of restore time... I've often set up machines to do Ghost (Symantec or something similar) backups on a nightly basis to either an image or just a direct copy to another drive. That way all you have to do is reimage the machine from the image or just swap the drives. You can be back up in under 10 minutes... The setup I did before was in a situation where we had some production servers that were redundant, and it was acceptable for them to be offline long enough to clone the drive... but only at night. During the day they had to be up 100%... It saved my butt a couple of times when a main drive failed... I just opened the case, swapped the cables so the backup drive was the new master and was back online in 5 minutes.
I've finally gotten my "fully automated data back-up strategy" down to a fine art. I never have to manually intervene, and I'll never lose another harddrive worth of data. If my computer dies, I'll always have a full bootable back-up that is no more than 24 hours old, and incremental back-ups no more than an hour old. Here are the details of how I do it.
My only computer is a 160 gig MacBook running OSX Leopard.
On my desk at work I have 2 external 500 gig harddrives.
One of them is a single 500 gig partition called "External".
The other has a 160 gig partition called "Clone" and a 340 gig partition called TimeMachine.
TimeMachine runs whenever I'm at work, constantly backing up my "in progress" files (which are also committed to Version Control throughout the day).
Every weekday at 12:05, SuperDuper! automatically copies my entire laptop harddrive to the "Clone" drive. If my laptop's harddrive dies, I can actually boot directly from the Clone drive and pick up work without missing a beat -- giving me some time to replace the drive (This HAS happened to me TWICE since setting this up!). (Technical Note: It actually only copies over whatever has changed since the previous weekday at 12:05... not the entire drive every time. Works like a charm.)
At home I have a D-Link DNS-323, which is a 1TB (2x500 gig) Network Attached Storage device running a Mirrored RAID, so that everything on the first 500 gig drive is automatically copied to the second 500 gig drive. This way, you always have a backup, and it's fully automated. This little puppy has a built-in Dynamic DNS client, and FTP server.
So, on my WRT54G router, I forward the FTP port (21) to my DNS-323, and leave its FTP server up.
After the SuperDuper clone has been made, rSync runs and synchronizes my "External" drive with the DNS-323 at home, via FTP.
That's it.
Using 4 drives (2 external, 2 in the NAS) I have:
1) An always-bootable complete backup less than 24 hours old, Monday-Friday
2) A working-backup of all my in-progress files, which is never more than 30 minutes old, Monday-Friday (when I'm at work and connected to the external drives)
3) Access to all my MP3s (170GB) and documents at work on the "External" and at home on the NAS
4) Two complete backups of all my MP3s and documents on the NAS (External is original copy, both drives on NAS are mirrors via ChronoSync)
Why do I do all of this?
Because:
1) In 2000, I dropped a 40 gig harddrive 1 inch, and it cost me $2500 to get that data back.
2) In the past year, I've had to take my MacBook in for repair 4 times. One dead harddrive, two dead motherboards, and a dead webcam. On the 4th time, they replaced my MacBook with a newer better one at no charge, and I haven't had a problem since.
Thanks to my daily backups, I didn't lose any work, or productivity. If I hadn't had them, though, all my work would have been gone, along with my MP3s, and my writing, and all the photos of my trips to Peru, Croatia, England, France, Greece, Netherlands, Italy, and all my family photos. Can you imagine? I'm sure you can, because I bet you have a pile of digital photos sitting on your computer right now... not backed-up in any way.
A combination of RAID1, Acronis, xcopy, DVDs and ftp. See:
http://successfulsoftware.net/2008/02/04/your-harddrive-will-fail-its-just-a-question-of-when/
Maybe just a simple hardware hard disk raid would be a good start. This way if one drive fails, you still have the other drive in the raid. If something other than the drives fail you can pop these drives into another system and get your files quickly.
I'm just sorting this out at work for the team. An image with all common tools is on Network. (We actually have a hotswap machine ready). All work in progress is on network too.
So a developer's machine goes boom. Use the hotswap machine and continue. Downtime ~15 mins + coffee break.
We have a corporate solution pushed down on us called Altiris, which works when it wants to. It depends on whether or not it's raining outside. I think Altiris might be a rain-god, and just doesn't know it. I am actually delighted when it's not working, because it means that I can have my 99% of CPU usage back, thank you very much.
Other than that, we don't have any rights to install other software solutions for backing things up or places we are permitted to do so. We are not permitted to move data off of our machines.
So, I end up just crossing my fingers while laughing at the madness.
I don't.
We do continuous integration, submit code often to the central source control system (which is backed up like crazy!).
If my machine dies at most I've lost a couple of days work.
And all I need to do is get a clean disk and set up the dev environment from a ghost image, or by spending a day sticking CDs in, rebooting after Windows Update, etc. Not a pleasant day, but I do get a nice clean machine.
At work NetBackup or PureDisk depending on the box, at home rsync.
Like a few others, I have a clean copy of my virtual PC that I can grab and start fresh with at any time, and all code is stored in Subversion.
I use SuperDuper! and back up my virtual machine to another external drive (I have two).
All the code is on an SVN server.
I have a clean VM in case mine fails. But in either case it takes me a couple of hours to install WinXP+Vstudio. I don't use anything else on that box.
I use xcopy to copy all my personal files to an external hard drive on startup.
Here's my startup.bat:
xcopy d:\files f:\backup\files /D /E /Y /EXCLUDE:BackupExclude.txt
This recurses into subdirectories, including empty ones (/E), copies only files that are newer than the copy already in the backup (/D), and suppresses the prompt to overwrite an existing file (/Y). Files and folders whose paths match an entry in BackupExclude.txt are not copied.
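For anyone not on Windows, the same idea can be approximated in a few lines of Python. This is a rough sketch of what the xcopy flags do, not a drop-in replacement: the function name mirror_newer is made up, and the exclude handling is a simplified plain substring match on the source path.

```python
import os
import shutil
from pathlib import Path
from typing import Iterable, List


def mirror_newer(src: Path, dst: Path, exclude: Iterable[str] = ()) -> List[Path]:
    """Copy files from src to dst when they are missing or newer than the copy."""
    patterns = [p for p in exclude if p]
    copied: List[Path] = []
    for root, _dirs, files in os.walk(src):
        target_dir = dst / Path(root).relative_to(src)
        target_dir.mkdir(parents=True, exist_ok=True)  # like /E: keep empty folders
        for name in files:
            source = Path(root) / name
            # Like /EXCLUDE: skip anything whose path contains an excluded string.
            if any(p in str(source) for p in patterns):
                continue
            target = target_dir / name
            # Like /D: skip files whose backup copy is at least as new.
            if target.exists() and target.stat().st_mtime >= source.stat().st_mtime:
                continue
            shutil.copy2(source, target)  # copy2 preserves the modification time
            copied.append(target)
    return copied
```

Because copy2 keeps the source's modification time on the copy, a second run right after the first finds nothing to do, which is what makes the daily run cheap.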
Windows Home Server. My dev box has two drives with about 750GB of data between them (C: is a 300GB SAS 15K RPM drive with apps and system on it, D: is a mirrored 1TB set with all my enlistments). I use Windows Home Server to back this machine up and have successfully restored it several times after horking it.
My development machine is backed up using Retrospect and Acronis. These are nightly backups that run when I'm asleep - one to an external drive and one to a network drive.
All my source code is in SVN repositories, and I keep all my repositories under a single directory. A scheduled task runs a script that spiders that path for SVN repositories and performs a number of hotcopies (using the hotcopy.py script) as well as an svndump of each repository.
My work machine gets backed up however they handle it, but I also have the same script running there to do hotcopies and svndumps onto a couple of locations that get backed up.
I make sure that of the work backups, one location is NOT on the SAN, yes it gets backed up and managed, but when it is down, it is down.
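A script of the kind described above (spider a path for SVN repositories, then hotcopy and dump each one) might look roughly like this. It is a sketch, not the poster's script: it calls the standard svnadmin hotcopy and svnadmin dump commands directly instead of hotcopy.py, the function names are hypothetical, and a repository is recognised only by its usual on-disk layout (a format file next to conf, db and hooks directories).

```python
import subprocess
from pathlib import Path
from typing import Iterator


def find_svn_repos(root: Path) -> Iterator[Path]:
    """Yield directories under root that look like Subversion repositories."""
    for marker in sorted(root.rglob("format")):
        repo = marker.parent
        if all((repo / sub).is_dir() for sub in ("conf", "db", "hooks")):
            yield repo


def backup_repo(repo: Path, dest: Path) -> None:
    """Hotcopy and dump one repository into dest (requires svnadmin on PATH)."""
    dest.mkdir(parents=True, exist_ok=True)
    # Assumption: the hotcopy target does not exist yet, so use a fresh dest per run.
    subprocess.run(
        ["svnadmin", "hotcopy", str(repo), str(dest / f"{repo.name}.hotcopy")],
        check=True,
    )
    # svnadmin dump writes the dump stream to stdout, so redirect it to a file.
    with open(dest / f"{repo.name}.dump", "wb") as out:
        subprocess.run(
            ["svnadmin", "dump", "--quiet", str(repo)],
            stdout=out,
            check=True,
        )
```

A scheduled task would then loop over find_svn_repos(some_root) and call backup_repo for each result, once per backup location.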
I would like a recommendation for an external RAID container, or perhaps just an external drive container, preferably interfacing using FireWire 800.
I also would like a recommendation for a manufacturer for the backup drives to go into the container. I read so many reviews of drives saying that they failed I'm not sure what to think.
I don't like backup services like Mozy because I don't want to trust them to not look at my data.
SuperDuper complete bootable backups every few weeks
Time Machine backups for my most important directories daily
Code is stored in network subversion/git servers
MySQL backups with cron on the web servers; a nightly cron job then uses ssh/rsync to pull them down onto our local servers.
If you use a Mac, it's a no brainer - just plug in an external hard drive and the built in Time Machine software will back up your whole system, then maintain an incremental backup on the schedule you define. This has got me out of a hole many a time when I've messed up my environment; it also made it super easy to restore my system after installing a bigger hard drive.
For offsite backups, I like JungleDisk - it works on Mac, Windows and Linux and backs up to Amazon S3 (or, added very recently, the Rackspace cloud service). This is a nice solution if you have multiple machines (or even VMs) and want to keep certain directories backed up without having to think about it.
Home Server Warning!
I installed Home Server on my development Server for two reasons: Cheap version of Windows Server 2003 and for backup reasons.
The backup software side of things is seriously hit or miss. If you 'Add' a machine to the list of computers to be backed up right at the start of installing Home Server, generally everything is great.
BUT it seems it becomes a WHOLE lot harder to add any other machines after a certain amount of time has passed.
(Case in point: I did a complete rebuild on my laptop, tried to Add it - NOPE!)
So I'm seriously doubting the reliability of this platform for backup purposes. It seems to defeat the purpose if you can't trust it 100%.
I have the following backup scenarios and use rsync as a primary backup tool.
(weekly) Windows backup for "bare metal" recovery
Content of System drive C:\ using Windows Backup for quick recovery after physical disk failure, as I don't want to reinstall Windows and applications from scratch. This is configured to run automatically using Windows Backup schedule.
(daily and conditional) Active content backup using rsync
Rsync takes care of all changed files from the laptop, phone and other devices. I back up the laptop every night and after significant changes in content, such as importing recent RAW photos from an SD card to the laptop.
I've created a bash script that I run from Cygwin on Windows to start rsync: https://github.com/paravz/windows-rsync-backup