Backing up large file directory


I need suggestions for backing up a very large file directory (500+ GB) on a weekly basis, in a LAMP environment.
I've read about rsync, but I'm not sure how to trigger it with cron.
Also, does anyone happen to know how much the compression in rsync shrinks the file size (let's say of a 3 MB .jpeg)? I only ask because I can't tell how large a backup server I will need.
Just pointing me in the right direction will be enough, I don't expect anyone to do my homework for me. Thanks in advance for any help!

Here is a wiki page that has much of your question answered.
I would read the whole page to grasp one concept at a time, but the "snapshot backup" is the rsync-script-to-beat-all-rsync-scripts: it does a Time Machine-like backup, storing differences going backwards in time, which is quite handy. This is great if you need chronologically-aware but minimally-sized backups.
Arch (the distro this wiki covers) does a really nice thing where you can just drop your scripts into a known location; you will have to adapt that to calling a script as a cron job. Here is a fairly comprehensive introduction to cron.
I would also like to point out that rsync's compression operates on transmission, not on storage. The file will be identical on your backup disk, so it takes up its full size there, but it may take less bandwidth to transfer.
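To make the cron part concrete, here is a minimal sketch of a snapshot-style wrapper around rsync. It is not the wiki's script; the function names, the dated directory layout, and the paths are made up for illustration, and it assumes the backup root is an absolute path mounted locally. It uses rsync's --link-dest option so that files unchanged since the previous snapshot are hard-linked rather than copied.

```python
import subprocess
from datetime import date
from pathlib import Path


def build_rsync_command(source, backup_root, stamp, previous=None):
    """Return the rsync argument list for one dated snapshot."""
    cmd = ["rsync", "-a", "--delete"]
    if previous is not None:
        # Unchanged files become hard links into the previous snapshot,
        # so a new snapshot only costs the space of what changed.
        cmd.append(f"--link-dest={backup_root}/{previous}")
    # Trailing slash on the source copies its contents, not the directory itself.
    cmd += [source.rstrip("/") + "/", f"{backup_root}/{stamp}/"]
    return cmd


def latest_snapshot(backup_root, exclude=None):
    """Name of the newest snapshot directory (ISO dates sort chronologically)."""
    root = Path(backup_root)
    if not root.is_dir():
        return None
    names = sorted(p.name for p in root.iterdir()
                   if p.is_dir() and p.name != exclude)
    return names[-1] if names else None


def run_weekly_backup(source, backup_root):
    """Create today's snapshot, linking against the most recent older one."""
    stamp = date.today().isoformat()
    previous = latest_snapshot(backup_root, exclude=stamp)
    command = build_rsync_command(source, backup_root, stamp, previous)
    subprocess.run(command, check=True)
    return command
```

Saved as a script that calls run_weekly_backup() with your own paths at the bottom, it could be triggered by a crontab line such as `0 3 * * 0 /usr/bin/python3 /usr/local/bin/weekly_backup.py` (Sundays at 03:00; the script path is hypothetical).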

It's going to take some time regardless, if it is that large. I would run a compression job through cron and then a big ol' robocopy (Windows), or the robocopy equivalent on UNIX, of the compressed files, also through cron.
You may want to look into a RAID arrangement (RAID 1, particularly) to deal with this giant amount of data, so it doesn't have to be a "job" and is done implicitly. But whether you can implement that probably depends on your resources and your situation (you will have worse write times, much worse).

Related

How do I ensure data integrity of ISO images?

I want to create a long-term data archive of old stuff I don't need daily, but don't want to throw away either (e.g. all raw data of my thesis work). Optical media have failed me too often in the past, so now I am using an external USB disk and, to protect against accidental modification of the archive, I create ISO images of data batches and store these (and mount them on demand). The hard disk is NTFS formatted for portability (read/write for Linux and Windows, and at least readable for Macs).
My question is:
Are ISO images on external hard disks a good idea for long-term data archiving? How about bad disk sectors? It sure sounds easier for the OS to fsck a disk with 200 ISO images instead of 2,000,000 separate files, but is it? Should bad disk sectors be my primary worry when thinking about long-term archives?
Any ideas - or alternatives - for an affordable long-term data storage concept would be appreciated.

First of all, this question should be on SuperUser.
Nevertheless, your strategy is pretty solid. I would use disks in RAID for added protection.
If you want to make sure the ISOs haven't changed, you can take their md5sum when you store them and compare it to their md5sum when you retrieve them.
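The store-then-compare idea could be sketched roughly like this. The function names and the manifest file are hypothetical, and the sketch assumes the manifest lives in the same directory as the images; the manifest lines follow the "digest, two spaces, filename" layout that md5sum writes.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(iso_paths, manifest_path):
    """Record one 'digest  filename' line per image at archive time."""
    lines = [f"{file_md5(p)}  {Path(p).name}" for p in iso_paths]
    Path(manifest_path).write_text("\n".join(lines) + "\n")


def verify_manifest(manifest_path):
    """Return the names of images whose digest no longer matches the record."""
    manifest = Path(manifest_path)
    changed = []
    for line in manifest.read_text().splitlines():
        recorded, name = line.split("  ", 1)
        if file_md5(manifest.parent / name) != recorded:
            changed.append(name)
    return changed
```

An empty list from verify_manifest() means every image still matches what was recorded. Note that this only detects a change; it cannot repair one.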

You can use ISO image files, however they are neither very efficient nor in any way reliable. So the advantage of direct mount is only limited.
Maybe you need to combine both - store the ISOs in larger redundant archives (like http://en.wikipedia.org/wiki/Parchive) and then unpack them on demand (or just keep multiple copies, but then you should check and re-copy them regularly).

Is there a reverse-incremental backup solution with built-in redundancy (e.g. par2)?

I'm setting up a home server primarily for backup use. I have about 90GB of personal data that must be backed up in the most reliable manner, while still preserving disk space. I want to have full file history so I can go back to any file at any particular date.
Full weekly backups are not an option because of the size of the data. Instead, I'm looking along the lines of an incremental backup solution. However, I'm aware that a single corruption in a set of incremental backups makes the entire series (beyond a point) unrecoverable. Thus simple incremental backups are not an option.
I've researched a number of solutions to the problem. First, I would use reverse-incremental backups so that the latest version of the files would have the least chance of loss (older files are not as important). Second, I want to protect both the increments and backup with some sort of redundancy. Par2 parity data seems perfect for the job. In short, I'm looking for a backup solution with the following requirements:
Reverse incremental (to save on disk space and prioritize the most recent backup)
File history (kind of a broader category including reverse incremental)
Par2 parity data on increments and backup data
Preserve metadata
Efficient with bandwidth (bandwidth saving; no copying the entire directory over for each increment). Most incremental backup solutions should work this way.
This would (I believe) ensure file integrity and relatively small backup sizes. I've looked at a number of backup solutions already but they have a number of problems:
Bacula - Simple normal incremental backups
bup - incremental and implements par2 but isn't reverse incremental and doesn't preserve metadata
duplicity - incremental, compressed, and encrypted but isn't reverse incremental
dar - incremental and par2 is easy to add, but isn't reverse incremental and no file history?
rdiff-backup - almost perfect for what I need but it doesn't have par2 support
So far I think that rdiff-backup seems like the best compromise but it doesn't support par2. I think I can add par2 support to backup increments easily enough since they aren't modified each backup but what about the rest of the files? I could generate par2 files recursively for all files in the backup but this would be slow and inefficient, and I'd have to worry about corruption during a backup and old par2 files. In particular, I couldn't tell the difference between a changed file and a corrupt file, and I don't know how to check for such errors or how they would affect the backup history. Does anyone know of any better solution? Is there a better approach to the issue?
Thanks for reading through my difficulties and for any input you can give me. Any help would be greatly appreciated.

http://www.timedicer.co.uk/index
Uses rdiff-backup as the engine. I've been looking at it, but it requires me to set up a "server" using Linux or a virtual machine.
Personally, I use WinRAR to make pseudo-incremental backups (it actually makes a full backup of recent files) run daily by a scheduled task. It is similarly a "push" backup.
It's not a true incremental (or reverse-incremental) but it saves different versions of files based on when it was last updated. I mean, it saves the version for today, yesterday and the previous days, even if the file is identical. You can set the archive bit to save space, but I don't bother anymore as all I backup are small spreadsheets and documents.
RAR has its own parity or recovery record that you can set by size or by percentage. I use 1% (one percent).
It can preserve metadata, I personally skip the high resolution times.
It can be efficient since it compresses the files.
Then all I have to do is send the file to my backup. I have it copied to a different drive and to another computer in the network. No need for a true server, just a share. You can't do this for too many computers though as Windows workstations have a 10 connection limit.
So for my purpose, which may fit yours, it backs up daily the files that have been updated in the last 7 days. Then I have another scheduled backup, run once a month (every 30 days), that backs up files that have been updated in the last 90 days.
But I use Windows, so if you're actually setting up a Linux server, you might check out TimeDicer.
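For anyone who wants the "files updated in the last N days" selection without WinRAR, a rough sketch of that step could look like the following. The function name is made up, and it only picks the files; archiving them and adding a recovery record is left to whatever tool you use.

```python
import os
import time


def recently_modified(root, max_age_days, now=None):
    """Yield paths under root whose modification time is within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) >= cutoff:
                yield path
```

A daily job would call recently_modified(root, 7) and a monthly one recently_modified(root, 90), matching the two schedules described above.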

Since nobody was able to answer my question, I'll write a few possible solutions I found while researching the topic. In short, I believe the best solution is rdiff-backup to a ZFS filesystem. Here's why:
ZFS checksums all blocks stored and can easily detect errors.
If you have ZFS set to mirror your data, it can recover the errors by copying from the good copy.
This takes up less space than full backups, even though the data is copied twice.
The odds of an error in both the original and the mirror are tiny.
Personally I am not using this solution as ZFS is a little tricky to get working on Linux. Btrfs looks promising but hasn't been proven stable from years of use. Instead, I'm going with a cheaper option of simply checking hard drive SMART data. Hard drives should do some error checking/correcting themselves and by monitoring this data I can see if this process is working properly. It's not as good as additional filesystem parity but better than nothing.
A few more notes that might be interesting to people looking into reliable backup development:
par2 seems to be dated and buggy software. zfec seems like a much faster modern alternative. Discussion in bup occurred a while ago: https://groups.google.com/group/bup-list/browse_thread/thread/a61748557087ca07
It's safer to calculate parity data before even writing to disk, i.e. don't write to disk, read the data back, and then calculate parity from that. Do it from RAM, and check against the original for additional reliability. This might only be possible with zfec, since par2 is too slow.
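To illustrate the ordering in that last note, here is a small sketch that takes a digest from the in-memory data first, then writes, then reads back and compares. It swaps in a plain SHA-256 checksum for real parity data (zfec or par2), so it can only detect a bad write, not repair it, and the function name is made up. Also keep in mind that the read-back may be served from the OS cache rather than from the disk itself.

```python
import hashlib
import os


def write_verified(path, data):
    """Write data, then re-read it and compare with a digest taken from RAM."""
    expected = hashlib.sha256(data).hexdigest()  # computed before any disk I/O
    with open(path, "wb") as handle:
        handle.write(data)
        handle.flush()
        os.fsync(handle.fileno())  # ask the OS to push the data to the device
    with open(path, "rb") as handle:
        actual = hashlib.sha256(handle.read()).hexdigest()
    if actual != expected:
        raise IOError(f"{path}: data read back does not match what was written")
    return expected
```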

Which technology should be used for serving large number of static files?

My main aim is to serve a large number of XML files (more than 1bn, each under 1 KB) via a web server. The files can be considered static, as they will be modified by external code at relatively low frequency (about 50k updates per day). Files will be requested at high frequency (>30 req/sec).
The current suggestion from my team is to create a dedicated Java application that implements the HTTP protocol and uses memcached to speed things up, keeping all file data in an RDBMS and getting rid of the file system.
On the other hand, I think a tweaked Apache Web Server or lighttpd should be enough. Caching can be left to the OS or the web server's default caching. There is no point in keeping data in a DB if the same output is required and it is only queried by file name. I am not sure how memcached would work here. Also, updating an external cache (memcached) while updating a file via external code will add complexity.
One other question: if I choose to use files, is it possible to store them in a directory layout like /a/b/c/d.xml and access them via abcd.xml? Or should I put all 1bn files in a single directory (I am not sure whether the OS will allow it)?
This is NOT a website, but for an application API in closed network so Cloud/CDN is of no use.
I am planning to use CentOS + Apache/lighttpd. Suggest any alternative and best possible solution.
This is the only public note I found on this topic, and it is a little old too.

1bn files at 1KB each, that's about 1TB of data. Impressive. So it won't fit into memory unless you have very expensive hardware. It can even be a problem on disk if your file system wastes a lot of space for small files.
30 requests a second is far less impressive. It's certainly not the limiting factor for the network nor for any serious web server out there. It might be a bit of a challenge for a slow hard disk.
So my advice is: put the XML files on a hard disk and serve them with a plain vanilla web server of your choice. Then measure the throughput, and optimize if you don't reach 50 files a second. But don't invest in anything until you have shown it to be a limiting factor.
Possible optimizations are:
Find a better layout in the file system, i.e. distribute your files over enough directories so that you don't have too many files (more than 5,000) in a single directory.
Distribute the files over several hard disks so that they can be accessed in parallel.
Use faster hard disks.
Use solid state disks (SSD). They are expensive, but can easily serve hundreds of files a second.
If a large number of the files are requested several times a day, then even a slow hard disk should be enough, because your OS will have those files in the file cache. And with today's file cache sizes, a considerable part of your daily deliveries will fit into the cache, because at 30 requests a second you serve at most about 0.26% of all files a day.
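The arithmetic behind that estimate, as a quick sketch. It takes the question's figures (30 requests a second, 1bn files) and assumes each file is at most 1 KB, counted here as 1024 bytes; the worst case is that every request hits a different file.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_request_share(requests_per_second, total_files):
    """Upper bound on the fraction of all files requested in one day."""
    requests_per_day = requests_per_second * SECONDS_PER_DAY
    return min(1.0, requests_per_day / total_files)


requests_per_day = 30 * SECONDS_PER_DAY
share = daily_request_share(30, 1_000_000_000)
working_set_bytes = requests_per_day * 1024  # if every request were a distinct file

print(requests_per_day)                      # 2592000
print(f"{share:.4%}")                        # 0.2592%
print(f"{working_set_bytes / 1e9:.2f} GB")   # 2.65 GB
```

So even in the worst case, one day's worth of distinct files is a few gigabytes, which is small next to the roughly 1 TB total.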
Regarding distributing your files over several directories, you can hide this with an Apache RewriteRule, e.g.:
RewriteRule ^/xml/(.)(.)(.)(.)(.*)\.xml /xml/$1/$2/$3/$4/$5.xml
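The external code that updates the files needs the same mapping so that it writes each file where the web server will look for it. A sketch of that mapping follows; the function name is made up, and unlike the rule above the pattern here is anchored at the end of the path.

```python
import re

# Same capture groups as the RewriteRule: four single characters, then the rest.
_FLAT = re.compile(r"^/xml/(.)(.)(.)(.)(.*)\.xml$")


def sharded_path(url_path):
    """Map /xml/abcdefgh.xml to /xml/a/b/c/d/efgh.xml; return None if no match."""
    match = _FLAT.match(url_path)
    if match is None:
        return None
    a, b, c, d, rest = match.groups()
    return f"/xml/{a}/{b}/{c}/{d}/{rest}.xml"
```

For example, sharded_path("/xml/abcdefgh.xml") gives "/xml/a/b/c/d/efgh.xml". Names with fewer than four characters before the extension do not match and would need their own rule.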

Another thing you could look at is Pomegranate, which seems very similar to what you are trying to do.

I believe that a dedicated application with everything feeding off a memcache db would be the best bet.

How to configure a Firebird Database to run in memory

I'm running software called Fishbowl Inventory, which runs on a Firebird database (Windows Server 2003). At this time the Fishbowl software runs extremely slowly when more than one user accesses it. I'm thinking I may be able to speed up the application by forcing the database to run "in memory". However, I cannot find documentation on how to do this. Any help would be greatly appreciated.
Thank you in advance.
Robert

Firebird does not have memory tables - they may or may not be added in future versions (>3) but certainly not in the upcoming 2.5. There can be any number of other reasons why your software is slow with multiple users; however, Firebird itself has pretty good concurrency, so make sure you find the actual bottleneck first.

+1 to Holger. Find the bottleneck first.
Sinática Monitor may help you.

In-memory tables are nice either for OLAP (when data is not changing) or for temporary internal data storage.
In both cases data loss is not a danger.
It's a pity that FB has no in-memory mode. I am thinking about using SQLite as a result.
As for caching, I think a simple parallel thread that reads all the blocks of the database file would effectively make it in-memory, held in the OS cache, if the OS has enough memory.
But I also think that the OS has already cached as much of the DB file as it could, and aggressively forcing it to cache more would make overall performance even worse.
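For what it's worth, the "read all the blocks" idea is only a few lines. This is a sketch with a made-up function name; it assumes the database file can be opened for reading while the server is using it, which I have not checked for Firebird on Windows, and whether the OS actually keeps the pages cached afterwards is up to the OS.

```python
def warm_file_cache(path, chunk_size=4 * 1024 * 1024):
    """Read a file sequentially and discard the data; return the bytes read."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

Measure before and after; as said above, it could just as well evict pages that were more useful.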

I read an article some time ago by someone who set up a memory drive (like in old DOS) and ran a database there. The problem is that if anything fails, you lose everything. You would have to do backups very often to ensure a minimum of safety.
Not a good idea at all I think.

How do you handle off-site backups of terabytes of data?

I have terabytes of files and database dumps that I need to backup off-site.
What's the best way to accomplish this?
I'm currently weighing rsync to Amazon EBS or getting an appliance (e.g. Barracuda).
I called a buddy of mine, and he said he uses Bacula to get all the files on a single disk, then backs that disk up to tape, then sends the tapes off to Iron Mountain.
Still waiting to hear back from other sysadmins I've contacted. Will post results here.

One common solution to offsite backups that is worth considering is performing the backup onsite and then physically transporting the backup elsewhere, either via secure snail mail or with a service designed for that purpose. If bandwidth is an issue, this may be more practical.
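To judge whether bandwidth is an issue, a back-of-the-envelope estimate helps. This is only a sketch: the function name is made up, and the data size, link speed, and 80% efficiency factor in the example are assumptions, not figures from the question.

```python
def transfer_days(data_bytes, link_megabits_per_second, efficiency=0.8):
    """Days needed to push data_bytes over a link at a sustained rate."""
    usable_bits_per_second = link_megabits_per_second * 1_000_000 * efficiency
    seconds = data_bytes * 8 / usable_bits_per_second
    return seconds / 86400


# Hypothetical case: 2 TB over a 100 Mbit/s uplink at 80% efficiency.
print(f"{transfer_days(2e12, 100):.1f} days")  # 2.3 days
```

If the answer comes out in days or weeks for your data and your uplink, shipping disks or tapes starts to look reasonable.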

Instead of tapes, I use hard drives that I physically swap out every week. It is less expensive than tape equipment, and easier to plug into another system when necessary.

Back in the late 80s I worked at a place where every Monday we received a box of tapes of various sorts. We would do one set of weekly backups on the tapes in that box and send them off-site. Evidently they had two of these boxes: one that was in our office and another that they kept locked up somewhere. Then we got an Exabyte drive whose single-tape capacity was greater than that whole box of TK-50s, QIC-40s and mag tapes, and it was just simpler to send a single tape home with one of the managers every week.
I'm sure there are still off-site backup systems like that, but I find it easier to keep cycling a couple of 500 GB drives from my home system to my desk at work.

Why not encrypt it and actually upload it to a third-party vendor?
I am thinking of doing this with my data at home but have not found a vendor that will just let me do a dump...They all want to install client side apps...
Admittedly, I have not looked that hard...

We use a couple of solutions. We have an offsite backup with another company. We also use several portable hard drives and swap them out each day. Neither solution really handles multiple terabytes of data, though; more like gigabytes.
In the future, however, we will probably be looking at going the tape route, or something else that is similarly permanent and storable. Terabytes of data is too much to transfer over the wire. When Blu-ray discs become reasonably priced and commercially viable, it may be a good idea to look into the 400 GB discs that were touted not long ago. Those would be extremely storage friendly (both in the physical sense and the file size sense) and, depending on the longevity stats, may keep for a while, similar to tapes.

I would recommend using a local SAN from a company like EMC that provides compressed, snapshot-based replication to remote facilities. It's an expensive solution, but it works.
http://www.emc.com/products/family/emc-centera-family.htm

Over the weekend, I've heard back from a couple of my sysadmin buddies.
It seems the best practice is to backup all machines to a central large disk, then back that disk up to tape, then send the tapes off site (all have used Iron Mountain).
Tapes hold 400-800 GB and cost $30-$80 per tape.
A tape changer seems to go for $10k on up.
Not sure how much the off-site shipping costs.

I'm scared of tape. I think it gives a false sense of data security. In my own experience from backing up dozens of terabytes across hundreds of tapes, we discovered that the data recovery rate after a few years fell to about 70%.
To be fair, that was with a now discontinued technology (AIT), but it pretty much put me off tape for life unless it sits on a 1" spool and is reassuringly expensive.
These days, multiple hard drives, multiple locations, and yes, a fallback to Amazon S3 or another cloud provider do no harm (apart from being a tad expensive).
