How do you handle off-site backups of terabytes of data? - backup

I have terabytes of files and database dumps that I need to backup off-site.
What's the best way to accomplish this?
I'm currently weighing rsync to Amazon EBS against getting an appliance (e.g. Barracuda).
I called a buddy of mine, and he said he uses Bacula to get all the files onto a single disk, then backs that disk up to tape, then sends the tapes off to Iron Mountain.
Still waiting to hear back from other sysadmins I've contacted. Will post results here.

One common solution to offsite backups that is worth considering is performing the backup onsite and then physically transporting the backup elsewhere, either via secure snail mail or with a service designed for that purpose. If bandwidth is an issue, this may be more practical.

Instead of tapes, I use hard drives that I physically swap out every week. It is less expensive than tape equipment, and easier to plug into another system when necessary.

Back in the late 80s I worked at a place where we received a box of tapes of various sorts every Monday - we would do one set of weekly backups onto the tapes in that box and send them off-site. Evidently they had two of these boxes: one in our office, and the other kept locked up somewhere. Then we got an Exabyte drive whose single-tape capacity was greater than that whole box of TK-50s, QIC-40s and mag tapes, and it was just simpler to send a single tape home with one of the managers every week.
I'm sure there are still off-site backup systems like that, but I find it easier to keep cycling a couple of 500 GB drives between my home system and my desk at work.

Why not encrypt it and actually upload it to a third-party vendor?
I am thinking of doing this with my data at home but have not found a vendor that will just let me do a dump... They all want to install client-side apps...
Admittedly, I have not looked that hard...
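For what it's worth, the "just let me do a dump" part is easy to script yourself. Here is a minimal sketch in Python of encrypting a dump locally and pushing only the ciphertext to an S3-compatible bucket; the bucket name and key file are illustrative assumptions, not a vendor recommendation.

import boto3
from cryptography.fernet import Fernet

def encrypt_and_upload(path, bucket="offsite-backups", key_file="backup.key"):
    # Load (or create) a symmetric key -- keep this key out of the remote store.
    try:
        key = open(key_file, "rb").read()
    except FileNotFoundError:
        key = Fernet.generate_key()
        open(key_file, "wb").write(key)

    cipher = Fernet(key)
    with open(path, "rb") as f:
        ciphertext = cipher.encrypt(f.read())  # fine for dumps that fit in RAM

    encrypted_path = path + ".enc"
    with open(encrypted_path, "wb") as f:
        f.write(ciphertext)

    # The provider only ever sees ciphertext.
    boto3.client("s3").upload_file(encrypted_path, bucket, encrypted_path)

For multi-gigabyte dumps you would want streaming encryption and multipart uploads rather than reading the whole file into memory, but the idea is the same.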

We use a couple of solutions. We have an offsite backup arrangement with another company. We also use several portable hard drives and swap them out each day. Neither solution really handles multiple terabytes of data - more like gigabytes.
In the future, however, we will probably be looking at going the tape route, or something else that is similarly permanent and storable. Terabytes of data is too much to transfer over the wire. When Blu-ray discs become reasonably priced and commercially viable, it may be worth looking into the 400 GB discs that were touted not long ago. Those would be extremely storage friendly (both in the physical sense and the capacity sense), and depending on the longevity stats, may keep for a while, similar to tapes.

I would recommend a local SAN from a company like EMC that provides compressed, snapshot-based replication to remote facilities. It's an expensive solution, but it works.
http://www.emc.com/products/family/emc-centera-family.htm

Over the weekend, I've heard back from a couple of my sysadmin buddies.
It seems the best practice is to back up all machines to a central large disk, then back that disk up to tape, then send the tapes off-site (all of them have used Iron Mountain).
Tapes hold 400-800 GB and cost $30-$80 per tape.
A tape changer seems to go for $10k on up.
Not sure how much the off-site shipping costs.

I'm scared of tape. I think it gives a false sense of data security. In my own experience backing up dozens of terabytes across hundreds of tapes, we discovered that the data recovery rate after a few years fell to about 70%.
To be fair, that was with a now discontinued technology (AIT), but it pretty much put me off tape for life unless it sits on a 1" spool and is reassuringly expensive.
These days, multiple hard drives, multiple locations, and yes, a fallback to Amazon S3 or another cloud provider does no harm (apart from being a tad expensive).

Related

Object storage for a web application

I am currently working on a website where roughly 40 million documents and images should be served to its users. I need suggestions on which method is most suitable for storing content, subject to these requirements.
The system should be highly available, scalable and durable.
Files have to be stored permanently and users should be able to modify them.
Due to client restrictions, 3rd party object storage providers such as Amazon S3 and CDNs are not suitable.
File size of content can vary from 1 MB to 30 MB. (However about 90% of the files would be less than 2 MB)
Content retrieval latency is not much of a problem. Therefore indexing or caching is not very important.
I did some research and found the following solutions:
Storing content as BLOBs in databases.
Using GridFS to chunk and store content.
Storing content in a file server in directories using a hash and storing the metadata in a database.
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
The website is developed using PHP and Couchbase Community Edition is used as the database.
I would really appreciate any input.
Thank you.
I have been working on a similar system for the last two years; the work is still in progress. However, the requirements are slightly different from yours: modifications are not possible (I will try to explain why later), file sizes range from several bytes to several megabytes, and, most importantly, deduplication must be implemented at both the document and block levels. If two different users upload the same file to the storage, only one copy of the file should be kept. Likewise, if two different files partially overlap, only one copy of the common part should be stored.
But let's focus on your requirements, where deduplication is not a concern. First of all, high availability implies replication. You'll have to store each file in several replicas (typically 2 or 3, though techniques such as erasure coding can reduce the storage overhead) on independent machines in order to stay alive if one of the storage servers in your backend dies. Also, given your estimate of the data volume, it's clear that all your data just won't fit on a single server, so vertical scaling is out and you have to consider partitioning. Finally, you need concurrency control to avoid race conditions when two different clients try to write or update the same data simultaneously. This topic is close to the concept of transactions (I don't mean ACID literally, but something close). So, to summarize, these facts mean that you're actually looking for a distributed database designed to store BLOBs.
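To make the partitioning point concrete, here is a minimal sketch in Python of hash-based placement - a bare-bones consistent-hash ring. The node names and replica count are illustrative assumptions; real systems add virtual nodes and handle membership changes, which this leaves out.

import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas  # how many copies of each blob to keep
        self.ring = sorted(
            (int(hashlib.md5(n.encode()).hexdigest(), 16), n) for n in nodes
        )

    def nodes_for(self, blob_id):
        # Walk the ring clockwise from the blob's hash to pick replica holders.
        h = int(hashlib.md5(blob_id.encode()).hexdigest(), 16)
        idx = bisect.bisect(self.ring, (h, ""))
        picked = []
        for i in range(len(self.ring)):
            node = self.ring[(idx + i) % len(self.ring)][1]
            if node not in picked:
                picked.append(node)
            if len(picked) == self.replicas:
                break
        return picked

ring = HashRing(["store-1", "store-2", "store-3", "store-4"])
print(ring.nodes_for("user42/photo.jpg"))  # three of the four nodes

The useful property is that adding or removing a node only remaps the keys adjacent to it on the ring, rather than reshuffling everything.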
One of the biggest problems in distributed systems is maintaining the global state of the system. In brief, there are two approaches:
Elect a leader that communicates with the other peers and maintains the global state of the distributed system. This approach provides strong consistency and linearizability guarantees. The main disadvantage is that the leader becomes a single point of failure. If the leader dies, either some observer must assign the leader role to one of the replicas (the common case for master-slave replication in the RDBMS world), or the remaining peers must elect a new one (algorithms like Paxos and Raft are designed for this). Either way, almost all incoming traffic goes through the leader, which leads to "hot spots" in the backend: CPU and IO load is unevenly distributed across the system. Incidentally, Raft-based systems have very low write throughput (check the etcd and Consul limitations if you are interested).
Avoid global state altogether. Weaken the guarantees to eventual consistency and disallow in-place updates: if someone wants to edit a file, it is saved as a new file. Organize the system as a peer-to-peer network: no peer keeps full track of the system, so there is no single point of failure. This yields high write throughput and good horizontal scalability.
So now let's discuss the options you've found:
Storing content as BLOBs in databases.
I don't think a traditional RDBMS is a good place to store files, because it is optimized for structured data and strong consistency, and you need neither of those here. You'll also have difficulties with backups and scaling. People usually don't use an RDBMS this way.
Using GridFS to chunk and store content.
GridFS is built on top of MongoDB, which is a document-oriented database designed to store JSON documents, not BLOBs. MongoDB also had clustering problems for many years and only passed the Jepsen tests in 2017, which may mean its clustering is not yet mature. Run performance and stress tests if you go this way.
Storing content in a file server in directories using a hash and storing the metadata in a database.
This option means that you need to develop object storage on your own. Consider all the problems I've mentioned above.
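That said, the basic layout is not a lot of code. Here is a minimal sketch of option 3 in Python - content-addressed files fanned out into hashed directories, with a metadata row kept in a database. The storage root is an assumption, and sqlite stands in for whatever metadata store you actually use (Couchbase in your case).

import hashlib
import os
import shutil
import sqlite3

STORE_ROOT = "/var/data/blobstore"  # assumed mount point for the file servers

def store_file(src_path, db):
    # Hash the content in chunks so large files don't need to fit in memory.
    sha = hashlib.sha256()
    with open(src_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha.update(chunk)
    digest = sha.hexdigest()

    # Fan out into subdirectories so no single directory holds millions of files.
    dest_dir = os.path.join(STORE_ROOT, digest[:2], digest[2:4])
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, digest)
    if not os.path.exists(dest_path):  # identical content is stored only once
        shutil.copy2(src_path, dest_path)

    # The database keeps the user-visible name and points at the content address.
    db.execute(
        "INSERT INTO files (name, digest, size) VALUES (?, ?, ?)",
        (os.path.basename(src_path), digest, os.path.getsize(src_path)),
    )
    db.commit()
    return digest

db = sqlite3.connect("metadata.db")
db.execute("CREATE TABLE IF NOT EXISTS files (name TEXT, digest TEXT, size INTEGER)")
# digest = store_file("/path/to/upload.pdf", db)

Note that everything warned about above (replication, partitioning, concurrent writers) still lands on you: this sketch only covers a single node.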
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
I have used neither of these solutions, but HDFS looks like overkill, because you become dependent on the Hadoop stack. I have no idea about GlusterFS performance. Always consider the design of a distributed file system: if it has some kind of dedicated "metadata" server, treat that as a single point of failure.
Finally, my thoughts on the solutions that may fit your needs:
Elliptics. This object storage is not well known outside the Russian part of the Internet, but it's mature and stable, and its performance is excellent. It was developed at Yandex (the Russian search engine) and a lot of Yandex services (Disk, Mail, Music, picture hosting and so on) are built on top of it. I used it in a previous project; it may take your ops team some time to get into it, but it's worth it if you're OK with the GPL license.
Ceph. This is a real object store. It's also open source, but it seems that only Red Hat people know how to deploy and maintain it, so be prepared for vendor lock-in. I have also heard that its configuration is quite complicated. I have never used it in production, so I can't speak to its performance.
Minio. This is an S3-compatible object store, under active development at the moment. I have never used it in production, but it seems well designed.
You may also check wiki page with the full list of available solutions.
And a last point: I strongly recommend against OpenStack Swift (there are a lot of reasons why, but first of all, Python is just not a good fit for these purposes).
One probably-relevant question, whose answer I do not readily see in your post, is this:
How often do users actually "modify" the content?
and:
When and if they do, how painful is it if a particular user is served "stale" content?
Personally (and, "categorically speaking"), I prefer to tackle such problems in two stages: (1) identifying the objects to be stored – e.g. using a database as an index; and (2) actually storing them, this being a task that I wish to delegate to "a true file-system, which after all specializes in such things."
A database (it "offhand" seems to me ...) would be a very good way to handle the logical ("as seen by the user") taxonomy of the things which you wish to store, while a distributed filesystem could handle the physical realities of storing the data and actually getting it to where it needs to go, and your application would be in the perfect position to gloss-over all of those messy filesystem details . . .

What is a good access time to a database (SQL)?

Hey, I want to know what counts as a good access time, because I'm searching for a good SQL database and HSQLDB says their access time is 12 ms - is that good?
I think it would depend on your needs. Is it for a web server or a desktop application? The amount of data is also important, because reading lots of small records will perform differently than reading a few large records. Access time is also based upon your hardware, software and maybe even some other factors.
For example, you can use a database with lightning-fast access, but if your users need to connect to it over a 5 megabit VPN connection, passing through three different proxies and with traffic going worldwide, your database would just be a waste of power.
Basically, the claimed figure is a marketing thing. It's a good product, but don't focus only on access time. Make sure you also look at your other needs. Another system might perform better, even with a slower access time, because it is more optimized in reading its indices and so on.
So, what do you want, exactly?
I don't think access time tells you anything, really. If you have slow or incorrectly configured storage, then this access time metric will be dwarfed by how much time is spent on waits and split I/Os. Network latency is also a factor, since I'm guessing you probably won't want to have your code on the same machine as your database, and you will most likely have a few network devices you'll need to traverse in your production environment.
In my experience, all the major database platforms these days will perform adequately if configured correctly and paired with a complementary application. Pick the DBMS that best fits your requirements, follow the best practices for configuring that DBMS on your hardware, and you should be pleased with the outcome.
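If you want a number you can actually trust, measure it against your own schema and hardware rather than relying on the vendor's figure. Here is a minimal timing-harness sketch in Python; sqlite3 is used purely as a stand-in, so swap in your own driver, connection and query.

import sqlite3
import statistics
import time

def measure(conn, query, runs=100):
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(query).fetchall()
        timings.append((time.perf_counter() - start) * 1000)  # milliseconds
    return statistics.median(timings), max(timings)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
conn.executemany("INSERT INTO t (v) VALUES (?)", [("x",)] * 10000)
median_ms, worst_ms = measure(conn, "SELECT * FROM t WHERE id = 5000")
print(f"median {median_ms:.2f} ms, worst {worst_ms:.2f} ms")

Run it from the machine your application will actually live on; that way network hops and proxies are included in the number, which is exactly what a quoted 12 ms does not tell you.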

Backups for online businesses - better external hard drives or tape drives

We are an online business. Currently we are using DVDs for our backups. The problem is we are running out of space.
I guess there are two alternatives here:
external hard disk drives
tape drives
The important point is we want to carry the backup with us from the office to home every day.
Which alternative do you think would suit best our needs?
The important point is we want to carry the backup with us from the office to home every day.
This is on the level of "do we want to go to work or not". Backups that are not stored externally ARE NOT BACKUPS. Ever heard of buildings collapsing? Burning down?
You NEED backups that are kept at least far enough away to survive a larger fire.
Both scenarios are feasible. Tapes have more/larger capacity if you grow.
Also remember, soonish we get... writeable 100 GB Blu-ray discs ;)

TPC or other DB benchmarks for SSD drives

I have been interested in SSD drives for quite some time. I do a lot of work with databases, and I've been quite interested to find benchmarks such as TPC-H performed with and without SSD drives.
On the surface it sounds like there should be one, but unfortunately I have not been able to find any. The closest I've found to an answer was the first comment in this blog post.
http://dcsblog.burtongroup.com/data_center_strategies/2008/11/intels-enterprise-ssd-performance.html
The fellow who wrote it seemed to be a pretty big naysayer when it came to SSD technology in the enterprise, due to a claim of lack of performance with mixed read/write workloads.
There have been other benchmarks, such as this and this, that show absolutely ridiculous numbers. While I don't doubt them, I am curious whether what said commenter in the first link wrote was in fact true.
Anyway, if anybody can find benchmarks of databases on SSDs, that would be excellent.
I've been testing and using them for a while and, whilst I have my own opinions (which are very positive), I think that Anandtech.com's testing document is far better than anything I could have written; see what you think:
http://www.anandtech.com/show/2739
Regards,
Phil.
The issue with SSDs is that they make real sense only when the schema is normalized to 3NF or 5NF, thus removing "all" redundant data. Moving a "denormalized for speed" mess to SSD will not be fruitful; the mass of redundant data will make SSD too cost prohibitive.
Doing that for an existing application means redefining the existing tables (references) as views, encapsulating the normalized tables behind the curtain. There is a time penalty on the engine's CPU to synthesize rows. The more denormalized the original schema, the greater the benefit of refactoring and moving to SSD. Even on SSD, denormalized schemas will likely run slower, due to the mass of data which must be retrieved and written.
Putting logs on SSD is not advisable; logging is a sequential, write-mostly (write-only under normal circumstances) operation, and the physics of flash-based SSDs make it a poor fit (a company named Texas Memory Systems has long built RAM-based subsystems, which are a different matter). Conventional spinning-rust drives, duly buffered, will do fine.
Note the AnandTech articles: the Intel drive was the only one that worked right. That will likely change by the end of 2009, but as of now only the Intel drives qualify for serious use.
I've been running a fairly large SQL2008 database on SSDs for 9 months now. (600GB, over 1 billion rows, 500 transactions per second). I would say that most SSD drives that I tested are too slow for this kind of use. But if you go with the upper end Intels, and carefully pick your RAID configuration, the results will be awesome. We're talking 20,000+ random read/writes per second. In my experience, you get the best results if you stick with RAID1.
I can't wait for Intel to ship the 320GB SSDs! They are expected to hit the market in September 2009...
The formal TPC benchmarks will probably take a while to appear using SSD because there are two parts to the TPC benchmark - the speed (transactions per unit time) and the cost per (transaction per unit time). With the high speed of SSD, you have to scale the size of the DB even larger, thus using more SSD, and thus costing more. So, even though you might get superb speed, the cost is still prohibitive for a fully-scaled (auditable, publishable) TPC benchmark. This will remain true for a while yet, as long as SSD remains more expensive than the corresponding quantity of spinning disk.
Commenting on:
"...quite interested to find benchmarks such as TPC-H performed with and without SSD drives."
(FYI and full disclosure, I am pseudonymously "J Scouter", the "pretty big naysayer when it came to SSD technology in the enterprise" referred to and linked above.)
So....here's the first clue to emerge.
Dell and Fusion-IO have published the first EVER audited benchmark using a Flash-memory device for storage.
The benchmark is the TPC-H, which is a "decision support" benchmark. This is important because TPC-H entails an almost exclusively "read-only" workload pattern -- perfect context for SSD as it completely avoids the write performance problem.
In the scenarios painted for us by the Flash SSD hypesters, this application represents a soft-pitch, a gentle lob right over the plate and an easy "home run" for a Flash-SSD database application.
The results? The very first audited benchmark for a flash SSD based database application, and a READ ONLY one at that resulted in (drum roll here)....a fifth place finish among comparable (100GB) systems tested.
This Flash SSD system produced about 30% as many Queries-per-hour as a disk-based system result published by Sun...in 2007.
Surely though it will be in price/performance that this Flash-based system will win, right?
At $1.46 per Query-per-hour, the Dell/Fusion-IO system finishes in third place. More than twice the cost-per-query-per-hour of the best cost/performance disk-based system.
And again, remember this is for TPC-H, a virtually "read-only" application.
This is pretty much exactly in line with what the MS Cambridge Research team discovered over a year ago -- that there are no enterprise workloads where Flash makes ROI sense from an economic or energy standpoint.
Can't wait to see TPC-C, TPC-E, or SPC-1, but according to the research paper linked above, SSDs will need to become orders of magnitude cheaper before they ever make sense in enterprise apps.

Offsite backups [closed]

I was recently tasked with coming up with an offsite backup strategy. We have about 2TB of data that would need to be backed up so our needs are a little out of the norm.
I looked into Iron Mountain and they wanted $12,000 a month!
Does anyone have any suggestions on how best to handle backing up this much data on a budget (like a tenth of Iron Mountain)? How do other companies afford to do this?
Thanks!
UPDATE
Ironically enough, I just had the sort of devastating failure we're all talking about. My BES server failed, and then 2 days later 2 drives in my Exchange server's RAID 5 died (2!!!??!). I'm currently in the process of rebuilding my network, and backup integrity is definitely an issue.
At least now my bosses are paying attention :)
You can buy external eSATA RAID boxes in the 8TB capacity range for $2600. I'm not saying that particular product is the right choice, but that's the kind of box that will do 6TB in RAID5 and still be portable enough to buy a couple of them and rotate them through the bank, like Stu says.
Obviously, if you have to keep 7 individual days' worth, plus 14-, 30- and 90-day snapshots, etc., then things are going to be much more expensive, but it's certainly doable if what you're after is just disaster recovery.
The biggest thing to make sure is part of your plan is actually testing the restoration from the backup. That seems to get overlooked WAY too often and turns out to be the weakest link in nearly all of the strategies.
You should plan for scheduled restorations as often as is reasonable, where you actually dump the real data and restore from the backup. Without that, you don't know that it will work when you NEED it to.
I've lost track of the number of times I've been in a company where there's a big rack full of backup tapes/drives, all dutifully made according to the schedule only to find out that NONE of them have valid data when the server gets wiped out.
The more ways you can verify the integrity of the backups the better, but nothing substitutes for doing an actual dump/load from one of your backups to really test the setup.
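As a starting point, here is a minimal sketch in Python of an automated restore check: restore the backup into a scratch directory, then compare it against the live tree by checksum. The paths are illustrative, and files changed since the backup was taken will show up as expected differences.

import hashlib
import os

def tree_checksums(root):
    # Map relative path -> sha256 of contents for every file under root.
    sums = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(full, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            sums[os.path.relpath(full, root)] = h.hexdigest()
    return sums

def verify_restore(live_root="/srv/data", restored_root="/srv/restore-test"):
    live, restored = tree_checksums(live_root), tree_checksums(restored_root)
    missing = set(live) - set(restored)
    mismatched = {p for p in live if p in restored and live[p] != restored[p]}
    return missing, mismatched

Wire something like this into a scheduled job and alert on the result; the point is that the restore itself gets exercised, not just the backup job.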
Amazon S3 might fit your budget better. I don't know if there is software available to automate the backup process, but it's rather easy to write your own code to handle this. Here's their pricing calculator.
According to my estimates you're going to be well under the $1000/mo range.
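For a rough sanity check of that estimate, here is the back-of-the-envelope version in Python, with the per-GB prices as explicit assumptions - plug in the current rate card before trusting the output.

STORAGE_GB = 2000         # roughly 2 TB retained
PRICE_STORAGE = 0.15      # assumed $/GB-month
PRICE_TRANSFER_IN = 0.10  # assumed $/GB uploaded
MONTHLY_UPLOAD_GB = 200   # assumed incremental churn per month

storage_cost = STORAGE_GB * PRICE_STORAGE
transfer_cost = MONTHLY_UPLOAD_GB * PRICE_TRANSFER_IN
print(f"~${storage_cost + transfer_cost:.0f}/month")  # ~$320 with these numbers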
You really have to assess the true value of your data. If you lost it tomorrow, what impact would it have on your business? We use offsite backups; it isn't cheap, but if we were to lose our data the business would cease to trade within 2-3 days.
We considered on-site backups as a possible cost saver but in my experience with data centres/computer rooms over the last ten years (as both an employee and a customer) I've seen fires, fire suppression system malfunctions (wet), hardware theft and one day a car crashed through an external wall right into the suite. Add to that our last DC was located at Heathrow, right next to the runways....you never know what strange things can happen (remember the BA 777 that got caught short of the runway on landing?).
My advice, assess the value of the data then decide if $12k is too rich to keep it safe.
2TB is chump change nowadays.
Look into hard-drive based hot-swappable backup machines, and rent a box at your local bank:
http://www.high-rely.com/ (there are many more products such as this, but my Google-time is limited).
Jungle Disk is one such piece of software that can automate the backup process to Amazon S3. I use it for backup at home, but I guess it could work just as well from a server. Also, there are probably other backup tools that make use of S3 for offsite storage.
We've been using DataDomain appliances for that purpose for about 2 years. They're not inexpensive, but compared to $12,000/month they'd pay for themselves pretty quickly.
Basically, we send our backups over NFS and CIFS to one DataDomain appliance, it deduplicates the data and then replicates the differences to the other appliance we have at a remote site.
As for pure online solutions, make sure you do some back-of-the-envelope calculations first. For example, if you have 2 TB of churn a month, you are going to saturate a 1 Mbit/s Internet connection many times over just with your backup traffic!
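To put a number on that, a quick calculation of the sustained bit rate needed to push a given monthly churn over the wire:

CHURN_TB_PER_MONTH = 2
bits = CHURN_TB_PER_MONTH * 1e12 * 8
seconds = 30 * 24 * 3600
print(f"{bits / seconds / 1e6:.1f} Mbit/s sustained")  # ~6.2 Mbit/s for 2 TB/month

So 2 TB of monthly churn needs roughly 6 Mbit/s around the clock, before adding any headroom for retries or daytime traffic.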
As previously mentioned, Amazon S3 is definitely an option, but it may be cheaper in the long run to own the hardware you are backing up to.
For example:
Buy a basic server and an eSATA RAID 5 setup with 2-3 times the capacity you currently need, then install it at a colocation center, preferably one with high but cheap bandwidth.
This way the server and storage is off-site, but after the initial cost of the hardware, you are only paying for bandwidth.
Granted, the downside to this is that, unlike something like S3, if the hardware goes down you have to go fix it yourself, or pay the CoLo people to. But this may be a tradeoff you are willing to make.
Also, with this solution, you are still going to need a beefy upload pipe to handle the traffic... so there's always the "sneakernet" solution.
I've used bqbackup.com for 1-2 years with no problems. You can do a nightly sync using rsync. I wanted to add that their prices are dirt cheap, and I now have close to 1 TB with them.