Query about EBS Backed Instances + Backup on S3 + Snapshots

I've spent a number of days looking into putting up two Windows servers on Amazon, a domain controller and a Remote Desktop Services server, but there are a few questions I can't find detailed answers (or any answers) for:
1) When you have an EBS-backed instance, I assume this means that all files (OS, applications, pagefile, etc.) are stored on EBS? Physically in the datacentre, let's assume I have 50 GB of OS files/application data: is this all stored on just one SAN-type device? What happens if that device blows up, or if that particular data centre gets destroyed? Is the data replicated elsewhere? What is the probability that your entire EBS volume can just disappear?
2) As I understand it you can back up your EBS volume to S3 with snapshotting. I assume you can choose how often to snapshot (say, daily?). In my scenario above, if I have 50 GB of files and snapshot once a day, over 7 days will my S3 storage be 350 GB, or will it be 50 GB plus the incremental changes I have made over the week?
3) I remember reading somewhere that the instance has to go offline to snapshot. If that is the case, does it do this by shutting down the guest OS, snapshotting, then booting up again, or does it just detach the volume, prevent you from connecting while it snapshots, then bring it back to the exact state it was in before the snapshot?
4) I understand the concept of paying per month per GB of space, but I am concerned about the $0.11 per 1 million I/O requests. How does that work when I am running a Windows server? I have no idea how many I/O requests a server makes to its disks; I assume a lot, since the entire VM is stored on an EBS volume. Is running a server on standard EBS going to slow it down radically?
5) Are people using snapshots to S3 as their main backup, or are people running other types of backup for their data?
Sorry for the noob questions - I'd appreciate any partial answers, answers or advice anyone could offer me. Thanks in advance!

1) Amazon is fuzzy on this. They say that data is replicated within the Availability Zone it belongs to, and that if you have less than 20 GB of data changed since the last snapshot, your annual failure rate is roughly 0.1-0.4%.
2) Snapshots are triggered manually (or by scripts/API calls you schedule yourself) and are stored incrementally, so a week of daily snapshots is much closer to 50 GB plus your changed blocks than to 350 GB.
3) The instance doesn't have to go offline, but consistency depends on your filesystem. For example, on a Linux box with an XFS volume you can freeze I/O to the volume, trigger the snapshot (it returns in a second or so), and then unfreeze. If you take a snapshot without doing something similar, you run the risk of the data being captured in an inconsistent state.
4) I run all my instances on EBS. You probably wouldn't want your pagefile on EBS, though; it would make more sense to use instance storage for that. The I/O count depends heavily on your workload: an application server does far fewer IOPS than a database server, for example. You're unlikely to spend more than a few dollars a month per volume unless you're running particularly I/O-heavy workloads (a rough back-of-envelope calculation follows after this list).
5) Personally I don't care about the installed software/configuration (I have AMIs with all of that set up, so I can restore it in minutes); I only care about the data, which I back up separately (to S3 and Glacier). Partly that's because I was bitten by an EBS bug about a year ago where they lost some snapshots.
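To put rough numbers on point 4, here is a quick back-of-envelope calculation in Python. It uses the $0.11 per 1 million requests figure from the question, and the 10 IOPS average is purely an assumption for a fairly idle server; measure your own with Windows Performance Monitor (disk transfers/sec) before trusting any estimate.

# Rough EBS I/O cost estimate; price and IOPS figures are assumptions.
PRICE_PER_MILLION = 0.11          # $ per 1 million I/O requests (from the question)
avg_iops = 10                     # assumed average I/O operations per second
seconds_per_month = 30 * 24 * 3600

requests = avg_iops * seconds_per_month            # about 25.9 million requests
cost = requests / 1_000_000 * PRICE_PER_MILLION    # about $2.85 per month

print(f"{requests / 1e6:.1f}M requests/month -> about ${cost:.2f}/month")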
You can also use multiple strategies, as Fantius commented. For example, on the MongoDB servers I run, the boot volume is small (and never snapshotted or backed up, since it can be restored automatically from an AMI), with a separate data volume containing the actual MongoDB data. That data volume is snapshotted, and dumps are also stored on S3. Snapshots are an efficient way of creating backups (since you're only storing incremental changes); however, you can't transfer them out of your EC2 region, whereas a tarball on S3 can easily be copied anywhere.
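For what it's worth, here is a minimal boto3 sketch of that data-volume routine. The region, volume ID, bucket and dump path are placeholders, and in practice you would quiesce writes (e.g. with fsfreeze or a database-level lock, as in point 3) just before the snapshot call.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # example region
s3 = boto3.client("s3")

DATA_VOLUME_ID = "vol-0123456789abcdef0"   # placeholder: your data volume
BUCKET = "my-backup-bucket"                # placeholder: your S3 bucket

# 1) Incremental snapshot of the data volume.
#    For a consistent image, quiesce writes first (fsfreeze / db.fsyncLock()).
snap = ec2.create_snapshot(
    VolumeId=DATA_VOLUME_ID,
    Description="nightly data-volume snapshot",
)
print("started snapshot", snap["SnapshotId"])

# 2) Independent copy of a dump as a tarball on S3, which (unlike the snapshot)
#    can be downloaded or copied anywhere.
s3.upload_file("/backups/mongodump-latest.tar.gz", BUCKET,
               "mongodump/mongodump-latest.tar.gz")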

Related

DynamoDB backup and restore using Data Pipeline: how long does it take to back up and recover?

I'm planning to use Data Pipeline as a backup and recovery tool for our DynamoDB tables. We will be using Amazon's prebuilt pipelines to back up to S3, and the prebuilt recovery pipeline to restore to a new table in case of a disaster.
This will also serve the dual purpose of data archival for legal and compliance reasons. We have explored snapshots, but they can get quite expensive compared to S3. Does anyone have an estimate of how long it takes to back up a 1 TB database, and how long it takes to recover one?
I've read the Amazon docs, and they say it can take up to 20 minutes to restore from a snapshot, but there is no mention of how long a Data Pipeline restore takes. Does anyone have any clues?
Does the newly released feature of exporting from DynamoDB to S3 do what you want for your use case? To use this feature, you must have continuous backups enabled though. Perhaps that will give you the short term backup you need?
It would be interesting to know why you're not planning to use the built-in backup mechanism. It offers point-in-time recovery, and it is highly predictable in terms of cost and performance.
The Data Pipeline backup is unpredictable, will very likely cost more, and operationally it is much less reliable. Plus, getting a consistent snapshot (i.e. point-in-time) requires stopping the world. Speaking from experience, I don't recommend using Data Pipeline for backing up DynamoDB tables!
Regarding how long it takes to take a backup, that depends on a number of factors but mostly on the size of the table and the provisioned capacity you're willing to throw at it, as well as the size of the EMR cluster you're willing to work with. So, it could take anywhere from a minute to several hours.
Restoring time also depends on pretty much the same variables: provisioned capacity and total size. And it can also take anywhere from a minute to many hours.
Point-in-time backups offer consistent, predictable and, most importantly, reliable performance regardless of the size of the table: use that!
And if you're just interested in dumping the data from the table (i.e. not necessarily the restore part), use the new export to S3.
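For what it's worth, both suggestions (point-in-time recovery and the native export to S3) can be driven from the API. Here is a minimal boto3 sketch; the region, table name, table ARN and bucket are placeholders.

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")            # example region

TABLE_NAME = "my-table"                                                  # placeholder
TABLE_ARN = "arn:aws:dynamodb:us-east-1:123456789012:table/my-table"    # placeholder
BUCKET = "my-dynamodb-exports"                                           # placeholder

# 1) Turn on continuous backups / point-in-time recovery for the table.
dynamodb.update_continuous_backups(
    TableName=TABLE_NAME,
    PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
)

# 2) Kick off a native export to S3 (requires PITR from step 1; it does not
#    consume the table's read capacity and needs no EMR cluster).
export = dynamodb.export_table_to_point_in_time(
    TableArn=TABLE_ARN,
    S3Bucket=BUCKET,
    ExportFormat="DYNAMODB_JSON",
)
print("export started:", export["ExportDescription"]["ExportArn"])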

On Google Compute Engine (GCE), where are snapshots stored?

I've made two snapshots using the GCE console. I can see them there on the console but cannot find them on my disks. Where are they stored? If something should corrupt one of my persistent disks, will the snapshots still be available? If they're not stored on the persistent disk, will I be charged extra for snapshot storage?
GCE has added a new level of abstraction: disks are separate from the VM instance. This allows you to attach a disk to several instances, or restore snapshots to other VMs.
If your VM or disk becomes corrupt, the snapshots are safely stored elsewhere. As for additional costs: keep in mind that snapshots store only the data that changed since the last snapshot, so the space needed for 7 snapshots is often not more than 30% more than a single snapshot. You will be charged for the space they use, but the costs are quite low in my experience (I was charged $0.09 for a 3.5 GB snapshot over one month).
The snapshots are stored separately on Google's servers, but are not attached to or part of your VM. You can create a new disk from an existing snapshot, but Google manages the internal storage and format of the snapshots.
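If you ever want to script this instead of using the console, here is a minimal sketch using the google-api-python-client Compute Engine v1 API. The project, zone, disk and snapshot names are placeholders, and it assumes application default credentials are configured.

from googleapiclient import discovery   # pip install google-api-python-client

compute = discovery.build("compute", "v1")   # uses application default credentials

PROJECT = "my-project"        # placeholder
ZONE = "us-central1-a"        # placeholder
DISK = "my-persistent-disk"   # placeholder

# Create a snapshot of a persistent disk; the snapshot is kept in Google's
# snapshot storage, not on the disk itself.
op = compute.disks().createSnapshot(
    project=PROJECT,
    zone=ZONE,
    disk=DISK,
    body={"name": "my-disk-snapshot-1"},
).execute()
print("operation:", op["name"])

# Snapshots are a project-level resource, independent of any one disk:
for snap in compute.snapshots().list(project=PROJECT).execute().get("items", []):
    print(snap["name"], snap["status"], snap.get("storageBytes"))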

EBS Volume from Ubuntu to RedHat

I would like to use an EBS volume, with data on it that I've been working with from an Ubuntu AMI, in a RedHat 6 AMI. The issue I'm having is that RedHat says the volume does not have a valid partition table. This is the fdisk output for the unmounted volume:
Disk /dev/xvdk: 901.9 GB, 901875499008 bytes
255 heads, 63 sectors/track, 109646 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/xvdk doesn't contain a valid partition table
Interestingly, the volume isn't actually 901.9 GB but 300 GB; I don't know if that means anything. I am very concerned about accidentally erasing the data on the volume. Can anyone give me some pointers for formatting the volume for RedHat without deleting its contents?
I also just checked that the volume works in my Ubuntu instance and it definitely does.
I'm not able to advise on the partition issue as such, other than stating that you definitely neither need nor want to format it, because formatting is indeed a (potentially) destructive operation. My best guess would be that RedHat isn't able to identify the file system currently in use on the EBS volume, which must be advertised in some way.
However, to ease experimentation and gain some peace of mind, you should get acquainted with one of the major Amazon EBS features, namely the ability to create point-in-time snapshots of volumes, which are persisted to Amazon S3:
These snapshots can be used as the starting point for new Amazon EBS
volumes, and protect data for long-term durability. The same snapshot
can be used to instantiate as many volumes as you wish.
This is detailed further down in section Amazon EBS Snapshots:
Snapshots can also be used to instantiate multiple new volumes, expand
the size of a volume or move volumes across Availability Zones. When a
new volume is created, there is the option to create it based on an
existing Amazon S3 snapshot. In that scenario, the new volume begins
as an exact replica of the original volume. [...] [emphasis mine]
Therefore you can (and actually should) always start experiments or configuration changes like the one you are about to perform by at least snapshotting the volume (which will allow you to create a new one from that point in time in case things go bad) or creating a new volume from that snapshot immediately for the specific task at hand.
You can create snapshots and new volumes from snapshots via the AWS Management Console; as usual, there are APIs available as well for automation purposes (see API and Command Overview), and Creating an Amazon EBS Snapshot for details.
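For example, a minimal boto3 sketch of the "snapshot first, experiment on a copy" workflow could look like this; the region, volume ID and Availability Zone are placeholders, and you would attach and mount the new volume on your RedHat instance afterwards.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # example region

SOURCE_VOLUME = "vol-0123456789abcdef0"   # placeholder: the volume with your data
AZ = "us-east-1a"                         # placeholder: AZ of the target instance

# 1) Snapshot the original volume so nothing you try can lose the data.
snap = ec2.create_snapshot(VolumeId=SOURCE_VOLUME,
                           Description="before RedHat experiments")
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# 2) Create a throwaway copy of the volume from that snapshot and experiment
#    on the copy instead of the original.
copy = ec2.create_volume(SnapshotId=snap["SnapshotId"], AvailabilityZone=AZ)
print("experiment on", copy["VolumeId"], "- the original volume stays untouched")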
Good luck!

Planning the development of a scalable web application

We have created a product that will potentially generate tons of requests for a data file that resides on our server. Currently we have a shared hosting server that runs a PHP script to query the DB and generate the data file for each user request. This is not efficient and, while it has not been a problem so far, we want to move to a more scalable system, so we're looking into EC2. Our main concerns are being able to handle high amounts of traffic when they occur, and providing low latency to users downloading the data files.
I'm not 100% sure on how this is all going to work yet but this is the idea:
We use an EC2 instance to host our admin panel and to generate the files that are being served to app users. When any admin makes a change that affects these data files (which are downloaded by users), we push a copy to S3 and serve it through CloudFront. The idea here is to get data cached and waiting on S3 so we can keep our compute times low, and to use CloudFront to get low latency for all users requesting the files.
I am still learning the system and wanted to know if anyone had any feedback on this idea or insight into how it all might work. I'm also curious about the purpose of projects like Cassandra. My understanding is that simply putting our application on EC2 servers makes it scalable by the nature of the servers. Is Cassandra just about keeping resource usage low, or is there a reason to use a system like this even when on EC2?
CloudFront: http://aws.amazon.com/cloudfront/
EC2: http://aws.amazon.com/ec2/
Cassandra: http://cassandra.apache.org/
Cassandra is a non-relational database engine, and if this is what you need, you should first evaluate Amazon's SimpleDB: a non-relational database engine built on top of S3.
If the file only needs to be updated based on time (daily, hourly, ...), then this seems like a reasonable solution. But you may consider placing a load balancer in front of two EC2 instances, each running a copy of your application. This would make it easier to scale later and safer if one instance fails.
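To make the "push the generated file to S3, serve it through CloudFront" step concrete, a minimal boto3 sketch might look like the following; the bucket, key and cache lifetime are placeholders, and CloudFront would be configured separately with the bucket as its origin.

import boto3

s3 = boto3.client("s3")

BUCKET = "my-datafile-bucket"   # placeholder: the CloudFront origin bucket
KEY = "feeds/latest.json"       # placeholder: the object users download

# Re-upload the file whenever an admin change regenerates it. The Cache-Control
# header tells CloudFront (and browsers) how long they may serve the cached copy
# before checking S3 again.
s3.upload_file(
    "/tmp/latest.json", BUCKET, KEY,
    ExtraArgs={
        "ContentType": "application/json",
        "CacheControl": "max-age=300",   # assumed 5-minute cache lifetime
    },
)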
Some other services you should read up on:
http://aws.amazon.com/elasticloadbalancing/ -- Amazon's load balancer solution.
http://aws.amazon.com/sqs/ -- Used to pass messages between systems in your DA (distributed architecture), for example if you wanted the systems that create the data file to be separate from the ones hosting the site (see the sketch after this list).
http://aws.amazon.com/autoscaling/ -- Allows you to adjust the number of instances online based on traffic.
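As a rough illustration of that SQS decoupling, assuming boto3 and a placeholder queue URL: the admin-panel instance enqueues "regenerate" jobs and a separate worker instance picks them up.

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/regenerate-jobs"  # placeholder

# Producer side (admin panel): ask for a data file to be regenerated.
sqs.send_message(QueueUrl=QUEUE_URL, MessageBody="regenerate:feeds/latest.json")

# Worker side (separate instance): poll for jobs, process them, then delete them.
resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                           WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    print("processing", msg["Body"])   # regenerate the file and upload to S3 here
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])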
Make sure to have a good backup process with EC2: snapshot your OS drive often and place any volatile data (e.g. database files) on an EBS volume. EC2 doesn't fail often, but when it does you don't have access to the hardware, and if you have an up-to-date snapshot you can just kick a new instance online.
Depending on the datasets, Cassandra can also significantly improve response times for queries.
There is an excellent explanation of the data structure used in NoSQL solutions that may help you decide whether this is an appropriate fit:
WTF is a Super Column

Offsite backups - possible with large amounts of code/source images etc?

The biggest hurdle I have in developing an effective backup strategy is being able to do some sort of offsite backup. Unfortunately, this can only be done by uploading data to the offsite location, but my cable internet's upload speed makes this prohibitive.
Has anyone here managed to do offsite backups of large libraries of source code?
This is only relevant to home users, not the workplace, where budgets may open other doors.
EDIT: I am using Windows Vista (so 'nix solutions aren't relevant).
Thanks
I don't think your connection's upload speed will be as prohibitive as you think. Just make sure you look for a solution where your changes can be sent as diffs. Even if your initial sync takes days, daily changes would likely be more manageable.
Knowing a few more specifics about how much data you are talking about, and exactly how slow your connection is, would, I think, allow the community to make more specific suggestions.
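As a rough illustration of the "send only the diffs" idea, here is a minimal incremental sync to S3 using boto3. The bucket and source directory are placeholders, and it compares a local MD5 against the S3 ETag, which only matches the MD5 for single-part uploads, so treat it as a sketch rather than a production tool.

import hashlib
import pathlib
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-offsite-backup"          # placeholder
SOURCE = pathlib.Path("C:/projects")  # placeholder: the code/image library

def md5_of(path: pathlib.Path) -> str:
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

for path in SOURCE.rglob("*"):
    if not path.is_file():
        continue
    key = path.relative_to(SOURCE).as_posix()
    try:
        remote_etag = s3.head_object(Bucket=BUCKET, Key=key)["ETag"].strip('"')
    except ClientError:
        remote_etag = None                       # object doesn't exist yet
    if remote_etag != md5_of(path):
        s3.upload_file(str(path), BUCKET, key)   # upload only new/changed files
        print("uploaded", key)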
Services like Mozy allow you to back up large amounts of data offsite.
They upload slowly in the background, and getting the initial sync to the servers can take a while depending on your speed and amount of data, but after that they use efficient diffs to keep the stored data in sync.
The pricing is very home-friendly too.
I think you have to define "backup" and what's acceptable to you.
At my house, I have a hot backup of our repositories: I poll svn once an hour over the VPN and it pulls down any check-ins. This is just to catch check-ins that are not captured every 24 hours by the normal backup. I also send a full backup every 2 days through the pipe, outside of the normal 3-tier backup we do at the office. Our current repository is 2 GB zipped at max compression; that takes 34 hrs at 12 k/s and 17 hrs at 24 k/s. You did not say the speed of your connection, so it's hard to judge if that's workable.
If this isn't viable, you might want to invest in a couple of 2.5" USB drives and rotate them offsite to a safety deposit box at the bank. This used to be my responsibility, but I lacked the discipline to do it consistently each week to ensure some safety net. In the end it was just easier to live with uploading the data to an FTP site at my house.