EC2 snapshots vs. bundled instances

It's my understanding that EC2 snapshots are incremental in nature, so snapshot B contains only the blocks that changed since snapshot A. If you then delete snapshot A, the data that snapshot B still needs is reallocated to snapshot B in Amazon S3, so you are left with a complete snapshot. This leads me to believe that keeping daily snapshots A through Z, for example, isn't prohibitively expensive: in storage cost it is basically the same as one full snapshot plus the accumulated changes.
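To make that cost intuition concrete, here is a quick back-of-the-envelope sketch under assumed numbers (a 50 GB volume with 1 GB of data changing per day, and 26 daily snapshots A..Z - all of these figures are hypothetical):

    VOLUME_GB = 50      # assumed volume size
    CHURN_GB = 1        # assumed data changed per day
    SNAPSHOTS = 26      # daily snapshots A..Z

    naive = SNAPSHOTS * VOLUME_GB                        # if each were a full copy
    incremental = VOLUME_GB + (SNAPSHOTS - 1) * CHURN_GB # incremental model

    print(f"full copies: {naive} GB, incremental: {incremental} GB")
    # full copies: 1300 GB, incremental: 75 GB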
What I really want is to back up my snapshots to a bucket in Amazon S3, so that if an entire EC2 region is having some problems --ahem cough, cough-- the snapshot can be moved into another region and launched there as a backup instance.
However, it seems you can only bundle an instance and then upload a bundled instance to S3, not a snapshot.
The bundle is the entire instance, correct? If so, are historical bundled instances significantly more costly in practice than snapshots?

I use an instance store AMI and store my changing data on EBS volumes using the XFS filesystem. This means I can freeze the filesystems, create a snapshot and unfreeze them.
My volumes are 1GB (although mostly empty) and the storage cost is minuscule.
I don't know how an EBS-backed AMI would work with this, but I can't see why it would be any different. Note, however, that you need to bundle an instance in order to start it. Perhaps you could just snapshot everything as a backup and only bundle when required.
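A minimal sketch of that freeze/snapshot/unfreeze cycle, assuming a hypothetical XFS mount point /data on volume vol-12345678 (both are placeholders; this shells out to xfs_freeze and uses the boto3 EC2 API):

    import subprocess
    import boto3

    MOUNT_POINT = "/data"          # hypothetical XFS mount point
    VOLUME_ID = "vol-12345678"     # hypothetical EBS volume ID

    ec2 = boto3.client("ec2")

    # Freeze the filesystem so the snapshot is consistent.
    subprocess.run(["xfs_freeze", "-f", MOUNT_POINT], check=True)
    try:
        # create_snapshot returns almost immediately; the copy proceeds
        # in the background, so the freeze window stays short.
        snap = ec2.create_snapshot(
            VolumeId=VOLUME_ID,
            Description="daily backup",
        )
        print("started snapshot", snap["SnapshotId"])
    finally:
        # Unfreeze even if the snapshot call fails.
        subprocess.run(["xfs_freeze", "-u", MOUNT_POINT], check=True)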

EBS snapshots: exporting only delta

Can I somehow export to S3 only an EBS snapshot's delta (i.e. the changes since the previous snapshot)?
I'm trying to export an EC2 instance to S3, and exporting the full AMI/instance volumes every time looks expensive to me.
If I don't want to export the AMI way, I have to take an EBS snapshot of the instance volume and then:
1. make a temporary volume from it
2. attach the temporary volume to another instance/agent and then copy its contents to an S3 bucket
What I want is to avoid step 1 - i.e. to somehow read the changed blocks directly from the internal EBS snapshot structure and push them to S3.
I've heard that EBS snapshots are already stored in S3 (just not visibly to users). Does Amazon really not provide any public API to copy them 1:1 into a user's bucket, or to make them visible, without restoring the full history from the first full snapshot?
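For reference, a rough boto3 sketch of the workaround described above (snapshot, temporary volume, attach). The volume/instance IDs, availability zone, and device name are placeholders, and the actual copy to S3 still has to happen on the attached instance:

    import boto3

    ec2 = boto3.client("ec2")

    # Snapshot the instance's volume (placeholder volume ID).
    snap = ec2.create_snapshot(VolumeId="vol-12345678")
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # Step 1: create a temporary volume from the snapshot.
    vol = ec2.create_volume(
        SnapshotId=snap["SnapshotId"],
        AvailabilityZone="us-east-1a",   # placeholder AZ
    )
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

    # Step 2: attach it to the copy agent (placeholder instance ID);
    # the agent then reads the device and uploads the contents to S3.
    ec2.attach_volume(
        VolumeId=vol["VolumeId"],
        InstanceId="i-0abcdef0",
        Device="/dev/sdf",
    )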

On Google Compute Engine (GCE), where are snapshots stored?

I've made two snapshots using the GCE console. I can see them there on the console but cannot find them on my disks. Where are they stored? If something should corrupt one of my persistent disks, will the snapshots still be available? If they're not stored on the persistent disk, will I be charged extra for snapshot storage?
GCE has added a new level of abstraction: disks are separate from the VM instance. This allows you to attach a disk to several instances or restore snapshots to other VMs.
In case your VM or disk becomes corrupt, the snapshots are safely stored elsewhere. As for additional costs, keep in mind that snapshots store only the data that changed since the last snapshot, so the space needed for 7 snapshots is often not more than 30% more than one snapshot. You will be charged for the space they use, but the costs are quite low from what I observed (I was charged $0.09 for a 3.5 GB snapshot over one month).
The snapshots are stored separately on Google's servers, but are not attached to or part of your VM. You can create a new disk from an existing snapshot, but Google manages the internal storage and format of the snapshots.
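If you want to see this programmatically, here is a rough sketch using the google-cloud-compute Python client (project ID, zone, and resource names are all placeholders): snapshots are listed as project-level resources independent of any disk, and a new disk can be created from one.

    from google.cloud import compute_v1

    PROJECT = "my-project"        # placeholder project ID

    # Snapshots live at the project level, not on any particular disk.
    snap_client = compute_v1.SnapshotsClient()
    for snap in snap_client.list(project=PROJECT):
        print(snap.name, snap.storage_bytes)

    # Restore: create a brand-new disk from an existing snapshot.
    disk = compute_v1.Disk()
    disk.name = "restored-disk"   # placeholder disk name
    disk.source_snapshot = f"projects/{PROJECT}/global/snapshots/my-snapshot"

    disk_client = compute_v1.DisksClient()
    operation = disk_client.insert(
        project=PROJECT,
        zone="us-central1-a",     # placeholder zone
        disk_resource=disk,
    )
    operation.result()            # wait for the disk to be created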

Using Glacier as back end for web crawling

I will be doing a crawl of several million URLs from EC2 over a few months and I am thinking about where I ought to store this data. My eventual goal is to analyze it, but the analysis might not be immediate (even though I would like to crawl it now for other reasons) and I may want to eventually transfer a copy of the data out for storage on a local device I have. I estimate the data will be around 5TB.
My question: I am considering using Glacier for this, with the idea that I will run a multithreaded crawler that stores the crawled pages locally (on EBS) and then use a separate thread that combines, compresses, and shuttles that data to Glacier. I know transfer speeds to Glacier are not necessarily good, but since there is no online element of this process, it seems feasible (especially since I could always increase the size of my local EBS volume in case I'm crawling faster than I can push to Glacier).
Is there a flaw in my approach or can anyone suggest a more cost-effective, reliable way to do this?
Thanks!
Redshift seems more relevant than Glacier. Glacier is all about freeze / thaw and you'll have to move the data prior to doing any analysis.
Redshift is more about adding the data into a large, inexpensive, data warehouse and running queries over it.
Another option is to store the data in EBS and leave it there. When you're done with your crawling, take a snapshot to push the volume into S3, then decommission the volume and EC2 instance. When you're ready to do the analysis, just create a volume from the snapshot.
The upside of this approach is that it's all file access (no formal data store) which may be easier for you.
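A rough boto3 sketch of that snapshot-and-decommission flow (the volume/instance IDs and the availability zone are placeholders):

    import boto3

    ec2 = boto3.client("ec2")

    # Crawl finished: push the data volume into S3-backed snapshot
    # storage (placeholder IDs throughout).
    snap = ec2.create_snapshot(VolumeId="vol-12345678",
                               Description="crawl data archive")
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # Decommission the crawler and its volume; the snapshot survives.
    ec2.terminate_instances(InstanceIds=["i-0abcdef0"])
    ec2.get_waiter("instance_terminated").wait(InstanceIds=["i-0abcdef0"])
    ec2.delete_volume(VolumeId="vol-12345678")

    # Months later, recreate a volume from the snapshot for analysis.
    vol = ec2.create_volume(SnapshotId=snap["SnapshotId"],
                            AvailabilityZone="us-east-1a")  # placeholder AZ
    print("restored volume", vol["VolumeId"])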
Personally, I would probably push the data into Redshift. :-)
--
Chris
If your analysis will not be immediate, you can adopt one of the following two approaches:
Approach 1) Amazon EC2 crawler -> store in EBS disks -> move them frequently to Amazon S3 -> archive regularly to Glacier (see the lifecycle sketch below). You can keep your last X days of data in Amazon S3 and use it for ad hoc processing as well.
Approach 2) Amazon EC2 crawler -> store in EBS disks -> move them frequently to Amazon Glacier. Retrieve when needed and do the processing on EMR or other processing tools.
If you need frequent analysis:
Approach 3) Amazon EC2 crawler -> store in EBS disks -> move them frequently to Amazon S3 -> analyze with EMR or other tools, store the processed results in S3/DB/MPP, and move the raw files to Glacier.
Approach 4) If your data is structured: Amazon EC2 crawler -> store in EBS disks -> move them to Amazon Redshift, and move the raw files to Glacier.
Additional tips:
If you can retrieve the data again (from the source), then you can use ephemeral disks for your crawlers instead of EBS.
Amazon has introduced the Data Pipeline service; check whether it fits your needs for data movement.
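The "archive regularly to Glacier" step in Approach 1 doesn't need a custom mover; an S3 lifecycle rule can do it. A minimal boto3 sketch, assuming a hypothetical bucket crawl-data and key prefix pages/:

    import boto3

    s3 = boto3.client("s3")

    # Transition objects under pages/ to Glacier 30 days after creation.
    s3.put_bucket_lifecycle_configuration(
        Bucket="crawl-data",              # placeholder bucket name
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "archive-crawl-pages",
                    "Status": "Enabled",
                    "Filter": {"Prefix": "pages/"},
                    "Transitions": [
                        {"Days": 30, "StorageClass": "GLACIER"},
                    ],
                }
            ]
        },
    )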

Query about EBS Backed Instances + Backup on S3 + Snapshots

I've spent a number of days looking into putting up two Windows Servers on Amazon, a domain controller and a Remote Desktop Services server, but there are a few questions that I can't find detailed answers for - or any answers at all:
1) When you have an EBS-backed instance, I assume this means that all files (OS/applications/pagefile etc.) are stored on EBS? Physically in the datacentre, let's assume I have 50 gig of OS files/application data etc. - is it all stored on just one SAN-type device? What happens if that device blows up, or say that particular data centre gets destroyed? Is the data elsewhere? What is the probability that your entire EBS volume can just disappear?
2) As I understand it, you can back up your EBS instance to S3 with snapshotting. I assume you can choose how often to snapshot (say, daily?). In my above scenario, if I have 50 gig of files and snapshot once a day, over 7 days will my S3 storage be 350 gig, or will it be 50 gig plus the incremental changes I have made over the week?
3) I remember reading somewhere that the instance has to go offline to snapshot. If that is the case, does it do this by shutting down the guest OS, snapshotting, then booting up, or does it just detach the data and prevent you from connecting while it snapshots, then bring it back to the exact state it was in before the snapshot?
4) I understand the concept of paying per month per gig of space, but I am concerned about the $0.11 per 1 million I/O requests. How does that work when I am running a Windows server? I have no idea how many I/O requests a server makes to its disks - I am assuming a lot, since the entire VM is being stored on an EBS volume. Is running a server on standard EBS going to slow it down radically?
5) Are people using snapshots to S3 as their main backup, or are people running other types of backup for data?
Sorry for the noob questions - I'd appreciate any partial answers, full answers, or advice anyone could offer me. Thanks in advance!
1) Amazon is fuzzy on this. They say that data is replicated within the AZ it belongs to, and that if you have less than 20GB of data changed since the last snapshot, your annual failure rate is ~0.1-0.4%.
2) Snapshots are triggered manually, and are done incrementally.
3) It depends on your filesystem. For example, on a Linux box with an XFS volume you can freeze IO to the volume, take your snapshot (initiating it takes only a second or so), and then unfreeze. If you take a snapshot without doing something similar, you run the risk of the data being in an inconsistent state.
4) I run all my instances on EBS. You probably wouldn't want your pagefile on EBS, though; it would make more sense to use instance storage for that. The I/O count depends heavily on your workload - an application server does far fewer IOPS than a database server, for example. You're unlikely to pay more than a few dollars a month per volume unless you're running particularly IO-heavy operations (a rough way to check is shown below).
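If you want a number rather than a guess, EBS publishes per-volume I/O metrics to CloudWatch. A rough boto3 sketch (the volume ID is a placeholder) that sums a day's worth of read/write operations, which is what the per-million-requests charge is based on:

    from datetime import datetime, timedelta

    import boto3

    cw = boto3.client("cloudwatch")
    VOLUME_ID = "vol-12345678"   # placeholder volume ID

    total = 0
    for metric in ("VolumeReadOps", "VolumeWriteOps"):
        stats = cw.get_metric_statistics(
            Namespace="AWS/EBS",
            MetricName=metric,
            Dimensions=[{"Name": "VolumeId", "Value": VOLUME_ID}],
            StartTime=datetime.utcnow() - timedelta(days=1),
            EndTime=datetime.utcnow(),
            Period=86400,          # one datapoint covering the whole day
            Statistics=["Sum"],
        )
        total += sum(dp["Sum"] for dp in stats["Datapoints"])

    print(f"~{total:,.0f} I/O requests in the last 24h")
    print(f"~${total / 1_000_000 * 0.11:.4f} at $0.11 per million")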
5) Personally, I don't care about the installed software/configuration (I have AMIs with all of that set up so I can restore it in minutes); I only care about the data, and I back that up separately (S3 & Glacier). That's partly because I was bitten by a bug EBS had about a year ago where they lost some snapshots.
You can also use multiple strategies, as Fantius commented. For example, on the MongoDB servers I run, the boot volume is small (and never snapshotted or backed up, since it can be restored automatically from an AMI), with a separate data volume containing the actual MongoDB data. The MongoDB volume is snapshotted, and dumps are also stored on S3. Snapshots are an efficient way of creating backups (since you're only storing incremental changes), but you can't transfer them out of your EC2 region, whereas a tarball on S3 can easily be copied anywhere.
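A minimal sketch of that dump-to-S3 leg, assuming a hypothetical my-backups bucket (this shells out to mongodump and uploads the archive with boto3):

    import subprocess
    from datetime import datetime

    import boto3

    # Dump the database to a compressed archive (placeholder path).
    archive = f"/tmp/mongo-{datetime.utcnow():%Y%m%d}.gz"
    subprocess.run(
        ["mongodump", f"--archive={archive}", "--gzip"],
        check=True,
    )

    # Upload the dump to S3, from where it can be copied to any region.
    s3 = boto3.client("s3")
    s3.upload_file(archive, "my-backups",            # placeholder bucket
                   f"mongodb/{archive.split('/')[-1]}")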

How to transfer an image to an Amazon EBS volume for EC2 usage?

I have a local filesystem image that I want to transfer to an Amazon EBS volume and boot as an EC2 micro instance. The instance should have the EBS volume as its root filesystem, and I will be booting the instance with the Amazon PV-GRUB "kernels".
I have used ec2-bundle-image to create a bundle from the image, and I have used ec2-upload-bundle to upload the bundle to Amazon S3. However, now that I'd like to use ec2-register to register the image for use, I can't seem to find a way to make the uploaded bundle the EBS root image. It seems that an EBS snapshot is required for the root device, and I have no idea how I would convert the bundle into an EBS snapshot.
I do realize that I could probably do this by starting a "common" instance, attaching an EBS volume to it, and then just using 'scp' or something to transfer the image directly to the EBS volume - but is this really the only way? Also, I have no desire to use EBS snapshots as such; I'd rather have none. Can I create a micro instance with just the EBS volume as root, without an EBS snapshot?
Did not find any way to do this :(
So I created a new instance, created a new EBS volume, attached it to the instance, and transferred the data via ssh.
Then, to be able to boot from the volume, I still needed to create a snapshot of it and then create an AMI that uses the snapshot - and as a result, I get another EBS volume that is created from the snapshot and becomes the running instance's root volume.
Now, if I want to minimize expenses, I can remove the created snapshot and the original EBS volume.
NOTE: If the only copy of the EBS volume is the one serving as the root volume of an instance, it may be deleted when the instance is terminated. This setting can be changed with the command-line tools - or the instance may simply be "stopped" instead of "terminated", and then a snapshot can be generated from the EBS volume. After taking a snapshot, the instance can of course be terminated.
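For what it's worth, the snapshot-and-register step can be scripted. A rough boto3 sketch, where the volume ID, image name, and the aki- kernel ID are placeholders (the PV-GRUB kernel ID varies by region):

    import boto3

    ec2 = boto3.client("ec2")

    # Snapshot the volume that now holds the transferred image.
    snap = ec2.create_snapshot(VolumeId="vol-12345678")
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    # Register an AMI whose root device is built from that snapshot,
    # booting via a PV-GRUB kernel image (placeholder AKI ID).
    image = ec2.register_image(
        Name="my-ebs-image",
        Architecture="x86_64",
        KernelId="aki-XXXXXXXX",        # region-specific PV-GRUB kernel
        RootDeviceName="/dev/sda1",
        BlockDeviceMappings=[
            {
                "DeviceName": "/dev/sda1",
                "Ebs": {"SnapshotId": snap["SnapshotId"]},
            }
        ],
    )
    print("registered", image["ImageId"])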
Yes, there is no way to upload an EBS image via S3, and using an instance to which you attach an additional volume is the best way. If you attach that volume after the instance is started, it will also not be deleted on termination.
Note: don't worry too much about volume -> snapshot -> volume, as those share the same data blocks (as long as you don't modify them). The storage cost is not tripled, only about 1.1 times one volume. EBS snapshots and image creation are quite handy in that regard. Don't hesitate to use multiple snapshots. The less you "work" in a snapshot, the smaller its block usage later on if you start it as an AMI.