How to disregard checkpoints in Hyper-V?

I started using Microsoft Hyper-V for testing before deployment to end users and/or for troubleshooting. For this purpose, I have a rather fresh OS image from my employer; let's call this checkpoint root. If I mess this one up, I cannot easily recover it.
Now, for a complex test, I prepare the VM with prerequisites and create a checkpoint test-start. I can revert to this whenever I need to start a new run. So far everything is perfect.
The question: How can I simply disregard this test-start checkpoint, without merging it into my root, once I have finished testing? If test-start has child checkpoints, I would like to disregard them as well. In summary, I'd like to do a "Delete checkpoint subtree", but without the merging that implicitly happens when doing so.
I have been searching quite a bit on the web, of course reading the MS docs as well, but couldn't find that kind of information. I hope this question is not too stup... ehm, trivial... yet simple to answer for somebody.

I believe that by deleting the test-start checkpoint and discarding the changes, you mean to restore the virtual machine to its original state, i.e. the root checkpoint. That's exactly what needs to be done: revert the virtual machine to the root checkpoint and delete the test-start subtree. By reverting to the root checkpoint first, test-start and its subtree belong to a point in time after the current running state, so the corresponding AVHDX files are deleted immediately, without merging.
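For anyone who prefers to script this instead of clicking through Hyper-V Manager, here is a minimal sketch of the same procedure, driven from Python via the standard Hyper-V PowerShell cmdlets. The VM name is a placeholder; the checkpoint names are the ones from the question.

    # A minimal sketch, assuming Windows with the Hyper-V PowerShell module and
    # a VM actually named "TestVM" (placeholder). Revert to the root checkpoint
    # first, then delete the test-start subtree; no AVHDX merge should occur.
    import subprocess

    VM_NAME = "TestVM"            # hypothetical VM name
    ROOT_CHECKPOINT = "root"      # the pristine baseline from the question
    TEST_CHECKPOINT = "test-start"

    def powershell(command: str) -> None:
        """Run one PowerShell command and raise if it fails."""
        subprocess.run(["powershell", "-NoProfile", "-Command", command], check=True)

    # Step 1: revert, so test-start and its children lie "after" the running state.
    powershell(f"Restore-VMSnapshot -VMName '{VM_NAME}' -Name '{ROOT_CHECKPOINT}' -Confirm:$false")

    # Step 2: delete the whole subtree; the AVHDX files are simply removed, not merged.
    powershell(f"Remove-VMSnapshot -VMName '{VM_NAME}' -Name '{TEST_CHECKPOINT}' -IncludeAllChildSnapshots")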

I found some other MS documentation. It really is that simple: when deleting a checkpoint in another branch, or ahead in time of the currently running VM, no merging happens, and "deleting" the checkpoint actually is a pure delete operation.

Related

Does TensorFlow support saving the initial hyper-parameter configuration automatically?

We need to run the designed networks many times for better performance, and it would help to record the experiments we have run. Maybe it would be good if the TensorFlow execution engine could record these hyper-parameter configurations automatically. For example, I currently record them by setting a different name for the log directory, such as:
log_lr_0.001_theta_0.1_alpha_0.1
log_lr_0.01_theta_0.01_alpha_0.02
....
Are there any automatic ways to help with this? In addition, it would be nice if, whenever we start a new TensorFlow training instance, a new port were allocated and a new TensorBoard instance started to show its learning state.
No, TensorFlow doesn't support saving the initial hyper-parameter configuration automatically.
I've faced the same issue as you, and I'm using a tool called Sacred; I hope you find it useful.
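In the meantime, a simple do-it-yourself sketch along the lines of the log-directory naming from the question (nothing here is a built-in TensorFlow feature; the helper name and paths are made up):

    # Minimal sketch: encode the hyper-parameters in the run directory name and
    # also dump them to a JSON file next to the event files, so each run stays
    # self-describing when you point TensorBoard at the parent directory.
    import json
    import os

    def make_run_dir(base_dir, **hparams):
        """Create a per-run log directory whose name encodes the hyper-parameters."""
        name = "_".join("{}_{}".format(key, value) for key, value in sorted(hparams.items()))
        run_dir = os.path.join(base_dir, name)
        os.makedirs(run_dir, exist_ok=True)
        # Keep a machine-readable copy of the configuration for later analysis.
        with open(os.path.join(run_dir, "hparams.json"), "w") as f:
            json.dump(hparams, f, indent=2, sort_keys=True)
        return run_dir

    log_dir = make_run_dir("logs", lr=0.001, theta=0.1, alpha=0.1)
    # e.g. "logs/alpha_0.1_lr_0.001_theta_0.1"; hand this to your summary writer
    # and run `tensorboard --logdir logs` to compare runs side by side.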

Sharing tensorflow model between processes

I have several processes running on a server, one of which is training a model in tensorflow. Periodically, I want the trainer to send the current model to the other processes. The way I do this now is with the usual Saver class that can save to and restore from disk.
However, I think this form of IPC is rather inefficient, and it is potentially causing filesystem lockups on the server. If there were a way to serialize the variables into some blob I could send over a zmq broadcast pipe, that would be ideal, but I haven't found this in the docs.
Alternatively, distributed tensorflow is probably up to the task, but I don't think I need something so complicated.
You could pre-share the architecture, then use tf.get_collection(tf.GraphKeys.VARIABLES) every however many steps you like, run it to get the values, and then use variable.assign at the other end to load the values into the appropriate variables.
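A hedged sketch of that idea, assuming TF 1.x, a zmq PUB/SUB channel, and that both processes have already built an identical graph (socket addresses are placeholders):

    # Minimal sketch: ship variable values as pickled NumPy arrays over zmq
    # instead of going through Saver and the filesystem. Assumes both sides
    # built the same graph, so the variable collections line up one-to-one.
    import pickle
    import zmq
    import tensorflow as tf

    def publish_variables(sess, socket):
        """Trainer side: read every variable's current value and broadcast it."""
        variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES)
        socket.send(pickle.dumps(sess.run(variables)))   # list of NumPy arrays

    def receive_variables(sess, socket):
        """Consumer side: load the received values into the matching variables."""
        values = pickle.loads(socket.recv())
        variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES)
        for variable, value in zip(variables, values):
            variable.load(value, sess)                   # feeds the value; adds no new ops

    # Wiring (placeholder addresses):
    #   trainer:  pub = zmq.Context().socket(zmq.PUB); pub.bind("tcp://*:5556")
    #   consumer: sub = zmq.Context().socket(zmq.SUB); sub.connect("tcp://trainer:5556")
    #             sub.setsockopt(zmq.SUBSCRIBE, b"")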

PyMC change backend after sampling

I have been using PyMC in an analysis of some high energy physics data. It has worked to perfection, the analysis is complete, and we are working on the paper.
I have a small problem, however. I ran the sampler with the RAM database backend. The traces have been sitting around in memory in an IPython kernel process for a couple of months now. The problem is that the workstation support staff want to perform a kernel upgrade and reboot that workstation. This will cause me to lose the traces. I would like to keep these traces (as opposed to just generating new ones), since they are what I've made all the plots with. I'd also like to include a portion of the traces (only the parameters of interest) as supplemental material with the publication.
Is it possible to take an existing chain in a pymc.MCMC object created with the RAM backend, change to a different backend, and write out the traces in the chain?
The trace values are stored as NumPy arrays, so you can use numpy.savetxt to send the values of each parameter to a file. (This is what the text backend does under the hood.)
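A minimal sketch of that suggestion (PyMC 2.x with the RAM backend; the parameter names and output paths are placeholders):

    # Minimal sketch: pull each parameter's trace out of the in-memory (RAM)
    # backend as a NumPy array and write it to a text file, one per parameter.
    import os
    import numpy as np

    def dump_traces(mcmc, names, chain=-1, out_dir="traces"):
        """Save the sampled values of the listed parameters from a pymc.MCMC object."""
        if not os.path.isdir(out_dir):
            os.makedirs(out_dir)
        for name in names:
            values = np.asarray(mcmc.trace(name, chain=chain)[:])
            # Flatten any multi-dimensional parameter so savetxt can handle it.
            np.savetxt(os.path.join(out_dir, name + ".txt"),
                       values.reshape(values.shape[0], -1))

    # e.g. dump_traces(M, ["alpha", "beta"])   # 'M' is the long-lived pymc.MCMC object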
While saving your current traces is a good idea, I'd suggest taking the time to make your analysis repeatable before publishing.

Is there a reverse-incremental backup solution with built-in redundancy (e.g. par2)?

I'm setting up a home server primarily for backup use. I have about 90 GB of personal data that must be backed up in the most reliable manner, while still preserving disk space. I want full file history so I can go back to any file at any particular date.
Full weekly backups are not an option because of the size of the data. Instead, I'm looking along the lines of an incremental backup solution. However, I'm aware that a single corruption in a set of incremental backups makes the entire series (beyond a point) unrecoverable. Thus simple incremental backups are not an option.
I've researched a number of solutions to the problem. First, I would use reverse-incremental backups so that the latest version of the files would have the least chance of loss (older files are not as important). Second, I want to protect both the increments and backup with some sort of redundancy. Par2 parity data seems perfect for the job. In short, I'm looking for a backup solution with the following requirements:
Reverse incremental (to save on disk space and prioritize the most recent backup)
File history (kind of a broader category including reverse incremental)
Par2 parity data on increments and backup data
Preserve metadata
Efficient with bandwidth (no copying the entire directory over for each increment). Most incremental backup solutions should work this way.
This would (I believe) ensure file integrity and relatively small backup sizes. I've looked at a number of backup solutions already but they have a number of problems:
Bacula - Simple normal incremental backups
bup - incremental and implements par2 but isn't reverse incremental and doesn't preserve metadata
duplicity - incremental, compressed, and encrypted but isn't reverse incremental
dar - incremental and par2 is easy to add, but isn't reverse incremental and no file history?
rdiff-backup - almost perfect for what I need but it doesn't have par2 support
So far I think that rdiff-backup seems like the best compromise, but it doesn't support par2. I think I can add par2 support for the backup increments easily enough, since they aren't modified once created, but what about the rest of the files? I could generate par2 files recursively for all files in the backup, but this would be slow and inefficient, and I'd have to worry about corruption during a backup and about stale par2 files. In particular, I couldn't tell the difference between a changed file and a corrupt file, and I don't know how to check for such errors or how they would affect the backup history. Does anyone know of a better solution? Is there a better approach to the issue?
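For what it's worth, a hedged sketch of the "par2 on the increments" part of that idea (assumes the par2cmdline tool is on the PATH; the backup path and the 10% redundancy level are illustrative, not anything rdiff-backup does itself):

    # Minimal sketch: walk the rdiff-backup increments directory and create par2
    # recovery data for any increment file that doesn't have it yet. Increments
    # are never rewritten once created, so their .par2 files never go stale.
    import os
    import subprocess

    INCREMENTS_DIR = "/backup/rdiff-backup-data/increments"   # placeholder path

    def protect_increments(root=INCREMENTS_DIR, redundancy=10):
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if name.endswith(".par2"):
                    continue                       # skip recovery files themselves
                path = os.path.join(dirpath, name)
                if os.path.exists(path + ".par2"):
                    continue                       # already protected
                subprocess.check_call(
                    ["par2", "create", "-r%d" % redundancy, path + ".par2", path])

    # Later, integrity checking is the same idea in reverse: par2 verify <file>.par2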
Thanks for reading through my difficulties and for any input you can give me. Any help would be greatly appreciated.
http://www.timedicer.co.uk/index
Uses rdiff-backup as the engine. I've been looking at it, but it requires me to set up a "server" using Linux or a virtual machine.
Personally, I use WinRAR to make pseudo-incremental backups (it actually makes a full backup of recent files) run daily by a scheduled task. It is similarly a "push" backup.
It's not a true incremental (or reverse-incremental) backup, but it saves different versions of files based on when each file was last updated. I mean, it saves the version for today, yesterday and the previous days, even if the file is identical. You can set the archive bit to save space, but I don't bother anymore, as all I back up are small spreadsheets and documents.
RAR has its own parity or recovery record that you can set as a size or percentage. I use 1% (one percent).
It can preserve metadata; I personally skip the high-resolution times.
It can be efficient since it compresses the files.
Then all I have to do is send the file to my backup location. I have it copied to a different drive and to another computer on the network. No need for a true server, just a share. You can't do this for too many computers, though, as Windows workstations have a 10-connection limit.
So my setup, which may fit your purpose, backs up files daily that have been updated in the last 7 days. Then I have another scheduled backup, run once a month (every 30 days), that backs up files updated in the last 90 days.
But I use Windows, so if you're actually setting up a Linux server, you might check out the Time Dicer.
Since nobody was able to answer my question, I'll write a few possible solutions I found while researching the topic. In short, I believe the best solution is rdiff-backup to a ZFS filesystem. Here's why:
ZFS checksums all blocks stored and can easily detect errors.
If you have ZFS set to mirror your data, it can recover the errors by copying from the good copy.
This takes up less space than full backups, even though the data is copied twice.
The odds of an error in both the original and the mirror are tiny.
Personally, I am not using this solution, as ZFS is a little tricky to get working on Linux. Btrfs looks promising but hasn't yet been proven stable by years of use. Instead, I'm going with the cheaper option of simply checking hard drive SMART data. Hard drives should do some error checking/correcting themselves, and by monitoring this data I can see whether that process is working properly. It's not as good as additional filesystem parity, but better than nothing.
A few more notes that might be interesting to people looking into reliable backup development:
par2 seems to be dated and buggy software. zfec seems like a much faster modern alternative. Discussion in bup occurred a while ago: https://groups.google.com/group/bup-list/browse_thread/thread/a61748557087ca07
It's safer to calculate parity data before even writing to disk, i.e. don't write to disk, read it back, and then calculate parity data. Do it from RAM, and check against the original for additional reliability. This might only be possible with zfec, since par2 is too slow.

Why doesn't the Hadoop file system support random I/O?

Distributed file systems like the Google File System and Hadoop's HDFS don't support random I/O.
(They can't modify a file that was written before; only writing and appending are possible.)
Why did they design the file systems like this?
What are the important advantages of this design?
P.S. I know Hadoop will support modifying data that has already been written.
But they say its performance will not be very good. Why?
Hadoop distributes and replicates files. Since the files are replicated, any write operation is going to have to find each replicated section across the network and update the file. This will heavily increase the time for the operation. Updating the file could also push it over the block size, requiring the file to be split into two blocks and the second block to be replicated. I don't know the internals or when/how it would split a block... but it's a potential complication.
What if a job that has already done an update fails or gets killed, and is then re-run? It could update the file multiple times.
The advantage of not updating files in a distributed system is that you don't have to know who else is using the file when you update it, or where its pieces are stored. There are also potential timeouts (a node holding the block is unresponsive), so you might end up with mismatched data (again, I don't know the internals of Hadoop, and an update with a node down might be handled; it's just something I'm brainstorming).
There are a lot of potential issues (a few laid out above) with updating files on HDFS. None of them are insurmountable, but checking for and accounting for them would require a performance hit.
Since HDFS's main purpose is to store data for use in MapReduce, row-level updates aren't that important at this stage.
I think it's because of the block size of the data, and because the whole idea of Hadoop is that you don't move data around; instead, you move the algorithm to the data.
Hadoop is designed for non-realtime batch processing of data. If you're looking at ways of implementing something more like a traditional RDBMS in terms of response time and random access, have a look at HBase, which is built on top of Hadoop.
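To make that concrete, a hedged sketch of row-level random access through HBase, using the third-party happybase client (assumes an HBase Thrift server is reachable; the host, table, and column names are made up):

    # Minimal sketch: unlike HDFS files, HBase rows can be read and overwritten
    # individually, which is the "random access" referred to above.
    import happybase

    connection = happybase.Connection("hbase-thrift-host")   # placeholder host
    table = connection.table("users")                        # placeholder table

    # Random write: update a single row in place.
    table.put(b"user42", {b"info:email": b"user42@example.com"})

    # Random read: fetch just that row back.
    row = table.row(b"user42")
    print(row[b"info:email"])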