Traffic Generator with Checkpoint - gem5

Is there any way to run the Traffic Generator with checkpoints? Basically, my problem is that I am running a trace through my memory model and it is faulting at around 110M instructions. Every time I want to debug, I have to wait until 110M instructions have executed. Is it possible for me to save a checkpoint at, let's say, 100M and then resume from there for quick debugging?
Any guidance will be of great help!
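What I have in mind is roughly the following (an untested sketch: my system and traffic-generator setup is elided, the checkpoint directory is just a placeholder, and the m5.simulate() argument is in ticks rather than instructions):

import m5

# ... build root, the memory system and the traffic generator here ...

CKPT_DIR = "m5out/cpt.before_fault"   # placeholder path
RESTORE = False                       # set True for the quick-debug run

if RESTORE:
    m5.instantiate(CKPT_DIR)          # resume from the saved state
    event = m5.simulate()             # go straight to the faulting region
else:
    m5.instantiate()                  # fresh run
    m5.simulate(100000000000)         # ticks, not instructions; tune to stop just before the fault
    m5.checkpoint(CKPT_DIR)           # snapshot the state here
    event = m5.simulate()             # continue into the faulting region
print("Exiting @ tick %d because %s" % (m5.curTick(), event.getCause()))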

Related

process killed while training

I am working on training a machine learning model for image processing using a GAN algorithm.
It has been done on the TensorFlow backend, and I have split the work across 8 GPUs.
Now when I start my training script it gives the following error:
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
Can anybody crack this?
I have been searching a lot and have found no solution for how to fix this.
All I can find is that it is something depending on a lack of memory... Thanks in advance, and please let me know if anything is ambiguous.
As someone's comment indicates, it looks like you ran out of memory. To add to that, exit code 137 likely means the Out Of Memory (OOM) Killer killed your process. There is a good explanation of how it chooses which process to kill here that I often reference.
On your box you can confirm the OOM Killer's involvement by running dmesg.
To get a sense for how much memory you have available to your processes you can run:
cat /proc/meminfo | grep MemTotal
Try using a smaller batch size; that might resolve the memory issue.
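For example, if you are using Keras-style training, the batch size is just the batch_size argument to fit. The toy model and dataset below are only illustrations of where the knob lives, not your setup:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0

# A smaller batch_size lowers the per-step activation memory on each GPU.
model.fit(x_train, y_train, batch_size=16, epochs=1)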

How to switch CPU models in gem5 after restoring a checkpoint and then observe the difference?

I want to boot the Linux kernel in full system (FS) mode with a lightweight CPU to save time, make a checkpoint after boot finishes, and then restore the checkpoint with a more detailed CPU to study a benchmark, as mentioned at: http://gem5.org/Checkpoints
However, when I tried to use -r 1 --restore-with-cpu=, I could not observe cycle differences between the new and old CPU.
The measure I'm looking at is how cache sizes affect the number of cycles that a benchmark takes to run.
The setup I'm using is described in detail at: Why doesn't the Linux kernel see the cache sizes in the gem5 emulator in full system mode? I'm looking at the cycle counts because I can't see cache sizes directly with the Linux kernel currently.
For example, if I boot the Linux kernel from scratch with the detailed and slow HPI model with command (excerpt):
./build/ARM/gem5.opt --cpu-type=HPI --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024
and then change the cache sizes, the benchmark does get faster as the caches get larger, as expected.
However, if I first boot without --cpu-type=HPI, which uses the faster AtomicSimpleCPU model:
./build/ARM/gem5.opt --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024
and then I create the checkpoint with m5 checkpoint and try to restore with the slower, detailed CPU:
./build/ARM/gem5.opt --restore-with-cpu=HPI -r 1 --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024
then changing the cache sizes makes no difference: I always get the same cycle counts as I do for the AtomicSimpleCPU, indicating that the modified restore was not successful.
The same happens for x86 if I try to switch from AtomicSimpleCPU to DerivO3CPU.
Related old thread on the mailing list: http://thread.gmane.org/gmane.comp.emulators.m5.users/14395
Tested at: fbe63074e3a8128bdbe1a5e8f6509c565a3abbd4
From reading through some of the code, I believe that --restore-with-cpu is specifically for the case when your checkpoint was created using a CPU model that isn't the AtomicCPU; the scripts assume that AtomicCPU was used to create the checkpoint. I think that when restoring, it's important to have the same CPU model the system was checkpointed with; if you give another model with --cpu-type, then it switches to that model after the restore operation has completed.
http://gem5.org/Checkpoints#Sampling has some (small) detail on switching cpu models
First, for your question, I don't see how the cycle count is an indication of the restoration result. The cycles being restored should be the same regardless of which CPU you switch to. Switching does not change the past cycles. When creating a checkpoint, you basically freeze the simulation at that state, and switching the CPU simply changes all the parameters of the CPU while keeping the ticks unchanged. It is like hot-swapping a CPU.
To correctly verify the restoration, you should keep a copy of config.json from before the restoration and compare it with the new one after the restoration. For the x86 case, I could find the string AtomicSimpleCPU there only before the restore.
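For example (a rough sketch; the two m5out directories below are placeholders for your boot run and your restore run), you can just search both files for the CPU model names:

def cpu_models_mentioned(path, models=("AtomicSimpleCPU", "DerivO3CPU", "HPI")):
    # Report which of the given CPU model names appear anywhere in a gem5 config.json.
    with open(path) as f:
        text = f.read()
    return sorted(m for m in models if m in text)

print("before restore:", cpu_models_mentioned("m5out_boot/config.json"))
print("after restore: ", cpu_models_mentioned("m5out_restore/config.json"))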
Furthermore, only --cpu-type determines which CPU is switched to. But that does not make --restore-with-cpu useless. In fact, --restore-with-cpu should only be used when you booted the system with a CPU other than AtomicSimpleCPU. Most people want to boot the system with AtomicSimpleCPU and make a checkpoint, since it is faster. But if you mistakenly booted using DerivO3CPU, then to restore that particular checkpoint you have to set --restore-with-cpu to DerivO3CPU; otherwise it will fail.
--cpu-type= affected the restore, but --restore-with-cpu= did not
I am not sure why that is, but I have empirically verified that if I do:
-r 1 --cpu-type=HPI
then, as expected, the cache size options start to affect cycle counts: larger caches lead to fewer cycles.
Also keep in mind that caches don't affect AtomicSimpleCPU much, and there is not much point in having them.
TODO: so what is the point of --restore-with-cpu= vs --cpu-type if it didn't seem to do anything in my tests?
Except to confuse me, since if --cpu-type != --restore-with-cpu, the cycle count appears under system.switch_cpus.numCycles instead of system.cpu.numCycles.
I believe this is what is going on (yet untested):
switch_cpus contains stats for the CPU you switched to.
When you set --restore-with-cpu= != --cpu-type, it thinks you have already switched CPUs from the start.
--restore-with-cpu has no effect on the initial CPU. It only matters for options that switch the CPU during the run itself, e.g. --fast-forward and --repeat_switch. That is where you will see both cpu and switch_cpus data get filled up.
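One quick way to see where the counter ended up (a small sketch, assuming the default m5out output directory and a single CPU):

def read_num_cycles(stats_path="m5out/stats.txt"):
    # Pull the two cycle counters mentioned above out of gem5's stats.txt.
    wanted = ("system.cpu.numCycles", "system.switch_cpus.numCycles")
    counters = {}
    with open(stats_path) as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] in wanted:
                counters[parts[0]] = int(parts[1])
    return counters

print(read_num_cycles())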
TODO: also, if I use or remove --restore-with-cpu=, there is a small 1% cycle difference. But why is there a difference at all? AtomicSimpleCPU cycle count is completely different, so it must not be that it is falling back to it.
--cpu-type= vs --restore-with-cpu= showed up in fs.py --fast-forward: https://www.mail-archive.com/gem5-users#gem5.org/msg17418.html
Confirm what is happening with logging
One good sanity check that the CPU you want is actually being used is to enable some logging, as shown at: https://github.com/cirosantilli/linux-kernel-module-cheat/tree/bab029f60656913b5dea629a220ae593cc16147d#gem5-restore-checkpoint-with-a-different-cpu e.g.:
--debug-flags ExecAll,FmtFlag,O3CPU,SimpleCPU
and then see if you start to get O3 messages rather than SimpleCPU ones.

Google Cloud ML Engine "out-of-memory" error when Memory utilization is nearly zero

I am following the Tensorflow Object Detection API tutorial to train a Faster R-CNN model on my own dataset on Google Cloud. But the following "ran out-of-memory" error kept happening.
The replica master 0 ran out-of-memory and exited with a non-zero status of 247.
And according to the logs, a non-zero exit status -9 was returned. As described in the official documentation, a code of -9 might mean the training is using more memory than allocated.
However, the memory utilization is lower than 0.2, so why am I having a memory problem? If it helps, the memory utilization graph is here.
The memory utilization graph is an average across all workers. In the case of an out of memory error, it's also not guaranteed that the final data points are successfully exported (e.g., a huge sudden spike in memory). We are taking steps to make the memory utilization graphs more useful.
If you are using the Master to also do evaluation (as exemplified in most of the samples), then the Master uses ~2x the RAM relative to a normal worker. You might consider using the large_model machine type.
The running_pets tutorial uses the BASIC_GPU tier, so perhaps the GPU has run out of memory.
The graphs on ML engine currently only show CPU memory utilization.
If this is the case, changing your tier to larger GPUs will solve the problem. Here is some information about the different tiers.
On the same page, you will find an example of a yaml file showing how to configure it.
Looking at your error, it seems that your ML code is consuming more memory than it was originally allocated.
Try with a machine type that allows you more memory such as "large_model" or "complex_model_l". Use a config.yaml to define it as follows:
trainingInput:
  scaleTier: CUSTOM
  # 'large_model' for bigger model with lots of data
  masterType: large_model
  runtimeVersion: "1.4"
There is a similar question, Google Cloud machine learning out of memory. Please refer to that link for the actual solution.

Is there a way to write TensorFlow checkpoints asynchronously?

Currently I make checkpoints during training like this (pseudocode):
while training:
    model.train()
    if it_is_time_for_validation():
        metrics = model.validate()
        if metrics.are_good():
            saver = tf.train.Saver()
            res = saver.save(sess=session, save_path=checkpoint_file_path)
The Saver.save method blocks on I/O, preventing the next iterations from running.
My model's weights size is hundreds of megabytes and it takes a while to write all this stuff.
By my calculations, depending on checkpoint frequency, the GPU overall spends 5-10% of its time waiting for checkpoints to finish instead of doing useful calculations (5-10% is the equivalent of a day of computation).
Is there a way to perform checkpoints asynchronously to reduce the waste of computational time?
Implementation sketch: first we might copy everything necessary from the device memory to host, and perform disk I/O on a separate thread. Saver.save would return after memcopy, without waiting for disk operations, as it is safe to train the device copy now without screwing up the checkpoint. Saver.save would still block on re-entry if there is I/O pending from the previous iteration.
I don't think it's currently implemented, so I am interested in possible workarounds as well. Is this idea nice enough to be a feature request on GitHub?
You can write checkpoints asynchronously by running saver.save() in a separate thread. The (internal) SVTimerCheckpointThread is an example of code that runs saver.save() periodically in the background of training. Note that the tf.train.Supervisor is a utility class that helps with managing such background threads (also for writing TensorBoard summary logs, etc.), so you might want to use it instead.
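A minimal sketch of that idea, assuming TF 1.x style training as in your pseudocode (the lock to guard re-entry is my own addition, not something the Supervisor provides):

import threading

_save_lock = threading.Lock()   # serializes overlapping saves

def save_async(saver, session, path, step=None):
    """Kick off saver.save() on a background thread and return immediately."""
    def _worker():
        with _save_lock:
            saver.save(sess=session, save_path=path, global_step=step)
    t = threading.Thread(target=_worker, daemon=True)
    t.start()
    return t   # join() it before exiting if the last checkpoint must be on disk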

Storing entire process state on disk and restoring it later? (On Linux/Unix)

I would like to know: Is there a system call, library, kernel module or command line tool I can use to store the complete state of a running program on the disk?
That is: I would like to completely dump the memory, page layout, stack, registers, threads and file descriptors a process is currently using to a file on the hard drive and be able to restore it later seamlessly, just like an emulator "savestate" or a Virtual Machine "snapshot".
I would also like, if possible, to have multiple "backup copies" of the program state, so I can revert to a previous execution point if the program dies for some reason.
Is this possible?
You should take a look at the BLCR project from Berkeley Lab. It is widely used by several MPI implementations to provide checkpoint/restart capabilities for parallel applications.
A core dump is basically this, so yes, it must be possible to get one.
What you really want is a way to restore that dump as a running program. That might be more difficult.