Distributed Tensorflow: worker killed due to OOM - tensorflow

I'm running distributed tensorflow training similar to the Inception sample code but using this device setter:
with tf.device(tf.train.replica_device_setter(ps_tasks=1,
        worker_device="/job:worker/task:%d" % FLAGS.task_id,
        cluster=cluster_spec)):
The machine has 4 GPUs and 64 GB of RAM. The ps job runs on the CPU alone, and I have two worker jobs running on 2 separate GPUs. The resident memory footprint of both worker jobs keeps gradually increasing until, at around 3000 steps, the chief worker gets killed by OOM (both workers occupy ~49% of RAM before the crash). I have tried with a single worker too, and it gets killed as well. The ps job has a much smaller footprint.
I have tried disabling the summary ops, the model saver, and the variables averager, and reducing the number of reader threads, but to no avail.

I fixed this issue with the workers by commenting out the with tf.device('/cpu:0'): spec around the call to batch_inputs in image_processing.py. One reason this may have happened with my setup, though it is not completely clear to me, is that I use
with tf.device(tf.train.replica_device_setter(ps_tasks=1,
        worker_device="/job:worker/task:%d" % FLAGS.task_id,
        cluster=cluster_spec)):
instead of
# Ops are assigned to worker by default.
with tf.device('/job:worker/task:%d' % FLAGS.task_id):
    # Variables and its related init/assign ops are assigned to ps.
    with slim.scopes.arg_scope(
            [slim.variables.variable, slim.variables.global_step],
            device=slim.variables.VariableDeviceChooser(num_parameter_servers)):
as the outermost training scope inside which the batch processing gets called (inception_distributed_train.py).
I'm not sure exactly why this became a problem for my modified setup (there is little documentation about the mechanics of how device assignments are made), but with this change the memory growth has slowed at least ten-fold, and the job has been tested to run for 100 epochs.
Maybe the original code will work fine without this CPU device specification as well.
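For reference, here is a minimal sketch of the resulting structure, assuming cluster_spec, FLAGS.task_id, and a batch_inputs helper as in the Inception code; the batch_inputs call signature shown here is illustrative only:
import tensorflow as tf

with tf.device(tf.train.replica_device_setter(
        ps_tasks=1,
        worker_device="/job:worker/task:%d" % FLAGS.task_id,
        cluster=cluster_spec)):
    # The input pipeline used to be additionally pinned to the CPU:
    #   with tf.device('/cpu:0'):
    #       images, labels = batch_inputs(...)
    # Dropping that inner pin lets the replica_device_setter place the input ops,
    # which is the change that slowed the memory growth in my setup.
    images, labels = batch_inputs(dataset, batch_size, train=True)  # illustrative signature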

Related

h2o task taking unexpectedly long leading to it getting stuck

I successfully initialise a cluster and train a DRF model. Then, on the same cluster, I try to do a grid search for an XGBoost model.
H2OGridSearch(
    H2OXGBoostEstimator(my_model_params),
    hyper_params=my_grid_params,
    search_criteria=my_search_criteria
)
Sometimes (not always) the grid search never finishes. Upon inspection in the H2O flow I found the job stuck at 0% progress with a 'RUNNING' status.
What I saw in the logs is the following
WARN: XGBoost task of type 'Boosting Iteration (tid=0)' is taking unexpectedly long, it didn't finish in 360 seconds.
WARN: XGBoost task of type 'Boosting Iteration (tid=0)' is taking unexpectedly long, it didn't finish in 420 seconds.
...
WARN: XGBoost task of type 'Boosting Iteration (tid=0)' is taking unexpectedly long, it didn't finish in 60240 seconds.
and after that I get
ERRR: water.api.HDFSIOException: HDFS IO Failure:
but the job's status is still 'RUNNING'.
I'm using h2o 3.30.0.6 via Python 3.7.
The problem is that the error is not reproducible and sometimes it just works fine.
Any hints on how to track down the root cause?
Is there a parameter I can set for killing the whole job when a boosting iteration takes too long?
For XGBoost, if it becomes unresponsive, you may need to allocate additional memory for it, since it uses memory independently of the H2O algorithms.
Why does my H2O cluster on Hadoop become unresponsive when running XGBoost even when I supplied 4 times the datasize memory?
This is why the extramempercent option exists, and we recommend setting this to a high value, such as 120. What happens internally is that when you specify -node_memory 10G and -extramempercent 120, the h2o driver will ask Hadoop for 10G * (1 + 1.2) = 22G of memory. At the same time, the h2o driver will limit the memory used by the container JVM (the h2o node) to 10G, leaving the remaining 10G * 120% = 12G of memory "unused." This memory can then be safely used by XGBoost outside of the JVM. Keep in mind that H2O algorithms will only have access to the JVM memory (10G), while XGBoost will use the native memory for model training. For example:
hadoop jar h2odriver.jar -nodes 1 -mapperXmx 20g -extramempercent 120
Source
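To make the arithmetic concrete, here is a small sketch of how the per-container memory request follows from -node_memory (or -mapperXmx) and -extramempercent; the helper function is illustrative, not part of the H2O API:
def container_memory_gb(node_memory_gb, extramempercent):
    # Total memory the h2o driver asks Hadoop for per container.
    return node_memory_gb * (1 + extramempercent / 100.0)

# -node_memory 10G -extramempercent 120 -> 10 * (1 + 1.2) = 22 GB requested,
# of which 10 GB goes to the JVM (H2O algorithms) and the remaining 12 GB is
# left as native memory that XGBoost can use outside the JVM.
print(container_memory_gb(10, 120))  # 22.0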

How to switch CPU models in gem5 after restoring a checkpoint and then observe the difference?

I want to boot the Linux kernel in full system (FS) mode with a lightweight CPU to save time, make a checkpoint after boot finishes, and then restore the checkpoint with a more detailed CPU to study a benchmark, as mentioned at: http://gem5.org/Checkpoints
However, when I try to use -r 1 --restore-with-cpu=, I cannot observe cycle differences between the new and old CPU.
The measure I'm looking at is how cache sizes affect the number of cycles that a benchmark takes to run.
The setup I'm using is described in detail at: Why doesn't the Linux kernel see the cache sizes in the gem5 emulator in full system mode? I'm looking at the cycle counts because I can't see cache sizes directly with the Linux kernel currently.
For example, if I boot the Linux kernel from scratch with the detailed and slow HPI model with command (excerpt):
./build/ARM/gem5.opt --cpu-type=HPI --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024
and then change the cache sizes, the benchmark does get faster as the caches get larger, as expected.
However, if I first boot without --cpu-type=HPI, which uses the faster AtomicSimpleCPU model:
./build/ARM/gem5.opt --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024
and then I create the checkpoint with m5 checkpoint and try to restore the faster CPU:
./build/ARM/gem5.opt --restore-with-cpu=HPI -r 1 --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024
then changing the cache sizes makes no difference: I always get the same cycle counts as I do for the AtomicSimpleCPU, indicating that the modified restore was not successful.
The same happens on x86 if I try to switch from AtomicSimpleCPU to DerivO3CPU.
Related old thread on the mailing list: http://thread.gmane.org/gmane.comp.emulators.m5.users/14395
Tested at: fbe63074e3a8128bdbe1a5e8f6509c565a3abbd4
From reading through some of the code, I believe that --restore-with-cpu is specifically for the case where your checkpoint was created using a CPU model that isn't the AtomicCPU; the scripts assume that AtomicCPU was used to create the checkpoint. I think that when restoring, it's important to use the same CPU model the system was checkpointed with; if you give another model with --cpu-type, then it switches to that model after the restore operation has completed.
http://gem5.org/Checkpoints#Sampling has some (small) detail on switching cpu models
First, regarding your question, I don't see how the cycle count is an indication of the restoration result. The cycle count being restored should be the same regardless of which CPU you switch to; switching does not change the past cycles. When creating a checkpoint, you basically freeze the simulation at that state, and switching CPUs simply changes all the parameters of the CPU while keeping the ticks unchanged. It is like hot-swapping a CPU.
To correctly verify the restoration, you should keep a copy of config.json from before the restoration and compare it with the new one after the restoration. In the x86 case, I could find the string AtomicSimpleCPU there only before the restore.
Furthermore, only --cpu-type determines which CPU is switched to, but that does not make --restore-with-cpu useless. In fact, --restore-with-cpu should only be used when you booted the system with a CPU other than AtomicSimpleCPU. Most people boot the system with AtomicSimpleCPU and make a checkpoint, since it is faster; but if you booted with DerivO3CPU, then to restore that particular checkpoint you have to set --restore-with-cpu to DerivO3CPU, otherwise it will fail.
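A quick way to do that config.json comparison is to just search for the CPU model names in each file, as described above; the file paths below are illustrative and should point at the m5out directories of the two runs:
# Hedged sketch: report which CPU model names appear in a gem5 config.json.
def cpu_models_mentioned(config_path, models=("AtomicSimpleCPU", "DerivO3CPU", "HPI")):
    with open(config_path) as f:
        text = f.read()
    return [m for m in models if m in text]

print(cpu_models_mentioned("m5out-before-restore/config.json"))
print(cpu_models_mentioned("m5out-after-restore/config.json"))  # AtomicSimpleCPU should be gone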
--cpu-type= affected the restore, but --restore-with-cpu= did not
I am not sure why that is, but I have empirically verified that if I do:
-r 1 --cpu-type=HPI
then, as expected, the cache size options start to affect the cycle counts: larger caches lead to fewer cycles.
Also keep in mind that caches don't affect AtomicSimpleCPU much, and there is not much point in having them.
TODO: so what is the point of --restore-with-cpu= vs --cpu-type= if it didn't seem to do anything in my tests?
Except to confuse me, since if --cpu-type != --restore-with-cpu, then the cycle count appears under system.switch_cpus.numCycles instead of system.cpu.numCycles.
I believe this is what is going on (yet untested):
- switch_cpus contains stats for the CPU you switched to
- when you set --restore-with-cpu= != --cpu-type, it thinks you have already switched CPUs from the start
- --restore-with-cpu has no effect on the initial CPU; it only matters for options that switch the CPU during the run itself, e.g. --fast-forward and --repeat_switch. This is where you will see both cpu and switch_cpus data get filled up.
TODO: also, whether I use or remove --restore-with-cpu=, there is a small 1% cycle difference. But why is there a difference at all? The AtomicSimpleCPU cycle count is completely different, so it cannot be falling back to that.
--cpu-type= vs --restore-with-cpu= showed up in fs.py --fast-forward: https://www.mail-archive.com/gem5-users@gem5.org/msg17418.html
Confirm what is happening with logging
One good sanity check that the CPU you want is actually being used is to enable some logging as shown at: https://github.com/cirosantilli/linux-kernel-module-cheat/tree/bab029f60656913b5dea629a220ae593cc16147d#gem5-restore-checkpoint-with-a-different-cpu e.g.:
--debug-flags ExecAll,FmtFlag,O3CPU,SimpleCPU
and then see if you start to get O3 messages rather than SimpleCPU ones.
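If you redirect that debug output to a file, here is a small sketch of the check (the log path is illustrative):
from collections import Counter

# Count debug lines mentioning each CPU flag; after a successful switch the
# O3CPU lines should dominate over the SimpleCPU ones.
counts = Counter()
with open("gem5-debug.log") as log:  # wherever you redirected the --debug-flags output
    for line in log:
        if "O3CPU" in line:
            counts["O3CPU"] += 1
        elif "SimpleCPU" in line:
            counts["SimpleCPU"] += 1
print(counts)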

Where do Workers and Parameter Servers reside in Distributed TensorFlow?

In this post, it was mentioned that:
Also, there's no built-in distinction between worker and ps devices --
it's just a convention that variables get assigned to ps devices, and
ops are assigned to worker devices.
In this post, it was mentioned that:
TL;DR: TensorFlow doesn't know anything about "parameter servers", but
instead it supports running graphs across multiple devices in
different processes. Some of these processes have devices whose names
start with "/job:ps", and these hold the variables. The workers drive
the training process, and when they run the train_op they will cause
work to happen on the "/job:ps" devices, which will update the shared
variables.
Questions:
Do variables in ps reside on the CPU or GPU? Also, are there any performance gains if "/job:ps" resides on CPU or GPU?
Do the lower level libraries decide where to place a variable or operation?
Do variables in ps reside on the CPU or GPU? Also, are there any performance gains if "/job:ps" resides on CPU or GPU?
You can pin the ps job to either one of those (with exceptions, see below), but pinning it to a GPU is not practical. The ps job is really just storage for the parameters plus the ops that update them. A CPU device can have a lot more memory (i.e., main RAM) than a GPU and is fast enough to update the parameters as the gradients come in. In most cases, matrix multiplications, convolutions and other expensive ops are done by the workers, hence placing a worker on a GPU makes sense. Placing a ps on a GPU is a waste of resources, unless the ps job is doing something very specific and expensive.
But: Tensorflow does not currently have a GPU kernel for integer variables, so the following code will fail when Tensorflow tries to place the variable i on GPU #0:
with tf.device("/gpu:0"):
i = tf.Variable(3)
with tf.Session() as sess:
sess.run(i.initializer) # Fails!
with the following message:
Could not satisfy explicit device specification '/device:GPU:0'
because no supported kernel for GPU devices is available.
This is the case when there's no choice of device for a parameter, and thus for a parameter server: only CPU.
Do the lower level libraries decide where to place a variable or operation?
If I get the question right, the node placement rules are pretty simple (a short sketch follows below):
If a node was already placed on a device in a previous run of the graph, it is left on that device.
Else, if the user pinned a node to a device via tf.device, the placer places it on that device.
Else, it defaults to GPU #0, or the CPU if there is no GPU.
The TensorFlow whitepaper also describes a dynamic placer, which is more sophisticated, but it's not part of the open-source version of TensorFlow right now.
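A minimal illustration of rules 2 and 3 above, assuming a machine with one GPU; log_device_placement makes TensorFlow print where each op actually landed:
import tensorflow as tf

with tf.device("/cpu:0"):
    a = tf.constant([1.0, 2.0], name="a")  # explicitly pinned to the CPU (rule 2)
b = tf.constant([3.0, 4.0], name="b")      # unpinned: defaults to GPU #0 if one exists (rule 3)
c = a + b

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))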

How does TensorFlow cluster distribute load across machines if not specified explicitly?

I took "Distributed TensorFlow" how-to and tried to apply it to the "MNIST For ML Beginners" tutorial. I started three TensorFlow worker nodes locally (there are 8 cores in the PC) and ran the training script with replacing this line:
sess = tf.InteractiveSession()
with the following:
sess = tf.InteractiveSession("grpc://localhost:12345")
where 12345 is the port on which node 0 is listening (e.g. the master session is created on node 0). Note that I did not specify explicitly where computations should be performed.
Looking at htop's output, I can see that the job is indeed performed by the cluster - it consumes some CPU. However, the only consumer is node 0; the remaining nodes do not perform any work. If I select node 1 as the place to create the master session, the picture changes: only ~2/3 of the work is performed on node 0 (judging by CPU load), and the remaining 1/3 is performed on node 1. If I select node 2 as master, then that 1/3 of the work is performed on node 2. If I run two processes in parallel, one using node 1 as master and another using node 2 as master, both nodes 1 and 2 get some load, but node 0 is loaded much more (like 200% vs 60% vs 60% of CPU).
So far it looks like the "default" behavior of distributed TensorFlow is not great for parallelizing work automatically right now. I'm wondering what the behavior actually is and whether distributed TensorFlow is intended for automatic data parallelization at all (as opposed to manual model parallelization).
TF is great for data parallelization, e.g. when you need to sift through tons of data, which is then distributed to multiple GPUs.
It's also great for weights parallelization. Using tf.train.replica_device_setter, weights are distributed among multiple devices for better IO.
Now, it seems you are asking for parallelization within a single model. That's difficult to do automatically, since TF does not know the best way to distribute the computation of the same model across multiple devices. It would depend on too many factors, e.g. how fast the connection between your devices is.
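If you do want specific parts of the graph to run on specific workers, you have to pin them explicitly; here is a minimal sketch against the cluster from the question (the port and task numbers are illustrative):
import tensorflow as tf

# Pin ops to specific worker tasks instead of relying on default placement,
# which otherwise keeps most of the work on the master's node.
with tf.device("/job:worker/task:0"):
    a = tf.random_normal([1000, 1000])
    b = tf.matmul(a, a)
with tf.device("/job:worker/task:1"):
    c = tf.matmul(b, b)

sess = tf.InteractiveSession("grpc://localhost:12345")
print(sess.run(tf.reduce_sum(c)))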

Is there a way to write TensorFlow checkpoints asynchronously?

Currently I make checkpoints during training like this (pseudocode):
while(training):
    model.train()
    if it_is_time_for_validation():
        metrics = model.validate()
        if metrics.are_good():
            saver = tf.train.Saver()
            res = saver.save(sess=session, save_path=checkpoint_file_path)
The Saver.save method blocks on I/O, preventing the next iterations from running.
My model's weights are hundreds of megabytes, and it takes a while to write all of this out.
By my calculations, depending on checkpoint frequency, the GPU spends 5-10% of its time overall waiting for checkpoints to finish instead of doing useful computation (5-10% is equivalent to a day of computation).
Is there a way to perform checkpoints asynchronously to reduce the waste of computational time?
Implementation sketch: first we might copy everything necessary from the device memory to host, and perform disk I/O on a separate thread. Saver.save would return after memcopy, without waiting for disk operations, as it is safe to train the device copy now without screwing up the checkpoint. Saver.save would still block on re-entry if there is I/O pending from the previous iteration.
I don't think it's currently implemented, so I am interested in possible workarounds as well. Is this idea nice enough to be a feature request on GitHub?
You can write checkpoints asynchronously by running saver.save() in a separate thread. The (internal) SVTimerCheckpointThread is an example of code that runs saver.save() periodically in the background of training. Note that the tf.train.Supervisor is a utility class that helps with managing such background threads (also for writing TensorBoard summary logs, etc.), so you might want to use it instead.
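Here is a minimal sketch of the separate-thread approach, assuming an existing session and checkpoint path (the names are illustrative). Note that, unlike the memcpy-first idea in the question, the variables are only read when saver.save actually runs on the background thread, so the checkpoint reflects whatever values are current at that moment:
import threading
import tensorflow as tf

saver = tf.train.Saver()  # create once, outside the training loop
save_thread = None

def async_save(session, path):
    global save_thread
    # Block only if the previous checkpoint is still being written.
    if save_thread is not None:
        save_thread.join()
    save_thread = threading.Thread(
        target=saver.save, kwargs=dict(sess=session, save_path=path))
    save_thread.start()

# Inside the training loop, instead of saver.save(...):
# async_save(session, checkpoint_file_path)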