H2O task taking unexpectedly long, leading to it getting stuck - XGBoost

I successfully initialise a cluster and train a DRF model. Then, on the same cluster, I try to do a grid search for an XGBoost model:
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.xgboost import H2OXGBoostEstimator

grid = H2OGridSearch(
    H2OXGBoostEstimator(my_model_params),
    hyper_params=my_grid_params,
    search_criteria=my_search_criteria
)
Sometimes (not always) the grid search never finishes. Upon inspection in H2O Flow, I found the job stuck at 0% progress with a 'RUNNING' status.
What I see in the logs is the following:
WARN: XGBoost task of type 'Boosting Iteration (tid=0)' is taking unexpectedly long, it didn't finish in 360 seconds.
WARN: XGBoost task of type 'Boosting Iteration (tid=0)' is taking unexpectedly long, it didn't finish in 420 seconds.
...
WARN: XGBoost task of type 'Boosting Iteration (tid=0)' is taking unexpectedly long, it didn't finish in 60240 seconds.
and after that I get
ERRR: water.api.HDFSIOException: HDFS IO Failure:
but the job's status is still 'RUNNING'.
I'm using h2o 3.30.0.6 via Python 3.7.
The problem is that the error is not reproducible and sometimes it just works fine.
Any hints on how to track down the root cause?
Is there a parameter I can set for killing the whole job when a boosting iteration takes too long?

For XGBoost, if it becomes unresponsive, you may need to allocate additional memory for it, since it uses memory independently of the H2O algorithms (it allocates native memory outside the JVM).
Why does my H2O cluster on Hadoop become unresponsive when running XGBoost, even when I supplied 4 times the data size in memory?
This is why the extramempercent option exists, and we recommend setting this to a high value, such as 120. What happens internally is that when you specify -node_memory 10G and -extramempercent 120, the h2o driver will ask Hadoop for 10G * (1 + 1.2) = 22G of memory. At the same time, the h2o driver will limit the memory used by the container JVM (the h2o node) to 10G, leaving the extra 10G * 1.2 = 12G of memory "unused." This memory can then be safely used by XGBoost outside of the JVM. Keep in mind that H2O algorithms will only have access to the JVM memory (10GB), while XGBoost will use the native memory for model training. For example:
hadoop jar h2odriver.jar -nodes 1 -mapperXmx 20g -extramempercent 120
Source
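Regarding the second question (killing the whole job when a boosting iteration takes too long): I am not aware of a per-iteration timeout, but the job as a whole can at least be bounded, since H2O's grid search_criteria accepts a max_runtime_secs (and max_models) budget with the RandomDiscrete strategy, and the estimators themselves also accept max_runtime_secs. A sketch reusing the placeholder names from the question:
my_search_criteria = {
    'strategy': 'RandomDiscrete',
    'max_runtime_secs': 3600,  # stop the whole grid search after one hour
    'max_models': 20           # or after 20 models, whichever comes first
}
grid = H2OGridSearch(
    H2OXGBoostEstimator(my_model_params),
    hyper_params=my_grid_params,
    search_criteria=my_search_criteria
)
Note that this caps total runtime rather than detecting a single hung boosting iteration, so it may not help if the native XGBoost call itself never returns.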

Related

Optimizing TensorFlow for a 32-core computer

I'm running TensorFlow code on an Intel Xeon machine with 2 physical CPUs, each with 8 cores and hyperthreading, for a grand total of 32 available virtual cores. However, when I run the code with the system monitor open, I notice that only a small fraction of these 32 vCores is used and that the average CPU usage is below 10%.
I'm quite the TensorFlow beginner and haven't configured the session in any way. My question is: should I somehow tell TensorFlow how many cores it can use? Or should I assume that it is already trying to use all of them but there is a bottleneck somewhere else (for example, slow access to the hard disk)?
TensorFlow will attempt to use all available CPU resources by default. You don't need to configure anything for it. There can be many reasons why you might be seeing low CPU usage. Here are some possibilities:
The most common case, as you point out, is the slow input pipeline.
Your graph might be mostly linear, i.e. a long narrow chain of operations on relatively small amounts of data, each depending on outputs of the previous one. When a single operation is running on smallish inputs, there is little benefit in parallelizing it.
You can also be limited by the memory bandwidth.
If a single session.run() call does only a little work, it takes little time, and you end up spending most of the time going back and forth between Python and the execution engine.
You can find useful suggestions here.
Use the timeline to see what is executed when (a sketch is shown below).
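For completeness, a minimal TF 1.x sketch showing both knobs: how you could set the thread pools explicitly (the default of 0 already means "use all available cores", as the answer says; 16 here is an arbitrary choice) and how to dump a timeline trace for one step. The matmul workload is just a stand-in.
import tensorflow as tf
from tensorflow.python.client import timeline

config = tf.ConfigProto(
    intra_op_parallelism_threads=16,  # threads used inside a single op
    inter_op_parallelism_threads=16)  # ops that may run concurrently

a = tf.random_normal([4096, 4096])
b = tf.random_normal([4096, 4096])
c = tf.matmul(a, b)

with tf.Session(config=config) as sess:
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(c, options=run_options, run_metadata=run_metadata)
    # Write a Chrome trace; open it at chrome://tracing to see what ran when
    trace = timeline.Timeline(run_metadata.step_stats)
    with open("timeline.json", "w") as f:
        f.write(trace.generate_chrome_trace_format())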

How to switch CPU models in gem5 after restoring a checkpoint and then observe the difference?

I want to boot the Linux kernel in full system (FS) mode with a lightweight CPU to save time, make a checkpoint after boot finishes, and then restore the checkpoint with a more detailed CPU to study a benchmark, as mentioned at: http://gem5.org/Checkpoints
However, when I try to use -r 1 --restore-with-cpu=, I cannot observe any cycle difference between the new and old CPU.
The measure I'm looking at is how cache sizes affect the number of cycles that a benchmark takes to run.
The setup I'm using is described in detail at: Why doesn't the Linux kernel see the cache sizes in the gem5 emulator in full system mode? I'm looking at the cycle counts because I can't see cache sizes directly with the Linux kernel currently.
For example, if I boot the Linux kernel from scratch with the detailed and slow HPI model with command (excerpt):
./build/ARM/gem5.opt --cpu-type=HPI --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024
and then change the cache sizes, the benchmark does get faster as the caches get larger, as expected.
However, if I first boot without --cpu-type=HPI, which uses the faster AtomicSimpleCPU model:
./build/ARM/gem5.opt --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024
and then I create the checkpoint with m5 checkpoint and try to restore with the more detailed CPU:
./build/ARM/gem5.opt --restore-with-cpu=HPI -r 1 --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024
then changing the cache sizes makes no difference: I always get the same cycle counts as I do for the AtomicSimpleCPU, which indicates that the restore with the different CPU was not successful.
The analogous thing happens on x86 if I try to switch from AtomicSimpleCPU to DerivO3CPU.
Related old thread on the mailing list: http://thread.gmane.org/gmane.comp.emulators.m5.users/14395
Tested at: fbe63074e3a8128bdbe1a5e8f6509c565a3abbd4
From reading through some of the code, I believe that --restore-with-cpu is specifically for the case when your checkpoint was created using a CPU model that isn't the AtomicCPU. The scripts assume that the AtomicCPU was used to create the checkpoint. I think that when restoring, it is important to have the same CPU model as the one the system was checkpointed with; if you give another model with --cpu-type, then it switches to that model after the restore operation has completed.
http://gem5.org/Checkpoints#Sampling has some (small) detail on switching cpu models
First, regarding your question, I don't see how the cycle count is an indication of the restoration result. The cycles being restored should be the same regardless of which CPU you switch to; switching does not change the past cycles. When creating a checkpoint, you basically freeze the simulation at that state, and switching the CPU simply changes all the parameters of the CPU while keeping the ticks unchanged. It is like hot-swapping a CPU.
To correctly verify the restoration, you should keep a copy of config.json from before the restoration and compare it with the new one after the restoration. For the x86 case, I could find the string AtomicSimpleCPU there only before the restore.
Furthermore, only --cpu-type determines the CPU being switched to. But that does not make --restore-with-cpu useless. In fact, --restore-with-cpu should only be used when you boot up the system with a CPU other than AtomicSimpleCPU. Most people want to boot up the system with AtomicSimpleCPU and make a checkpoint, since it is faster. But if you mistakenly boot up using DerivO3CPU, then to restore this particular checkpoint you have to set --restore-with-cpu to DerivO3CPU; otherwise, it will fail.
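A small sketch of that config.json check (the output directory names here are made up): search the raw files for the CPU class names, which is what the answer above did by hand.
CPU_CLASSES = ["AtomicSimpleCPU", "TimingSimpleCPU", "DerivO3CPU", "HPI"]

def cpus_mentioned(config_path):
    # Return the CPU class names that appear anywhere in this config.json.
    with open(config_path) as f:
        text = f.read()
    return [name for name in CPU_CLASSES if name in text]

print(cpus_mentioned("m5out-before-restore/config.json"))
print(cpus_mentioned("m5out-after-restore/config.json"))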
--cpu-type= affected the restore, but --restore-with-cpu= did not
I am not sure why that is, but I have empirically verified that if I do:
-r 1 --cpu-type=HPI
then, as expected, the cache size options start to affect the cycle counts: larger caches lead to fewer cycles.
Also keep in mind that caches barely affect the AtomicSimpleCPU's timing, so there is not much point in having them with that model.
TODO: so what is the point of --restore-with-cpu= vs --cpu-type if it didn't seem to do anything in my tests?
Except to confuse me: if --cpu-type != --restore-with-cpu, then the cycle count appears under system.switch_cpus.numCycles instead of system.cpu.numCycles.
I believe this is what is going on (yet untested):
- switch_cpu contains the stats for the CPU you switched to
- when you set --restore-with-cpu= != --cpu-type, it thinks you have already switched CPUs from the start
- --restore-with-cpu has no effect on the initial CPU. It only matters for options that switch the CPU during the run itself, e.g. --fast-forward and --repeat_switch. That is where you will see both the cpu and switch_cpu data get filled up.
TODO: also, whether I use or remove --restore-with-cpu=, there is a small 1% cycle difference. But why is there a difference at all? The AtomicSimpleCPU cycle count is completely different, so it can't simply be falling back to it.
--cpu-type= vs --restore-with-cpu= showed up in fs.py --fast-forward: https://www.mail-archive.com/gem5-users@gem5.org/msg17418.html
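To check programmatically which CPU's stats got populated (the system.cpu.numCycles vs system.switch_cpus.numCycles distinction mentioned above), a quick sketch, assuming the default m5out/stats.txt output location and layout:
def num_cycles(stats_path="m5out/stats.txt"):
    # Collect every *.numCycles entry from a gem5 stats file.
    cycles = {}
    with open(stats_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2 and parts[0].endswith(".numCycles"):
                cycles[parts[0]] = int(parts[1])
    return cycles

print(num_cycles())  # e.g. {'system.switch_cpus.numCycles': ...}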
Confirm what is happening with logging
One good sanity check that the CPU you want is actually being used is to enable some logging, as shown at: https://github.com/cirosantilli/linux-kernel-module-cheat/tree/bab029f60656913b5dea629a220ae593cc16147d#gem5-restore-checkpoint-with-a-different-cpu e.g.:
--debug-flags ExecAll,FmtFlag,O3CPU,SimpleCPU
and then see if you start to get O3 messages rather than SimpleCPU ones.

Google Cloud ML Engine "out-of-memory" error when Memory utilization is nearly zero

I am following the Tensorflow Object Detection API tutorial to train a Faster R-CNN model on my own dataset on Google Cloud. But the following "ran out-of-memory" error kept happening.
The replica master 0 ran out-of-memory and exited with a non-zero status of 247.
And according to the logs, a non-zero exit status -9 was returned. As described in the official documentation, a code of -9 might mean the training is using more memory than allocated.
However, the memory utilization is lower than 0.2. So why am I having the memory problem? If it helps, the memory utilization graph is here.
The memory utilization graph is an average across all workers. In the case of an out of memory error, it's also not guaranteed that the final data points are successfully exported (e.g., a huge sudden spike in memory). We are taking steps to make the memory utilization graphs more useful.
If you are using the Master to also do evaluation (as exemplified in most of the samples), then the Master uses ~2x the RAM relative to a normal worker. You might consider using the large_model machine type.
The running_pets tutorial uses the BASIC_GPU tier, so perhaps the GPU has run out of memory.
The graphs on ML engine currently only show CPU memory utilization.
If this is the case, changing your tier to larger GPUs will solve the problem. Here is some information about the different tiers.
On the same page, you will find an example YAML file showing how to configure it.
Looking at your error, it seems that your ML code is consuming more memory than it was originally allocated.
Try with a machine type that allows you more memory such as "large_model" or "complex_model_l". Use a config.yaml to define it as follows:
trainingInput:
  scaleTier: CUSTOM
  # 'large_model' for bigger model with lots of data
  masterType: large_model
  runtimeVersion: "1.4"
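For reference, a usage sketch (the job name, module, package path, and region are placeholders, and you would add your usual --job-dir / staging arguments): the config.yaml above is passed to the training job with --config when submitting it:
gcloud ml-engine jobs submit training my_training_job \
    --module-name trainer.task \
    --package-path trainer/ \
    --region us-central1 \
    --config config.yaml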
There is a similar question, Google Cloud machine learning out of memory. Please refer to that link for the actual solution.

How do I mitigate CUDA's very long initialization delay?

Initializing CUDA in a newly-created process can take quite some time, as long as half a second or more on many of today's server-grade machines. As @RobertCrovella explains, CUDA initialization usually includes the establishment of a Unified Memory model, which involves harmonizing the device and host memory maps. This can take quite a long time for machines with a lot of memory, and there might be other factors contributing to this long delay.
This effect becomes quite annoying when you want to run a sequence of CUDA-utilizing processes which do not use complicated virtual memory mappings: they each have to endure their long wait, despite the fact that, "essentially", they could just re-use whatever initialization CUDA made the last time (perhaps with a bit of cleanup code).
Now, obviously, if you somehow rewrote the code for all those processes to execute within a single process - that would save you those long initialization costs. But isn't there a simpler approach? What about:
Passing the same state information / CUDA context between processes?
Telling CUDA to ignore most host memory altogether?
Making the Unified Memory harmonization more lazy than it is now, so that it only happens to the extent that it's actually necessary?
Starting CUDA with Unified Memory disabled?
Keeping some daemon process on the side and latching on to its already-initialized CUDA state?
What you are asking about already exists. It is called MPS (Multi-Process Service), and it basically keeps a single GPU context alive at all times, using a daemon process that emulates the driver API. The initial target application is MPI, but it does basically what you envisage.
Read more here:
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
http://on-demand.gputechconf.com/gtc/2015/presentation/S5584-Priyanka-Sah.pdf
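For reference, a minimal sketch of starting and stopping the MPS control daemon (see the overview document above for the authoritative, current steps):
# start the MPS control daemon; it serves subsequent CUDA client processes
nvidia-cuda-mps-control -d
# ... run your sequence of CUDA-using processes here ...
# shut the daemon down again
echo quit | nvidia-cuda-mps-control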

Distributed Tensorflow: worker killed due to OOM

I'm running distributed TensorFlow training similar to the Inception sample code, but using this device setter:
with tf.device(tf.train.replica_device_setter(ps_tasks=1,
        worker_device="/job:worker/task:%d" % FLAGS.task_id,
        cluster=cluster_spec)):
The machine has 4 GPUs and 64 GB of RAM. The ps job runs on the CPU alone, and I have two worker jobs running on 2 separate GPUs. The resident memory footprint of both worker jobs keeps gradually increasing until, at around 3000 steps, the chief worker gets killed by OOM (both workers occupy ~49% of RAM before the crash). I have tried with a single worker too, and that one gets killed as well. The ps job has a much smaller footprint.
I have tried disabling the summary ops, the model saver, and the variables averager, and reducing the reader threads, but to no avail.
I fixed this issue with the workers by commenting out the with tf.device('/cpu:0'): spec around the call to batch_inputs in image_processing.py. One reason this may have happened with my setup, though it is not completely clear, is that I use
with tf.device(tf.train.replica_device_setter(ps_tasks=1,
        worker_device="/job:worker/task:%d" % FLAGS.task_id,
        cluster=cluster_spec)):
instead of
# Ops are assigned to worker by default.
with tf.device('/job:worker/task:%d' % FLAGS.task_id):
    # Variables and its related init/assign ops are assigned to ps.
    with slim.scopes.arg_scope(
            [slim.variables.variable, slim.variables.global_step],
            device=slim.variables.VariableDeviceChooser(num_parameter_servers)):
as the outermost training scope inside which the batch processing gets called (inception_distributed_train.py).
I am not sure exactly why this became a problem for my modified setup (owing to the lack of documentation about the mechanics of how device assignments are made), but the memory growth is now at least ten times slower, and the setup has been tested to run for 100 epochs.
Maybe the original code will work fine without this CPU device specification as well.
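For context, a minimal TF 1.x sketch (the cluster addresses, shapes, and task id are made up) of what replica_device_setter does: variables and their init/assign ops are pinned to the ps task, while the remaining ops stay on the worker device named by the setter.
import tensorflow as tf

cluster_spec = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
task_id = 0

with tf.device(tf.train.replica_device_setter(
        ps_tasks=1,
        worker_device="/job:worker/task:%d" % task_id,
        cluster=cluster_spec)):
    weights = tf.get_variable("w", shape=[1000, 10])  # placed on /job:ps/task:0
    inputs = tf.random_normal([32, 1000])             # placed on the worker
    logits = tf.matmul(inputs, weights)               # placed on the worker

print(weights.device)  # /job:ps/task:0
print(logits.device)   # /job:worker/task:0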