Is there a way to write TensorFlow checkpoints asynchronously? - tensorflow

Currently I make checkpoints during training like this (pseudocode):
    while training:
        model.train()
        if it_is_time_for_validation():
            metrics = model.validate()
            if metrics.are_good():
                saver = tf.train.Saver()
                res = saver.save(sess=session, save_path=checkpoint_file_path)
The Saver.save method blocks on I/O, preventing the next iterations from running.
My model's weights are hundreds of megabytes in size, and writing them takes a while.
By my calculations, depending on checkpoint frequency, the GPU spends 5-10% of its time overall waiting for checkpoints to finish instead of doing useful calculations. (5-10% is equivalent to a day of calculations.)
Is there a way to perform checkpoints asynchronously to reduce the waste of computational time?
Implementation sketch: first we might copy everything necessary from device memory to the host, then perform disk I/O on a separate thread. Saver.save would return after the memory copy, without waiting for disk operations, since it is then safe to keep training on the device copy without corrupting the checkpoint. Saver.save would still block on re-entry if I/O is pending from the previous iteration.
I don't think it's currently implemented, so I am interested in possible workarounds as well. Is this idea nice enough to be a feature request on GitHub?

You can write checkpoints asynchronously by running saver.save() in a separate thread. The (internal) SVTimerCheckpointThread is an example of code that runs saver.save() periodically in the background of training. Note that the tf.train.Supervisor is a utility class that helps with managing such background threads (also for writing TensorBoard summary logs, etc.), so you might want to use it instead.
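The "snapshot first, write in a background thread" idea from the question can be sketched in plain Python. This is a minimal illustration, not TensorFlow code: AsyncCheckpointer, its methods and the pickle file format are all hypothetical names and choices made for the example, and the in-memory deep copy stands in for the device-to-host copy.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical helper: snapshot state in memory, write it to disk in a thread."""

    def __init__(self):
        self._pending = None  # background writer thread, if any

    def save(self, state, path):
        # Block on re-entry if the previous write is still in flight.
        self.wait()
        # Snapshot first so the caller can keep mutating `state` safely.
        snapshot = copy.deepcopy(state)
        self._pending = threading.Thread(
            target=self._write, args=(snapshot, path), daemon=True)
        self._pending.start()

    def wait(self):
        # Join the outstanding write, if there is one.
        if self._pending is not None:
            self._pending.join()
            self._pending = None

    @staticmethod
    def _write(snapshot, path):
        # Write to a temporary file and rename it, so a reader never
        # sees a partially written checkpoint.
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp_path, path)


def demo():
    weights = {"w": [1.0, 2.0, 3.0]}
    path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    ckpt = AsyncCheckpointer()
    ckpt.save(weights, path)  # returns once the snapshot is taken
    weights["w"][0] = 99.0    # "training" continues and mutates the weights
    ckpt.wait()               # call before exiting so the last write finishes
    with open(path, "rb") as f:
        return pickle.load(f)
```

With tf.train.Saver, the thread target would be saver.save itself. Whether variables can be updated safely while that call is running is not something this sketch addresses.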

Related

Should the amount of resource allocations be "per swap chain image"?

I just learned about uniform buffers (https://vulkan-tutorial.com/Uniform_buffers/Descriptor_layout_and_buffer) and am a bit confused about the size of uniformBuffers and uniformBuffersMemory. In the tutorial it is said that:
We should have multiple buffers, because multiple frames may be in flight at the same time and we don't want to update the buffer in preparation of the next frame while a previous one is still reading from it! We could either have a uniform buffer per frame or per swap chain image.
As far as I understand, the "per swap chain image" approach is more optimal. Please correct me if I am wrong. But why does the array need swapChainImages.size() elements? Isn't MAX_FRAMES_IN_FLIGHT enough, because we have fences? As a simple example, if we have just a single frame in flight and do vkDeviceWaitIdle after each presentation, then our single uniform buffer will always be available and not in use by the CPU/GPU, so we don't need an array of them.
do vkDeviceWaitIdle
OK, stop right there. There is basically only one valid reason to call that function: you need to delete every resource created by that device, because you're about to destroy the device, so you wait until all such resources are no longer being used.
Yes, if you halt the CPU's execution until the GPU stops doing stuff, then you're guaranteed that CPU writes to GPU memory will not interact with GPU reads from that memory. But you purchased this guarantee by ensuring that there will be no overlap at all between CPU execution and GPU execution. The CPU sets up some stuff, sends it to the GPU, then waits till the GPU is done, and the CPU starts up again. Everything executes perfectly synchronously. While the CPU is doing work, the GPU is doing nothing. And vice-versa.
This is not a recipe for performance. If you're going to use a graphics API designed to achieve lots of CPU/GPU overlap, you shouldn't throw that away because it's easier to work with.
Get used to multi-buffering any resources that you modify from the CPU on a regular basis. How many buffers you want to use is your choice, one that should be informed by the present mode and the like.
My question is "Do I need n buffers or m is enough?".
The situation you're describing ultimately only happens if your code wanted to have X frames in flight, but the presentation engine requires you to use a minimum of Y swap-chain images, and X < Y. So the question you're asking can be boiled down to, "if I wanted to do double-buffering, but the implementation forces 3 buffers on me, is it OK if I treat it as double-buffering?"
Yes, as long as you're not relying on the vkAcquireNextImage call to block the CPU for your synchronization. But you shouldn't be relying on that anyway, since the call itself doesn't constitute a proper barrier as far as the Vulkan execution model is concerned. You should instead block the CPU on fences tied to the actual work, not on the acquire process.
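The "fewer frames in flight than swap-chain images" case can be illustrated with a toy Python model. This is not Vulkan code: threads stand in for GPU work, threading.Event stands in for a fence, and FrameSlot, render and the modulo-based image index are hypothetical simplifications. It only shows that per-frame resources indexed by frame % MAX_FRAMES_IN_FLIGHT are safe to reuse as long as the CPU blocks on the fence tied to the work that last used them.

```python
import threading

MAX_FRAMES_IN_FLIGHT = 2  # X: frames the application keeps in flight
SWAPCHAIN_IMAGES = 3      # Y: images the presentation engine requires


class FrameSlot:
    def __init__(self):
        self.fence = threading.Event()
        self.fence.set()      # start signaled, so the first wait does not block
        self.uniform = None   # stand-in for the uniform buffer contents


def render(num_frames):
    slots = [FrameSlot() for _ in range(MAX_FRAMES_IN_FLIGHT)]
    log, workers = [], []

    def gpu_work(frame, slot_index, image_index):
        slot = slots[slot_index]
        seen = slot.uniform   # the "GPU" reads the uniform buffer
        log.append((frame, slot_index, image_index, seen))
        slot.fence.set()      # work finished: signal the fence

    for frame in range(num_frames):
        slot_index = frame % MAX_FRAMES_IN_FLIGHT
        image_index = frame % SWAPCHAIN_IMAGES  # stand-in for the acquired image
        slot = slots[slot_index]
        slot.fence.wait()     # block the CPU on the fence tied to the work
        slot.fence.clear()
        slot.uniform = frame  # now safe to overwrite this slot's buffer
        worker = threading.Thread(
            target=gpu_work, args=(frame, slot_index, image_index))
        worker.start()
        workers.append(worker)

    for worker in workers:
        worker.join()
    return sorted(log)
```

Every entry returned by render() records the frame number and the value the simulated GPU read from the buffer. The two always match, even though only two buffers serve three images.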

How to switch CPU models in gem5 after restoring a checkpoint and then observe the difference?

I want to boot the Linux kernel in full system (FS) mode with a lightweight CPU to save time, make a checkpoint after boot finishes, and then restore the checkpoint with a more detailed CPU to study a benchmark, as mentioned at: http://gem5.org/Checkpoints
However, when I try to use -r 1 --restore-with-cpu=, I cannot observe cycle differences between the new and old CPU.
The measure I'm looking at is how cache sizes affect the number of cycles that a benchmark takes to run.
The setup I'm using is described in detail at: Why doesn't the Linux kernel see the cache sizes in the gem5 emulator in full system mode? I'm looking at the cycle counts because I can't see cache sizes directly with the Linux kernel currently.
For example, if I boot the Linux kernel from scratch with the detailed and slow HPI model with command (excerpt):
./build/ARM/gem5.opt --cpu-type=HPI --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024
and then change cache sizes, the benchmark does get faster as the cache sizes get larger, as expected.
However, if I first boot without --cpu-type=HPI, which uses the faster AtomicSimpleCPU model:
./build/ARM/gem5.opt --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024
and then I create the checkpoint with m5 checkpoint and try to restore it with the more detailed HPI CPU:
./build/ARM/gem5.opt --restore-with-cpu=HPI -r 1 --caches --l1d_size=1024 --l1i_size=1024 --l2cache --l2_size=1024 --l3_size=1024
then changing the cache sizes makes no difference: I always get the same cycle counts as I do for the AtomicSimpleCPU, indicating that the modified restore was not successful.
The same happens on x86 if I try to switch from AtomicSimpleCPU to DerivO3CPU.
Related old thread on the mailing list: http://thread.gmane.org/gmane.comp.emulators.m5.users/14395
Tested at: fbe63074e3a8128bdbe1a5e8f6509c565a3abbd4
From reading through some of the code, I believe that --restore-with-cpu is specifically for the case when your checkpoint was created using a CPU model that isn't the AtomicCPU. The scripts assume that AtomicCPU was used to create the checkpoint. I think when restoring it's important to use the same CPU model the system was checkpointed with; if you give another model with --cpu-type, it switches to that model after the restore operation has completed.
http://gem5.org/Checkpoints#Sampling has some (small) detail on switching cpu models
First, for your question, I don't see how the cycle count is an indication of the restoration result. The cycle count being restored should be the same regardless of which CPU you switch to. Switching does not change past cycles. When creating a checkpoint, you basically freeze the simulation at that state, and switching the CPU simply changes all the parameters of the CPU while keeping the ticks unchanged. It is like hot-swapping a CPU.
To correctly verify the restoration, you should keep a copy of config.json from before the restoration and compare it with the new one after restoration. In the x86 case, I could find the string AtomicSimpleCPU there only before the restore.
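The config.json comparison can be automated with a few lines of Python. This is a rough sketch that makes no assumption about gem5's exact JSON layout: it simply walks the whole tree and collects every string value ending in "CPU". The function names and the two file paths in the usage note are hypothetical.

```python
import json


def find_cpu_strings(node, found=None):
    """Collect every string value in a parsed JSON tree that ends with 'CPU'."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for value in node.values():
            find_cpu_strings(value, found)
    elif isinstance(node, list):
        for value in node:
            find_cpu_strings(value, found)
    elif isinstance(node, str) and node.endswith("CPU"):
        found.add(node)
    return found


def cpu_models_in_config(path):
    # Load one config.json and report the CPU-like strings it mentions.
    with open(path) as f:
        return find_cpu_strings(json.load(f))
```

Calling cpu_models_in_config("before/config.json") and cpu_models_in_config("after/config.json") and comparing the two sets would then show whether AtomicSimpleCPU disappeared after the restore.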
Furthermore, only --cpu-type will determine the CPU being switched. But it does not make --restore-with-cpu useless. In fact, --restore-with-cpu should only be used when you boot up the system with a CPU other than AtomicSimpleCPU. Most people want to boot up the system with AtomicSimpleCPU and make a checkpoint since it is faster. But if you mistakenly boot up using DerivO3CPU, to restore this particular checkpoint, you have to configure --restore-with-cpu to DerivO3CPU. Otherwise, it will fail.
--cpu-type= affected the restore, but --restore-with-cpu= did not
I am not sure why that is, but I have empirically verified that if I do:
-r 1 --cpu-type=HPI
then, as expected, the cache size options start to affect cycle counts: larger caches lead to fewer cycles.
Also keep in mind that caches don't affect AtomicSimpleCPU much, and there is not much point in having them.
TODO so what is the point of --restore-with-cpu= vs --cpu-type if it didn't seem to do anything on my tests?
Except confuse me, since if --cpu-type != --restore-with-cpu, then the cycle count appears under system.switch_cpus.numCycles instead of system.cpu.numCycles.
I believe this is what is going on (yet untested):
switch_cpu contains stats for the CPU you switched to
when you set --restore-with-cpu= != --cpu-type, it thinks you have already switched CPUs from the start
--restore-with-cpu has no effect on the initial CPU. It only matters for options that switch the CPU during the run itself, e.g. --fast-forward and --repeat_switch. This is where you will see both cpu and switch_cpu data get filled up.
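To see quickly which of system.cpu.numCycles and system.switch_cpus.numCycles got filled in, the stats file can be scanned with a small Python helper. This sketch assumes the usual "name value # description" line layout of stats.txt. read_num_cycles is a hypothetical name, and if the file contains several stat dumps, later values overwrite earlier ones.

```python
def read_num_cycles(stats_text):
    """Return {stat_name: value} for every stat whose name ends in '.numCycles'."""
    cycles = {}
    for line in stats_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].endswith(".numCycles"):
            try:
                cycles[fields[0]] = int(fields[1])
            except ValueError:
                # Skip lines whose value column is not a plain integer.
                pass
    return cycles
```

For example, read_num_cycles(open("m5out/stats.txt").read()) would return a dictionary keyed by the full stat names, where m5out/stats.txt is assumed to be the output location of the run.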
TODO: also, if I use or remove --restore-with-cpu=, there is a small 1% cycle difference. But why is there a difference at all? AtomicSimpleCPU cycle count is completely different, so it must not be that it is falling back to it.
--cpu-type= vs --restore-with-cpu= showed up in fs.py --fast-forward: https://www.mail-archive.com/gem5-users@gem5.org/msg17418.html
Confirm what is happening with logging
One good sanity check that the CPU you want is being used is to enable some logging, as shown at: https://github.com/cirosantilli/linux-kernel-module-cheat/tree/bab029f60656913b5dea629a220ae593cc16147d#gem5-restore-checkpoint-with-a-different-cpu e.g.:
--debug-flags ExecAll,FmtFlag,O3CPU,SimpleCPU
and then see if you start to get O3 messages rather than SimpleCPU ones.

Why should preprocessing be done on CPU rather than GPU?

The performance guide advises doing the preprocessing on the CPU rather than on the GPU. The listed reasons are:
This prevents the data from going from CPU to GPU to CPU and back to GPU again.
This frees the GPU of these tasks to focus on training.
I am not sure I understand either argument.
Why would preprocessing send the result back to the CPU, esp. if all nodes are on GPU? Why preprocessing operations and not any other operation on the graph, why are they/should they be special?
Even though I understand the rationale behind putting the CPU to work rather than keeping it idle, compared to the huge convolutions and gradient backpropagation a training step has to do, I would have assumed that random cropping, flipping and other standard preprocessing steps on the input images would be nowhere near as demanding computationally, and would execute in a fraction of the time. Even if we think of preprocessing as mostly moving things around (crops, flips), I think GPU memory should be faster for that. Yet doing preprocessing on the CPU can yield a 6+-fold increase in throughput according to the same guide.
I am assuming of course that preprocessing does not result in a drastic decrease in size of the data (e.g. supersampling or cropping to a much smaller size), in which case the gain in transfer time to the device is obvious. I suppose these are rather extreme cases and do not constitute the basis for the above recommendation.
Can somebody make sense out of this?
It comes down to how CPUs and GPUs work. A GPU is good at repetitive, parallelised tasks, whereas a CPU is good at other computations that require more general processing capabilities.
For example, consider a program that accepts two integers from the user and runs a for-loop 1 million times to sum the two numbers.
How can we achieve this with a combination of CPU and GPU processing?
We handle the initial input (the two integers from the user) on the CPU, then send the two numbers to the GPU, where the for-loop that sums them runs, because that is the repetitive, parallelizable yet simple part of the computation that the GPU is better at. [This example is not directly related to TensorFlow, but the concept is at the heart of all CPU and GPU processing. Regarding your query: operations such as random cropping, flipping and other standard preprocessing steps on the input images might not be computationally intensive, but the GPU doesn't excel at that kind of interrupt-related computation either.]
Another thing to keep in mind is that the latency between CPU and GPU also plays a key role here. Copying and transferring data back and forth between CPU and GPU is expensive compared to transferring data between different cache levels inside the CPU.
As Dey (2014) [1] mentions:
When a parallelized program is computed on the GPGPU, first the data
is copied from the memory to the GPU and after computation the data is
written back to the memory from the GPU using the PCI-e bus (Refer to
Fig. 1.8). Thus for every computation, data has to be copied to and
fro device-host-memory. Although the computation is very fast in
GPGPU, but because of the gap between the device-host-memory due to
communication via PCI-e, the bottleneck in the performance is
generated.
For this reason it is advisable that:
You do the preprocessing on CPU, where the CPU does the initial
computation, prepares and sends the rest of the repetitive
parallelised tasks to the GPU for further processing.
I once developed a buffer mechanism to speed up data processing between CPU and GPU, and hence reduce the negative effects of the latency between them. Have a look at this thesis to gain a better understanding of this issue:
EFFICIENT DATA INPUT/OUTPUT (I/O) FOR FINITE DIFFERENCE TIME DOMAIN (FDTD) COMPUTATION ON GRAPHICS PROCESSING UNIT (GPU)
Now, to answer your question:
Why would preprocessing send the result back to the CPU, esp. if all nodes are on GPU?
As quoted from the performance guide of Tensorflow [2],
When preprocessing occurs on the GPU the flow of data is CPU -> GPU
(preprocessing) -> CPU -> GPU (training). The data is bounced back and
forth between the CPU and GPU.
If you remember the dataflow between CPU, memory and GPU mentioned above, doing the preprocessing on the CPU improves performance because:
After nodes are computed on the GPU, the data is sent back to main memory, and the CPU fetches it from there for further processing. The GPU does not have enough on-board memory to keep all the data on it for computational purposes, so moving data back and forth is inevitable. To optimise this data flow, you do the preprocessing on the CPU; the data prepared for the parallelizable training tasks is then sent to memory, and the GPU fetches that preprocessed data and works on it.
The performance guide itself also mentions that by doing this, and by having an efficient input pipeline, you won't starve the CPU, the GPU, or both, which supports the logic above. In the same performance doc, you will also see this:
If your training loop runs faster when using SSDs vs HDDs for storing
your input data, you could be I/O bottlenecked. If this is the
case, you should pre-process your input data, creating a few large
TFRecord files.
This again tries to mention the same CPU-Memory-GPU performance bottleneck, which is mentioned above.
Hope this helps and in case you need more clarification (on CPU-GPU performance), do not hesitate to drop a message!
References:
[1] Somdip Dey, EFFICIENT DATA INPUT/OUTPUT (I/O) FOR FINITE DIFFERENCE TIME DOMAIN (FDTD) COMPUTATION ON GRAPHICS PROCESSING UNIT (GPU), 2014
[2] Tensorflow Performance Guide: https://www.tensorflow.org/performance/performance_guide
I first quote the two arguments from the performance guide; I think your two questions concern these two arguments respectively.
The data is bounced back and forth between the CPU and GPU. ...
Another benefit is preprocessing on the CPU frees GPU time to focus on training.
(1) Operations like file readers, enqueue and dequeue can only be performed on the CPU; operations like reshape, cast and per_image_standardization can run on either the CPU or the GPU. So a wild guess for your first question: if the code doesn't specify /cpu:0, the program will run the readers on the CPU, then pre-process images on the GPU, and finally enqueue and dequeue on the CPU. (I am not sure this is correct; waiting for an expert to verify.)
(2) For the second question, I have the same doubt. When you train a large network, most of the time is spent on the huge convolutions and the gradient computation, not on preprocessing images. However, when they cite a 6X+ increase in samples/sec processed, I think they mean training on MNIST, where a small network is usually used. That would make sense: smaller convolutions take much less time, so the time spent on preprocessing is relatively large, and a 6X+ increase is possible in that case. In any case, "preprocessing on the CPU frees GPU time to focus on training" is a reasonable explanation.
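The small-network argument in (2) can be put into numbers with a back-of-the-envelope calculation. The millisecond figures below are invented purely for illustration, and the model assumes the best case where CPU preprocessing is completely hidden behind GPU training. max_speedup is a hypothetical helper, not something from the performance guide.

```python
def max_speedup(preprocess_ms, train_ms):
    """Upper bound on the speedup if preprocessing is fully hidden behind training.

    Before: each step costs preprocess_ms + train_ms on the same device.
    After:  preprocessing overlaps with training, so a step costs train_ms.
    """
    return (preprocess_ms + train_ms) / train_ms


# Made-up timings: a small network where preprocessing dominates the step...
small_network = max_speedup(preprocess_ms=5.0, train_ms=1.0)
# ...and a large network where the same preprocessing is a minor cost.
large_network = max_speedup(preprocess_ms=5.0, train_ms=100.0)
```

With these assumed timings the bound is 6x for the small network and about 1.05x for the large one, which is consistent with the guess that the 6X+ figure comes from a small model.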
Hope this could help you.

Tensorflow imprecise timeouts

I've been testing out the timeout functionality for sess.run calls (applied to a convolutional neural network), and it seems like the timeouts aren't very precise.
For example, if I set the timeout to be 800 ms, there might be a 1-2 second delay before the timeout exception is triggered. This sort of leads me to believe that cancellation notifications aren't caught between computational nodes. (Which according to the timeline are .2-.5 s each)
So
1) Is there a way to make the timeouts more precise?
2) Are Tensorflow cancellation notifications caught between node computations?
The cancellation and timeout mechanism in TensorFlow was only designed to cancel a small number of blocking operations, in particular: dequeuing from an empty queue, enqueuing to a full queue, and reading from a file.
If you run a graph containing non-blocking operations, such as tf.matmul() and tf.nn.conv2d(), and the timeout expires, TensorFlow will typically wait until these operations have completed before returning with a "deadline exceeded" error.
Why is this the case? We added cancellation because users started to build pipelines of blocking operations into their graphs (e.g. for reading data) and some form of cancellation was needed to shut down these pipelines cleanly. Timeouts also help to debug deadlocks that can unfortunately occur in these pipelines. By contrast, TensorFlow is designed to dispatch non-blocking operations as efficiently as possible: for example, when running on a GPU, TensorFlow will asynchronously enqueue multiple operations on the GPU compute stream without blocking on their completion. Although it would technically be possible to check for cancellation between the execution of each operation, this would add latency to operation dispatch, and reduce overall performance in the common case.
However, if timeouts/cancellation for non-blocking operations would be useful for your use case, please feel free to open a GitHub issue as a feature request!
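The behaviour described in this answer can be mimicked with a toy model in plain Python. This is not TensorFlow's implementation: run_step, DeadlineExceeded and the op tuples are hypothetical, time.sleep stands in for a non-blocking compute op, and a queue.Queue.get call stands in for a blocking dequeue. The point is only that when the deadline is checked solely inside blocking ops, the error arrives after the compute ops have finished, not when the timeout expires.

```python
import queue
import time


class DeadlineExceeded(Exception):
    """Raised by the toy runner when a blocking op notices the deadline."""


def run_step(ops, timeout_s):
    start = time.monotonic()
    deadline = start + timeout_s
    for kind, arg in ops:
        if kind == "compute":
            # Non-blocking op (think matmul or conv2d): runs to completion,
            # and nothing checks the deadline while it executes.
            time.sleep(arg)
        elif kind == "dequeue":
            # Blocking op: waits only for whatever is left of the deadline.
            remaining = max(0.0, deadline - time.monotonic())
            try:
                arg.get(timeout=remaining)
            except queue.Empty:
                elapsed = time.monotonic() - start
                raise DeadlineExceeded(elapsed) from None
    return time.monotonic() - start


def measure_overshoot():
    # Two 50 ms compute ops followed by a dequeue from an empty queue,
    # run with a 20 ms timeout.
    ops = [("compute", 0.05), ("compute", 0.05), ("dequeue", queue.Queue())]
    try:
        run_step(ops, timeout_s=0.02)
    except DeadlineExceeded as exc:
        return exc.args[0]  # seconds actually elapsed before the error
    return None
```

In this model measure_overshoot() reports roughly 0.1 s of elapsed time for a 0.02 s timeout, the same kind of overshoot as the 1-2 second delay seen with an 800 ms timeout in the question.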

Distributed Tensorflow: worker killed due to OOM

I'm running distributed tensorflow training similar to the Inception sample code but using this device setter:
    with tf.device(tf.train.replica_device_setter(ps_tasks=1,
                                                  worker_device="/job:worker/task:%d" % FLAGS.task_id,
                                                  cluster=cluster_spec)):
The machine has 4 GPUs and 64 GB RAM. The ps job runs on CPU alone, and I have two worker jobs running on 2 separate GPUs. The resident (RES) memory footprint of both worker jobs keeps increasing gradually until, at around 3000 steps, the chief worker gets killed by OOM (both workers are occupying ~49% of RAM before the crash). I have tried with a single worker too, and that one gets killed as well. The ps job has a much smaller footprint.
I have tried disabling summary ops, model saver, variables averager, reduced reader threads, but to no avail.
I fixed this issue with the workers by commenting out the with tf.device('/cpu:0'): spec around the call to batch_inputs in image_processing.py. One reason this may have happened with my setup, though it is not completely clear, is that I use
    with tf.device(tf.train.replica_device_setter(ps_tasks=1,
                                                  worker_device="/job:worker/task:%d" % FLAGS.task_id,
                                                  cluster=cluster_spec)):
instead of
    # Ops are assigned to worker by default.
    with tf.device('/job:worker/task:%d' % FLAGS.task_id):
        # Variables and its related init/assign ops are assigned to ps.
        with slim.scopes.arg_scope(
                [slim.variables.variable, slim.variables.global_step],
                device=slim.variables.VariableDeviceChooser(num_parameter_servers)):
as the outermost training scope inside which the batch processing gets called (inception_distributed_train.py).
I am not sure exactly why this became a problem for my modified setup (there is no documentation about the mechanics of how device assignments are made), but the memory growth has now dropped at least ten-fold, and the job has been tested to run for 100 epochs.
Maybe the original code will work fine without this CPU device specification as well.