TensorFlow imprecise timeouts

I've been testing out the timeout functionality for sess.run calls (applied to a convolutional neural network), and it seems like the timeouts aren't very precise.
For example, if I set the timeout to 800 ms, there might be a 1-2 second delay before the timeout exception is triggered. This leads me to believe that cancellation notifications aren't caught between computational nodes (which, according to the timeline, take 0.2-0.5 s each).
So
1) Is there a way to make the timeouts more precise?
2) Are TensorFlow cancellation notifications caught between node computations?

The cancellation and timeout mechanism in TensorFlow was only designed to cancel a small number of blocking operations, in particular: dequeuing from an empty queue, enqueuing to a full queue, and reading from a file.
If you run a graph containing non-blocking operations, such as tf.matmul() and tf.nn.conv2d(), and the timeout expires, TensorFlow will typically wait until these operations have completed before returning with a "deadline exceeded" error.
Why is this the case? We added cancellation because users started to build pipelines of blocking operations into their graphs (e.g. for reading data) and some form of cancellation was needed to shut down these pipelines cleanly. Timeouts also help to debug deadlocks that can unfortunately occur in these pipelines. By contrast, TensorFlow is designed to dispatch non-blocking operations as efficiently as possible: for example, when running on a GPU, TensorFlow will asynchronously enqueue multiple operations on the GPU compute stream without blocking on their completion. Although it would technically be possible to check for cancellation between the execution of each operation, this would add latency to operation dispatch, and reduce overall performance in the common case.
However, if timeouts/cancellation for non-blocking operations would be useful for your use case, please feel free to open a GitHub issue as a feature request!
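For reference, the per-call timeout is typically set and caught from Python roughly like this (a minimal sketch; the 800 ms value mirrors the question and train_op stands in for whatever you are fetching):

import tensorflow as tf

run_options = tf.RunOptions(timeout_in_ms=800)  # per-call deadline
with tf.Session() as sess:
    try:
        sess.run(train_op, options=run_options)  # train_op is a placeholder for your own fetch
    except tf.errors.DeadlineExceededError:
        # Raised once the deadline has passed, but only after any in-flight
        # non-blocking ops (matmul, conv2d, ...) have finished, hence the extra delay.
        print("sess.run timed out")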

Related

Can I use Dispatchers.Default for CPU bound/intensive operations?

I wanted to understand best practices/approaches for performing blocking CPU operations in production-grade server-side applications. (Note that I am talking about server-side applications, not Android apps.)
What do I mean by a blocking CPU operation?
An operation that runs on and consumes the CPU, for example a huge matrix multiplication, a huge data transformation done in a while loop, etc.
My Setup
We are building a Kotlin DSL powered by coroutines
We have created our own CoroutineScope backed by a single thread for non-blocking operations, so that mutating variables from callbacks is thread safe because they all run on the same thread
We recommend that users use withContext(Dispatchers.IO) for all blocking IO operations
Open Questions
We have a fixed strategy for non-blocking ops and thread-safe mutations, as well as for IO ops, as mentioned above
We are exploring options for performing blocking CPU-bound operations as mentioned at the beginning
Some of the Solutions
Use Dispatchers.Default for blocking CPU operations
Does anyone foresee any issues with using the Default dispatcher?
What if other parts of the code/transitive libraries are also using the Default dispatcher?
Create a separate CoroutineScope with a fixed number of threads (usually equal to the number of CPU cores) and ask our users to use that isolated scope for running any blocking CPU-bound operations
Which approach should we go with? What are the pros and cons? Is there another good alternative approach?
Side Note
Coming from Scala, it is usually not preferred to use the global execution context, and Dispatchers.Default somehow maps to the global context

Is an Empty VkCommandBuffer executed when submitted to a queue?

Hey guys, I wonder, if we submit a VkSubmitInfo containing one empty VkCommandBuffer to the queue, whether it will be executed or ignored. I mean, will the semaphores in VkSubmitInfo::pWaitSemaphores and VkSubmitInfo::pSignalSemaphores still be considered when submitting an empty VkCommandBuffer?
It may look like a stupid question, but what I want is to "multiply" the single semaphore that comes out of vkAcquireNextImageKHR.
I mean, I want to submit an empty command buffer with VkSubmitInfo::pWaitSemaphores pointing to "acquire_semaphore", and VkSubmitInfo::pSignalSemaphores containing as many semaphores as I need.
whether it will be executed or ignored.
What would be the difference? If there are no commands in the command buffer, then executing it will do nothing.
I mean, I want to submit an empty command buffer with VkSubmitInfo::pWaitSemaphores pointing to "acquire_semaphore", and VkSubmitInfo::pSignalSemaphores containing as many semaphores as I need.
This has nothing to do with the execution of the CB itself. The behavior of a batch doesn't change just because the CB doesn't do anything.
However, unless you have multiple queues waiting on the completion of this queue's operations, there's really no reason to have multiple destination semaphores. The batch containing the real work could just wait on the pWaitSemaphores.
Also, there's no reason to have empty batches that only wait on a single semaphore. Let's say you have batch Q, which signals the semaphore that this empty batch waits on. Well, there's no reason that batch Q's pSignalSemaphores couldn't signal the semaphores that you want the empty batch to signal. After all, a vkQueueSubmit semaphore wait operation has, as its destination command scope, all subsequent commands for that queue, whether they come from the current vkQueueSubmit call or subsequent ones.
So you would only need an empty batch if you have to wait on multiple semaphores that are signaled by different batches on different queues. And such a complex dependency layout strongly suggests an over-complicated dependency design that will lead to reduced performance.
Even waiting on acquire makes no sense for this. You only need to wait on acquire if that queue is going to manipulate the acquired image. Well, you can't manipulate an image from multiple queues simultaneously. So there's no point in signaling a bunch of semaphores when acquire completes; that's why acquire only takes one.
So I want to simulate a Fence only with semaphores and see what goes faster.
This suggests strongly that you're thinking about things incorrectly.
You use a fence when you want the CPU to detect the completion of a GPU operation. For vkAcquireNextImageKHR, you would use a fence if you need the CPU to know when the image has been acquired.
Semaphores are about the GPU detecting when a GPU operation has completed, regardless of whether the operation comes from a queue or not. So if the GPU needs to wait until an image is acquired, you use a semaphore.
It doesn't matter which is faster because they do different things.

TensorFlow Execution on a single (multi-core) CPU Device

I have some questions regarding the execution model of TensorFlow in the specific case in which there is only a CPU device and the network is used only for inference, for instance using the Image Recognition C++ example (https://www.tensorflow.org/tutorials/image_recognition) on a multi-core platform.
In the following, I will try to summarize what I understood, while asking some questions.
Session->Run() (file direct_session.cc) calls ExecutorState::RunAsync, which initializes the TensorFlow ready queue with the root nodes.
Then, the instruction
runner_([=]() { Process(tagged_node, scheduled_usec); }); (executor.cc, function ScheduleReady, line 2088)
assigns the node (and hence the related operation) to a thread of the inter_op pool.
However, I do not fully understand how it works.
For instance, in the case in which ScheduleReady is trying to assign more operations than the size of the inter_op pool, how are operations enqueued? (FIFO order?)
Does each thread of the pool have its own queue of operations, or is there a single shared queue?
Where can I find this in the code?
Where can I find the body of each thread of the pools?
Another question regards the nodes managed by inline_ready. How does the execution of these (inexpensive or dead) nodes differ from that of the other nodes?
Then, (still, to my understanding) the execution flow continues from ExecutorState::Process, which executes the operation, distinguishing between synchronous and asynchronous operations.
How do synchronous and asynchronous operations differ in terms of execution?
When the operation is executed, PropagateOutputs (which calls ActivateNodes) adds to the ready queue every successor node that has become ready thanks to the execution of the current node (its predecessor).
Finally, NodeDone() calls ScheduleReady(), which processes the nodes currently in the TensorFlow ready queue.
Conversely, how the intra_op thread pool is managed depends on the specific kernel, right? Is it possible for a kernel to request more operations than the intra_op thread pool size?
If so, in what order are they enqueued? (FIFO?)
Once operations are assigned to threads of the pool, is their scheduling left to the underlying operating system, or does TensorFlow enforce some kind of scheduling policy?
I'm asking here because I found almost nothing about this part of the execution model in the documentation; if I missed some documents, please point me to them.
Re ThreadPool: When TensorFlow uses DirectSession (as it does in your case), it uses Eigen's ThreadPool. I could not get a web link to the official version of Eigen used in TensorFlow, but here is a link to the thread pool code. This thread pool uses the RunQueue queue implementation; there is one queue per thread.
Re inline_ready: ExecutorState::Process is scheduled on some Eigen thread. When it runs, it executes some nodes. As these nodes finish, they make other nodes (TensorFlow operations) ready. Some of these nodes are not expensive; they are added to inline_ready and executed in the same thread, without yielding. Other nodes are expensive and are not executed "immediately" in the same thread; their execution is scheduled through the Eigen thread pool.
Re sync/async kernels: TensorFlow operations can be backed by synchronous kernels (most CPU kernels) or asynchronous kernels (most GPU kernels). Synchronous kernels are executed in the thread running Process. Asynchronous kernels are dispatched to their device (usually the GPU) to be executed; when they are done, they invoke the NodeDone method.
Re Intra Op ThreadPool: The intra op thread pool is made available to kernels to run their computation in parallel. Most CPU kernels don't use it (and GPU kernels just dispatch to GPU) and run synchronously in the thread that called the Compute method. Depending on configuration there is either one intra op thread pool shared by all devices (CPUs), or each device has its own. Kernels simply schedule their work on this thread pool. Here is an example of one such kernel. If there are more tasks than threads, they are scheduled and executed in unspecified order. Here is the ThreadPool interface exposed to kernels.
I don't know of any way TensorFlow influences the scheduling of OS threads. You can ask it to do some spinning (i.e. not immediately yield the thread to the OS) to minimize latency from OS scheduling, but that is about it.
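For completeness, the sizes of the two pools the question asks about are set from the public Python API; a minimal sketch (the thread counts here are arbitrary):

import tensorflow as tf

config = tf.ConfigProto(
    inter_op_parallelism_threads=2,  # pool that ScheduleReady/Process dispatch nodes to
    intra_op_parallelism_threads=4)  # pool a kernel may use to parallelize a single op
sess = tf.Session(config=config)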
These internal details are not documented on purpose, as they are subject to change. If you are using TensorFlow through the Python API, all you should need to know is that your ops will execute when their inputs are ready. If you want to enforce some order beyond this, you should use:
with tf.control_dependencies(<tensors_that_you_want_computed_before_the_ops_inside_this_block>):
    tf.foo_bar(...)
If you are writing a custom CPU kernel and want to do parallelism inside it (rarely needed, and usually only for very expensive kernels), the thread pool interface linked above is what you can rely on.

What are the performance problems with "in-graph replication" in Tensorflow?

I've heard there are performance problems with "in-graph replication". What are they? Why do they exist?
Within-graph replication uses a single client process to drive execution, so this can become a bottleneck during replicated training.
Suppose you do model parallelism with 50 independent workers using a single client.
This client would need to construct the computation graph for all replicas. So if you have 50 replicas, this client has to handle a 50x larger graph.
This client would need to drive session run calls for all replicas. If you are using the Python client, you could have 50 threads in your client issuing session.run calls concurrently, but Python has performance issues with threads (the GIL and thread scheduling issues like here).
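For concreteness, here is a rough sketch of what the single client ends up building under in-graph replication (loss_fn and inputs are hypothetical stand-ins for your own model code):

import tensorflow as tf

losses = []
for i in range(50):
    # One copy of the model per worker device, all inside a single graph
    # owned by this one client process.
    with tf.device("/job:worker/task:%d" % i):
        losses.append(loss_fn(inputs[i]))

# The same client also builds the training op and drives every session.run call.
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(tf.add_n(losses))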

Is it safe to access the hard drive via many different GCD queues?

Is it safe? For instance, if I create a bunch of different GCD queues that each compress (tar cvzf) some files, am I doing something wrong? Will the hard drive be destroyed?
Or does the system properly take care of such things?
Dietrich's answer is correct save for one detail (that is completely non-obvious).
If you were to spin off, say, 100 asynchronous tar executions via GCD, you'd quickly find that you have 100 threads running in your application (which would also be dead slow due to gross abuse of the I/O subsystem).
In a fully asynchronous concurrent system with queues, there is no way to know if a particular unit of work is blocked because it is waiting for a system resource or waiting for some other enqueued unit of work. Therefore, anytime anything blocks, you pretty much have to spin up another thread and consume another unit of work or risk locking up the application.
In such a case, the "obvious" solution is to wait a bit when a unit of work blocks before spinning up another thread to de-queue and process another unit of work with the hope that the first unit of work "unblocks" and continues processing.
Doing so, though, would mean that any asynchronous concurrent system with interaction between units of work -- a common case -- would be so slow as to be useless.
Far more effective is to limit the # of units of work that are enqueued in the global asynchronous queues at any one time. A GCD semaphore makes this quite easy; you have a single serial queue into which all units of work are enqueued. Every time you dequeue a unit of work, you increment the semaphore. Every time a unit of work is completed, you decrement the semaphore. As long as the semaphore is below some maximum value (say, 4), then you enqueue a new unit of work.
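Here is the same throttling idea as a sketch, written in Python only for neutrality (with GCD you would pair dispatch_semaphore_wait/dispatch_semaphore_signal with dispatch_async); paths_to_compress is a hypothetical list of inputs:

import tarfile
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 4
gate = threading.BoundedSemaphore(MAX_IN_FLIGHT)  # the counting semaphore
pool = ThreadPoolExecutor()                       # stand-in for the concurrent queue

def compress(path):
    try:
        with tarfile.open(path + ".tar.gz", "w:gz") as archive:
            archive.add(path)
    finally:
        gate.release()  # a unit of work completed: let another one in

for path in paths_to_compress:  # hypothetical list of files to compress
    gate.acquire()              # blocks once MAX_IN_FLIGHT units are outstanding
    pool.submit(compress, path)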
If you take something that is normally IO limited, such as tar, and run a bunch of copies in GCD,
It will run more slowly because you are throwing more CPU at an IO-bound task, meaning the IO will be more scattered and there will be more of it at the same time,
No more than N tasks will run at a time, which is the point of GCD, so "a billion queue entries" and "ten queue entries" give you the same thing if you have fewer than 10 threads,
Your hard drive will be fine.
Even though this question was asked back in May, it's still worth noting that GCD now provides I/O primitives as of the release of 10.7 (OS X Lion). See the man pages for dispatch_read and dispatch_io_create for examples of how to do efficient I/O with the new APIs. They are smart enough to properly schedule I/O against a single disk (or multiple disks) with knowledge of how much concurrency is, or is not, possible in the actual I/O requests.