Pinning TensorFlow OpKernels to specific cores

I have written an OpKernel that is expensive and stateful. Using the default implementation of Eigen's NonBlockingThreadPool and its standard scheduling means that:
OpKernels are run on any available thread/core
State for this op must be transferred to the new core, which causes non-optimal cache behavior
Is there a way to pin expensive ops to run on specific cores?

That's not currently possible, but you're not the first person to have a similar need.
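A hedged, Python-level workaround sketch (this is not per-op pinning and not a TensorFlow feature): pin the whole process to a subset of cores before the session is created and keep the threadpools small, so the Eigen worker threads that run the stateful op at least stay within a known set of cores. os.sched_setaffinity is Linux-only, and the core IDs and pool sizes below are illustrative.

import os
import tensorflow as tf

# Restrict the whole process (and thus Eigen's worker threads) to cores 0-3.
# Linux-only; pick cores that share a cache with each other.
os.sched_setaffinity(0, {0, 1, 2, 3})

config = tf.ConfigProto(
    intra_op_parallelism_threads=4,   # size of Eigen's threadpool
    inter_op_parallelism_threads=1)   # fewer chances for ops to hop between threads

with tf.Session(config=config) as sess:
    x = tf.random_normal([512, 512])
    print(sess.run(tf.reduce_sum(tf.matmul(x, x))))

This only narrows where the expensive op can run; within that set, the scheduler is still free to move it between cores.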

Related

What is the difference between the gem5 CPU models and which one is more accurate for my simulation?

When running a simulation in gem5, I can select a CPU with fs.py --cpu-type.
This option also shows a list of all available CPU types if I pass an invalid CPU type to fs.py --cpu-type.
What is the difference between those CPU types and which one should I choose for my experiment?
Question inspired by: https://www.mail-archive.com/gem5-users@gem5.org/msg16976.html
An overview of the CPU types can be found at: https://cirosantilli.com/linux-kernel-module-cheat/#gem5-cpu-types
In summary:
simplistic CPUs (derived from BaseSimpleCPU): for example AtomicSimpleCPU (the default one). They have no CPU pipeline, and are therefore completely unrealistic. However, they also run much faster. Therefore, they are mostly useful to boot Linux fast and then checkpoint and switch to a more detailed CPU.
Within the simple CPUs we can notably distinguish:
AtomicSimpleCPU: memory requests finish immediately
TimingSimpleCPU: memory requests actually take time to go through to the memory system and return. Since there is no CPU pipeline however, the simulated CPU stalls on every memory request waiting for a response.
An alternative to those is to use KVM CPUs to speed up boot if host and guest ISA are the same, although as of 2019, KVM is less stable as it is harder to implement and debug.
in-order CPUs: derived from the generic MinorCPU by parametrization (Minor stands for In Order):
for ARM: HPI is made by ARM and models a "(2017) modern in-order Armv8-A implementation". This is your best in-order ARM bet.
out-of-order CPUs: derived from the generic DerivO3CPU by parametrization (O3 stands for Out Of Order):
for ARM: there are no models specifically published by ARM as of 2019. The only specific O3 model available is ex5_big for an A15, but you would have to verify its authors' claims about how well it models the real A15 core.
If none of those are accurate enough for your purposes, you could try to create your own in-order/out-of-order models by parametrizing MinorCPU / DerivO3CPU like HPI and ex5_big do, although this could be hard to get right, as there isn't generally enough public information on non-free CPUs to do this without experiments or reverse engineering.
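To make "parametrizing MinorCPU" concrete, here is a hedged sketch in the style of gem5's Python configs; the parameter names exist on MinorCPU, but the values are made up and do not model any real core:

from m5.objects import MinorCPU

class MyInOrderCPU(MinorCPU):
    # Illustrative values only; a real model such as HPI sets many more
    # parameters, derived from measurements or documentation of the target core.
    fetch1FetchLimit = 1      # outstanding line fetches allowed in flight
    decodeInputWidth = 2      # instructions entering Decode per cycle
    executeInputWidth = 2     # instructions entering Execute per cycle
    executeIssueLimit = 2     # instructions issued per cycle
    executeCommitLimit = 2    # instructions committed per cycle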
The other thing you will want to think about is the memory system model. There are basically two choices: classical vs Ruby, and within Ruby, several options are available, see also: https://cirosantilli.com/linux-kernel-module-cheat/#gem5-ruby-build

Why Tensorflow model inference on GPU incurs so many CUDA cuEventRecord API calls?

I run a Tensorflow object detection model (one of these models) on one GPU (Tesla P100). To examine the performance bottleneck, I used the Nvidia nvprof profiling tool to profile my object detection application (performing object detection on a few frames). The profiling result is shown as follows.
======== Profiling result:
Type            Time(%)      Time     Calls       Avg       Min       Max  Name
API calls:       32.13%  15.2177s    434480  35.025us  5.1550us  954.27ms  cudaLaunchKernel
                 30.20%  14.3065s    942706  15.175us     361ns  77.372ms  cuEventRecord
                 13.39%  6.34349s    117067  54.186us  2.7000us  5.4721ms  cudaFuncGetAttributes
                  6.26%  2.96509s    575202  5.1540us     562ns  1.2027ms  cuEventQuery
                  6.16%  2.91725s     67072  43.494us  7.2690us  77.337ms  cuMemcpyDtoHAsync
...
By looking at the Nvidia Visual Profiler, I found that the object detection application contains multiple threads. A couple of these threads keep invoking the cuEventRecord CUDA driver API. The profiling result shows that the duration of the cuEventRecord calls is about 30% of the total duration of the CUDA runtime+driver activity. I was wondering whether these cuEventRecord API calls have something to do with the profiler, nvprof. If not, would these cuEventRecord invocations cause performance degradation for tensorflow model inference, and what is the point of having these cuEventRecord API calls?
I was wondering whether this cuEventRecord API call has something to do with the profiler: nvprof
It does not.
If not, whether these cuEventRecord invocations would cause performance degradation for tensorflow model inference.
They are part of the normal operation of Tensorflow.
what is the point to have these cuEventRecord API calls?
As I understand it, Tensorflow has been designed with a heavily pipelined device code path which relies on extensive use of events, stream synchronization, and stream callback functions to ensure that the GPU(s) are kept occupied and that the different phases of computation are scheduled, uploaded, and downloaded in the correct order. That is likely what you see here.
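To illustrate the record/wait pattern that produces these driver calls, here is a minimal sketch (not TensorFlow's internal code) using numba's CUDA bindings, which map onto the same driver API: an event is recorded on the copy stream and the compute stream waits on it, so the kernel never reads a half-copied buffer.

import numpy as np
from numba import cuda

@cuda.jit
def scale(x, factor):
    i = cuda.grid(1)
    if i < x.size:
        x[i] *= factor

copy_stream = cuda.stream()
compute_stream = cuda.stream()
done_copying = cuda.event()

host = cuda.pinned_array(1 << 20, dtype=np.float32)
host[:] = 1.0

dev = cuda.to_device(host, stream=copy_stream)   # async host-to-device copy
done_copying.record(copy_stream)                 # -> cuEventRecord under the hood

done_copying.wait(compute_stream)                # -> cuStreamWaitEvent
scale[1024, 1024, compute_stream](dev, 2.0)      # kernel ordered after the copy

compute_stream.synchronize()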

TensorFlow Serving and serving more models than the memory can allow

TensorFlow Serving can serve multiple models by configuring the --model_config_file command line argument. I had success using this feature in small experiments.
However, it's unclear to me what happens when the total memory required by these models is larger than, say, the available GPU memory.
Does the server just crash? Or does it support keeping a subset of models available and possibly unloading/loading models based on the usage?
Thanks.
If you try to load a model when you are out of memory, that model will fail to load. There's no dynamic loading/unloading at this time.
As currently written, it will crash if there isn't enough memory for all of the models requested to load. Internally there is a feature to gracefully decline to load a model that doesn't fit, which you could enable by writing a small PR that pipes the ServerCore::Options::total_model_memory_limit_bytes option [1] to a flag in main.cc. Note, however, that the notion of "fitting in memory" is based on a somewhat crude way of estimating model RAM footprint.
As Gautam said, it does not dynamically load/unload, although there is a library implemented for that (which isn't currently used in the released binary), called CachingManager [2].
[1] https://github.com/tensorflow/serving/blob/master/tensorflow_serving/model_servers/server_core.h#L112
[2] https://github.com/tensorflow/serving/blob/master/tensorflow_serving/core/caching_manager.h
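For reference, the file passed to --model_config_file is a text-format ModelServerConfig proto. Here is a hedged sketch of generating one from Python, assuming the tensorflow-serving-api package is installed; the model names and paths are made up:

from tensorflow_serving.config import model_server_config_pb2

config = model_server_config_pb2.ModelServerConfig()
for name, path in [("detector", "/models/detector"),
                   ("classifier", "/models/classifier")]:
    entry = config.model_config_list.config.add()
    entry.name = name
    entry.base_path = path
    entry.model_platform = "tensorflow"

# str() of a protobuf message is its text format, which is what
# tensorflow_model_server --model_config_file=... expects.
with open("/tmp/models.config", "w") as f:
    f.write(str(config))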

Simple explanation of terms while running configure command during Tensorflow installation

I am installing tensorflow from this link.
When I run the ./configure command, I see the following terms:
XLA JIT
GDR
VERBS
OpenCL
Can somebody explain in simple language, what these terms mean and what are they used for?
XLA stands for 'Accelerated Linear Algebra'. The XLA page states that 'XLA takes graphs ("computations") [...] and compiles them into machine instructions for various architectures.' As far as I understand, this will take the computation you define in tensorflow and compile it. Think of producing code in C and then running it through the C compiler for the CPU, and loading the resulting shared library with the code for the full computation, instead of making separate calls from python to compiled functions for each part of your computation. Theano does something like this by default. JIT stands for 'just in time' compiler, i.e. the graph is compiled 'on the fly'.
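For example, with XLA support compiled in at configure time, the JIT can be switched on from Python roughly like this (a hedged sketch against the TF 1.x API; the graph itself is just a placeholder):

import numpy as np
import tensorflow as tf

config = tf.ConfigProto()
# Ask TensorFlow to JIT-compile eligible subgraphs with XLA.
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

x = tf.placeholder(tf.float32, shape=[None, 256])
w = tf.Variable(tf.random_normal([256, 128]))
y = tf.nn.relu(tf.matmul(x, w))

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: np.ones((4, 256), np.float32)}).shape)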
GDR seems to be support for exchanging data between GPUs on different servers via GPUDirect. GPUDirect makes it possible for, e.g., the network card that receives data from another server over the network to write directly into the local GPU's memory without going through the CPU or main memory.
VERBS refers to the InfiniBand Verbs application programming interface ('library'). InfiniBand is a low-latency network used, for example, in many supercomputers. It can be used for communication between servers when you want to run tensorflow on more than one machine. The Verbs API is to InfiniBand what the Berkeley socket API is to TCP/IP communication (although there are many more communication options and different semantics optimized for performance with Verbs).
OpenCL is a programming language suited for executing parallel computing tasks on CPU and non-CPU devices such as GPUs, with a C-like syntax. Compared to C, however, there are certain restrictions, such as no support for recursion. One could probably say that OpenCL is to AMD what CUDA is to NVIDIA (although OpenCL is also used by other companies, like Altera).

Changing TensorFlow operation device placement during runtime?

As far as I can see, TensorFlow is designed to have fully static device placement across a single tf.Session.run(). Is there a known ideal location to insert code for on-the-fly changing of operation device placement?
I'm aware of the static methods at a python level, but I'm looking for something at a C++ level such that I can do something akin to load balancing.
As an example, let's say I want TensorFlow to schedule operations to the CPU and GPU in an alternating fashion (hardly ideal, I know). How might I do this at runtime, so that as operation dependencies are resolved and more operations are scheduled, the environment of an operation is updated to point at a different device? Would this best be done by using the DeviceMgr to change the execution device for the environment of a given operation in ExecutorState::Process(TaggedNode tagged_node, int64 scheduled_usec), right before the operation is launched (line 1651 of executor.cc)? Or am I misunderstanding when an operation is scheduled for execution through XLA, and what the latest point is at which I can change the device placement?
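For contrast, this is what the static, Python-level placement mentioned in the question looks like: the device is chosen per op at graph-construction time and cannot change within a single Session.run(). A minimal sketch of the alternating CPU/GPU example above, not an answer to the runtime question:

import tensorflow as tf

x = tf.random_normal([1024, 1024])
outputs = []
for i in range(4):
    # Placement is fixed here, when the graph is built.
    with tf.device("/cpu:0" if i % 2 == 0 else "/gpu:0"):
        x = tf.matmul(x, x)
        outputs.append(x)

sess_config = tf.ConfigProto(allow_soft_placement=True,   # fall back if no GPU
                             log_device_placement=True)   # print actual placement
with tf.Session(config=sess_config) as sess:
    sess.run(outputs[-1])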