TensorFlow Serving and serving more models than memory allows

TensorFlow Serving can serve multiple models by configuring the --model_config_file command line argument. I had success using this feature in small experiments.
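(For reference, such a multi-model setup is driven by a text-format config file passed via --model_config_file. The snippet below is only an illustrative sketch I'm adding, written in Python for convenience; the model names and base paths are made up.)

config_text = """\
model_config_list {
  config {
    name: "model_a"               # hypothetical model name
    base_path: "/models/model_a"  # hypothetical export directory
    model_platform: "tensorflow"
  }
  config {
    name: "model_b"
    base_path: "/models/model_b"
    model_platform: "tensorflow"
  }
}
"""

with open("models.config", "w") as f:
    f.write(config_text)

# Then start the server with something like:
#   tensorflow_model_server --port=8500 --model_config_file=models.config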
However, it's unclear to me what happens when the total memory required by these models is larger than, say, the available GPU memory.
Does the server just crash? Or does it support keeping a subset of models available and possibly unloading/loading models based on the usage?
Thanks.

Trying to load a model when you are out of memory will cause that model to fail to load. There's no dynamic loading/unloading at this time.

As currently written, it will crash if there isn't enough memory for all of the models requested to load. Internally there is a feature to gracefully decline to load a model that doesn't fit, which you could enable by writing a small PR that pipes the ServerCore::Options::total_model_memory_limit_bytes option [1] to a flag in main.cc. Note, however, that the notion of "fitting in memory" is based on a somewhat crude way of estimating model RAM footprint.
As Gautam said, it does not dynamically load/unload, although there is a library implemented for that (which isn't currently used in the released binary), called CachingManager [2].
[1] https://github.com/tensorflow/serving/blob/master/tensorflow_serving/model_servers/server_core.h#L112
[2] https://github.com/tensorflow/serving/blob/master/tensorflow_serving/core/caching_manager.h

Related

How does the GPU process non-graphic data in parallel?

The introduction of programmable shaders in the graphics pipeline enabled the GPGPU concept, which uses the GPU as a general processing engine suited to data-parallel workloads.
However, as far as I know, because the GPU is still used far more for graphics than for GPGPU, it contains many fixed graphics-pipeline stages that cannot be programmed.
If my understanding is correct, any data processed by the GPU, whether graphical or general-purpose, has to pass through that fixed graphics pipeline, which includes both programmable stages and non-programmable fixed stages.
Does that mean non-graphical processing has to go through graphics-processing stages even though it doesn't use them? Or can it bypass those fixed stages used for graphics? If someone can explain how the GPU pipeline works for GPGPU, I would appreciate it.
TL;DR:
GPGPU completely bypasses the rendering pipeline, but the pipeline is still used today.
GPUs consist of two main parts (in relation to your question). The first is the processing part, which consists of the memory, registers, warp units, dispatchers and streaming processors. The other part is a set of controllers that are responsible for geometry processing and the graphics pipeline. Those controllers just issue commands to the streaming processors on how to process the data for each step of the rendering pipeline, either hardwired or based on user-supplied shaders. NVidia calls them the "PolyMorph Engine", AMD the "Geometric Processor".
Historically, some of those controllers were hardwired to do things a single way, so you could only program the vertex shader, fragment shader and pixel shader. The tessellation controller, for example, was hardwired on the GPU and not user-programmable. As demands grew, more and more of those controllers became user-programmable, and today most of them are completely programmable (Wikipedia).
In the early days of GPGPU, the only way to do computing was to hack the available shaders: you drew a full-screen quad textured with your input data, computed the result in the shader, and then read the rendered image back (see slide 26 of this introduction).
With CUDA, NVidia allowed users not only to program the shaders/PolyMorph Engine, but also to interact directly with the streaming processors and execute code on them (see slides 31 & 32).
This does not mean that the graphics pipeline became obsolete; rather, there is now a way to bypass it completely and run code directly on the GPU's processors. Nvidia has a nice explanation of how the pipeline works today, where you can also see both the PolyMorph Engine and the streaming processors, here.
The graphics pipeline still helps the developer by offloading repetitive and more complicated parts of the process, like managing memory, managing warps and passing data around. In theory you could write your own pipeline directly on the streaming processors using CUDA and then render the result, but it would be tedious, just as writing GPGPU code using shaders is tedious.
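To make "run code directly on the GPU's processors" concrete: below is a minimal compute-only sketch I'm adding for illustration, using Python with Numba's CUDA support (my choice of toolchain, not something from the answer); it assumes a CUDA-capable GPU and never touches the render pipeline.

import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)          # absolute thread index across the whole grid
    if i < out.size:
        out[i] = x[i] + y[i]

x = np.arange(1 << 20, dtype=np.float32)
y = np.ones_like(x)
out = np.zeros_like(x)

threads_per_block = 128
blocks = (x.size + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](x, y, out)   # Numba transfers the arrays for us
print(out[:4])  # [1. 2. 3. 4.]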
Although older GPUs had pipelines hardcoded into the chip, a modern GPU is essentially a large ASIC that can crunch vectorized data extremely fast. It is humans who define what it does: the render pipeline is defined in the graphics library, such as OpenGL, not in the GPU. The GPU therefore does not care what it is computing; as long as the data is vectorized, it can do all the computation needed and give you a result.

OpenVINO unable to get optimum performance while running multiple inference engines

I am running multiple Python processes (4 in this case, using the multiprocessing module) for person detection (using an SSD MobileNet model), each with its own OpenVINO inference engine. I am getting very low FPS (no more than 10) for each process. My suspicion is that the CPUs are not being utilized optimally because each engine spawns a large number of threads, which adds overhead, and because the processes share CPUs.
Also, with a single process I get up to 60 FPS with OMP_NUM_THREADS set to 4.
My CPU details are:
2 Sockets
4 cores each socket
1 thread each core
Total - 8 CPUs
So:
What would be the optimal value for OMP_NUM_THREADS in this case?
How can I avoid sharing of CPUs across processes?
Currently I am playing with the OMP_NUM_THREADS and KMP_AFFINITY variables, but just setting values by trial and error. Any detail on how to set them would be really helpful. Thanks.
For inference with multiple networks, you may try setting OMP_WAIT_POLICY to PASSIVE.
By the way, OpenVINO 2019 R1 moved from OpenMP to TBB, which might give better efficiency for a pipeline of deep learning networks.
If you are using the same model for all the processes, consider using OpenVINO multi-stream inference. With it you load a single network and then create multiple infer requests. This gives better CPU utilization (compared to running one infer request across multiple cores) and, as a result, better throughput.
To understand how to use multi-stream inference, take a look at the inference_engine/samples/python_samples/benchmark_app/benchmark sample.
You can also use the benchmark sample to do a grid search for an optimal configuration (number of streams, batch size).
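As an illustration of the multi-stream idea (my own sketch, not the benchmark sample itself): the model paths, input shape and stream/request counts below are made up, and the exact Python API names vary between OpenVINO releases.

import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="ssd_mobilenet.xml", weights="ssd_mobilenet.bin")  # hypothetical paths

# One executable network with several CPU streams and infer requests,
# instead of one engine per OS process.
exec_net = ie.load_network(
    network=net,
    device_name="CPU",
    config={"CPU_THROUGHPUT_STREAMS": "4"},
    num_requests=4,
)

input_name = next(iter(net.input_info))  # net.inputs on older releases
frames = [np.zeros((1, 3, 300, 300), dtype=np.float32) for _ in range(4)]  # stand-in input

# Kick off all requests asynchronously, then collect the results.
for req, frame in zip(exec_net.requests, frames):
    req.async_infer({input_name: frame})
for req in exec_net.requests:
    req.wait()
    results = req.output_blobs  # req.outputs on older releases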

Google Cloud ML Engine "out-of-memory" error when Memory utilization is nearly zero

I am following the Tensorflow Object Detection API tutorial to train a Faster R-CNN model on my own dataset on Google Cloud. But the following "ran out-of-memory" error kept happening.
The replica master 0 ran out-of-memory and exited with a non-zero status of 247.
And according to the logs, a non-zero exit status -9 was returned. As described in the official documentation, a code of -9 might mean the training is using more memory than allocated.
However, the memory utilization is lower than 0.2, so why am I having a memory problem? If it helps, the memory utilization graph is here.
The memory utilization graph is an average across all workers. In the case of an out of memory error, it's also not guaranteed that the final data points are successfully exported (e.g., a huge sudden spike in memory). We are taking steps to make the memory utilization graphs more useful.
If you are using the Master to also do evaluation (as exemplified in most of the samples), then the Master uses ~2x the RAM relative to a normal worker. You might consider using the large_model machine type.
The running_pets tutorial uses the BASIC_GPU tier, so perhaps the GPU has run out of memory.
The graphs on ML engine currently only show CPU memory utilization.
If this is the case, changing your tier to larger GPUs will solve the problem. Here is some information about the different tiers.
On the same page, you will find an example YAML file showing how to configure it.
Looking at your error, it seems that your ML code is consuming more memory than it was originally allocated.
Try a machine type that gives you more memory, such as "large_model" or "complex_model_l". Use a config.yaml to define it as follows:
trainingInput:
  scaleTier: CUSTOM
  # 'large_model' for bigger model with lots of data
  masterType: large_model
  runtimeVersion: "1.4"
There is a similar question, Google Cloud machine learning out of memory. Please refer to that link for the actual solution.

Is it possible to use TensorFlow Serving with distributed TensorFlow cluster to improve throughput/latency?

I'm looking into ways to improve latency and/or throughput of a TensorFlow Serving instance. I've seen the "Serving Inception" manual and three GitHub issues (2, 3, 4), but all of them seem to create a separate instance of TensorFlow Serving per server and then choose the server on the client. Issue 4 is actually about adding a load balancer in front of that, which is currently absent from TensorFlow Serving itself.
However, there is also the "Distributed TensorFlow" tutorial, which shows how to join a set of machines into a fixed cluster and then manually "pin" some computations to some machines; this can improve both latency and throughput if the model is "wide" and can be parallelized well. However, I do not see any mention of combining this with TensorFlow Serving in the documentation of either.
Question is: is it possible to configure TensorFlow Serving to use distributed TensorFlow cluster?
I was able to make it create and use gRPC sessions (instead of local) with some hacks:
Make the tensorflow/core/distributed_runtime/rpc:grpc_session target publicly visible (it's internal to the tensorflow package by default) by modifying the BUILD file.
Add it as a dependency to the tensorflow_serving/model_servers:tensorflow_model_server target.
Add an extra flag to tensorflow_model_server called --session_target which sets up session_bundle_config.session_target() in main.cc.
Run the binary with --session_target=grpc://localhost:12345, where localhost:12345 is an arbitrary node which will be used to create master sessions.
See my cluster performing some computations on behalf of TensorFlow Serving.
However, this set of hacks does not look sufficient for "real-world usage", for three reasons:
The grpc_session target is probably internal for a reason.
As noticed in my other question, distributed TensorFlow works better when computations are manually "pinned" to specific machines. So, if we use TensorFlow Serving, we need a way to save those "pins", and the model's structure becomes tied to the cluster's structure. I'm not sure whether this information is exported with Exporter/Saver at all.
tensorflow_model_server creates the session once, during bootstrap. If the master node of the cluster goes down and then comes back, the serving server still holds the "old" session and cannot process further requests.
All in all, it looks like this scenario is not officially supported yet, but I'm not sure.
If your model fits on a single machine, then it's hard to see how distributing it over many machines will improve throughput. Essentially you are taking computations which can be done independently and adding a dependency. If one of your machines is slow or crashes, instead of making some queries slow, it will make all queries slow.
That said, it's worth benchmarking to see if it helps, in which case it would make sense to ask for this use-case to be officially supported.
Regarding your questions:
Worker assignments are done through the device field in the graph .pbtxt. Some importers/exporters clear those assignments and have a clear_devices flag. You could open the graph definition (the .pbtxt file or, equivalently, str(tf.get_default_graph().as_graph_def())) and grep for device strings to check; a sketch follows these notes.
If any worker restarts, or there is a temporary loss of network connectivity, your sess.run fails with an Unavailable error and you need to recreate the session. This is handled automatically by MonitoredTrainingSession in tf.train, but with serving you need to handle it yourself.
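To illustrate the first point, here is a minimal TF 1.x-style sketch for inspecting device pins and clearing them on import; the checkpoint path is hypothetical.

import tensorflow as tf  # TF 1.x API, matching the discussion above

# Inspect which ops are pinned to which workers/devices.
graph_def = tf.get_default_graph().as_graph_def()
for node in graph_def.node:
    if node.device:
        print(node.name, "->", node.device)

# When re-importing a graph, the pins can be dropped explicitly.
saver = tf.train.import_meta_graph("/tmp/model.ckpt.meta",  # hypothetical path
                                   clear_devices=True)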
If your model does not use images, or is not extremely large, you shouldn't need much compute for each inference/serve; I'm saying this based on Inception-v#, which takes ~1 sec to serve a response to an image on a Google Cloud Platform n1-standard-1 machine.
That being said, perhaps it's the throughput that you need to scale up, and that is a different problem. Your best option for scale at that point would be to use Docker Swarm & Compose, as well as Kubernetes, to help scale up and serve your inference "micro-service". You could also use Flask to iterate over a sequence of requests if your use case warrants it.

How can I stream data directly into tensorflow as opposed to reading files on disc?

Every tensorflow tutorial I've been able to find so far works by first loading the training/validation/test images into memory and then processing them. Does anyone have a guide or recommendations for streaming images and labels as input into tensorflow? I have a lot of images stored on a different server and I would like to stream those images into tensorflow as opposed to saving the images directly on my machine.
Thank you!
TensorFlow does have queues, which support streaming so you don't have to load the full dataset into memory. But yes, by default they only support reading from files on the same server. The real problem you have is that you want to load into memory data that lives on some other server. I can think of the following ways to do this:
Expose your images using a REST service. Write your own queueing mechanism in Python, read this data (using urllib or something), and feed it to TensorFlow placeholders (see the sketch at the end of this answer).
Instead of using Python queues (as above), you can also use TensorFlow queues (see this answer), although it's slightly more complicated. The advantage is that TensorFlow queues can use multiple cores, giving you better performance than normal Python multi-threaded queues.
Use a network mount to fool your OS into believing the data is on the same machine.
Also, remember that with this sort of distributed setup you will always incur network overhead (the time taken for images to be transferred from server 1 to server 2), which can slow your training down a lot. To counteract this, you would have to build a multi-threaded queueing mechanism with fetch-execute overlap, which is a lot of effort. An easier option, in my opinion, is to just copy the data onto your training machine.
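As a rough sketch of option 1 above (my own illustration: the endpoint URL, batch shape and the tiny stand-in "model" are all made up):

import threading
import queue
import urllib.request

import numpy as np
import tensorflow as tf  # TF 1.x-style API

BATCH, H, W = 32, 224, 224
URL = "http://image-server:8000/next_batch"   # hypothetical REST endpoint

batches = queue.Queue(maxsize=8)

def fetcher():
    # Continuously pull raw float32 batches from the remote server.
    while True:
        raw = urllib.request.urlopen(URL).read()
        batches.put(np.frombuffer(raw, dtype=np.float32).reshape(BATCH, H, W, 3))

threading.Thread(target=fetcher, daemon=True).start()

images = tf.placeholder(tf.float32, [BATCH, H, W, 3])
features = tf.reduce_mean(images, axis=[1, 2])   # stand-in for a real model

with tf.Session() as sess:
    for _ in range(100):
        sess.run(features, feed_dict={images: batches.get()})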
You can use Python's socket module to transfer a batch of images and labels from your server to your host. Your graph needs to be defined to take a placeholder as input, and the placeholder must be compatible with your batch size.
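A minimal sketch of that idea, assuming (my assumption, for illustration) that the remote server sends a pickled (images, labels) batch prefixed with an 8-byte length; the host, port, shapes and stand-in loss are made up:

import pickle
import socket
import struct

import tensorflow as tf  # TF 1.x-style API

def recv_exact(sock, n):
    # Read exactly n bytes from the socket.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed while receiving batch")
        buf += chunk
    return buf

def recv_batch(sock):
    (length,) = struct.unpack(">Q", recv_exact(sock, 8))   # 8-byte length prefix
    return pickle.loads(recv_exact(sock, length))          # (images, labels) arrays

images = tf.placeholder(tf.float32, [32, 224, 224, 3])
labels = tf.placeholder(tf.int64, [32])
loss = tf.reduce_mean(images) + tf.reduce_mean(tf.cast(labels, tf.float32))  # stand-in loss

with socket.create_connection(("image-server", 9999)) as sock, tf.Session() as sess:
    batch_images, batch_labels = recv_batch(sock)
    sess.run(loss, feed_dict={images: batch_images, labels: batch_labels})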