We want to measure the memory footprint of each model live in the TensorFlow Serving container. We are using the open-source TF Serving container (https://hub.docker.com/r/tensorflow/serving).
We tried scraping the Prometheus metrics, but unfortunately it doesn't publish any metric for per-model memory, or even overall memory. Is there any way to achieve this? Any suggestion works as well.
We want to safeguard our production systems from accidentally loading larger models that can bring the whole service down.
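For reference, this is roughly the kind of check we ran against the metrics endpoint (a small sketch; it assumes the REST port 8501 and a monitoring config exposing /monitoring/prometheus/metrics, so adjust for your setup). Nothing memory-related comes back:

    import requests

    # Scrape TF Serving's Prometheus endpoint (the path depends on your monitoring config).
    text = requests.get("http://localhost:8501/monitoring/prometheus/metrics").text

    # Keep only non-comment lines that mention memory; in our case this list is empty.
    memory_metrics = [line for line in text.splitlines()
                      if "memory" in line.lower() and not line.startswith("#")]
    print(memory_metrics or "no memory-related metrics exported")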
Related
I'm loading two models, A and B, with the same TensorFlow Model Server instance (running inside a single Docker container).
(using tensorflow_model_server v2.5.1)
The models are ~5GB on disk, of which about 1.7GB is just the inference-related nodes.
Model A is loaded only once, and model B gets a new version every once in a while.
My client only requests predictions from model A; model B isn't used.
Incidentally, both models have warmup data.
Every time a new version of model B is loaded, the tail latency of graph runs in model A jumps from 20ms to >100ms (!).
I'm measuring the tail latency both from TensorFlow Model Server (:tensorflow:serving:request_latency_bucket)
and from my gRPC client.
The container has plenty of available memory and CPU,
and the same behaviour also appears with a GPU.
I tried changing num_load_threads, num_unload_threads and flush_filesystem_caches,
but to no avail.
So far I haven't managed to get rid of it using gRPC hedging / manual double dispatch, but I'm working on it (see the sketch below). Has anybody ever seen this, and better yet, managed to get past it?
thanks!
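For reference, a minimal sketch of the manual double-dispatch idea mentioned above: fire the same Predict request twice and take whichever response arrives first. The address, model name, input name and input shape are placeholders, not my real setup.

    import concurrent.futures as cf

    import grpc
    import numpy as np
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

    channel = grpc.insecure_channel("localhost:8500")
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    pool = cf.ThreadPoolExecutor(max_workers=8)  # kept alive across calls

    def build_request(batch):
        request = predict_pb2.PredictRequest()
        request.model_spec.name = "model_a"                    # placeholder model name
        request.inputs["input"].CopyFrom(                      # placeholder input name
            tf.make_tensor_proto(batch, shape=batch.shape))
        return request

    def hedged_predict(batch, timeout=1.0):
        request = build_request(batch)
        # Dispatch the same request twice and return whichever completes first
        # (errors from the winning call propagate to the caller).
        futures = [pool.submit(stub.Predict, request, timeout) for _ in range(2)]
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        return next(iter(done)).result()

    result = hedged_predict(np.zeros((1, 224, 224, 3), dtype=np.float32))  # placeholder input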
This is just a general question rather than a specific issue.
For an upcoming school assignment I'll need to use OpenCV. From what the professor told me, an OpenCV model can take up to 8GB of memory, but my graphics card (GTX 960) has only 2GB of VRAM. What will happen if I try training a model larger than 2GB? Can TensorFlow make use of my CPU memory (system RAM) to store the model?
cv2 is typically used to do some processing on images, like reading them in or resizing them. The model is independent of cv2, except where cv2 is used to read and preprocess the images and prepare them as input to the model. The amount of memory you will use depends on the image sizes and on the model you build. To avoid needing a lot of memory, you typically do not want to put all your images into memory at once, but instead process them in batches. Keras provides a number of ways to do that. Documentation for that is located here.
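For example, one of those ways is ImageDataGenerator.flow_from_directory, which reads and resizes images one batch at a time instead of loading the whole dataset into memory (the directory, image size and batch size below are placeholders):

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(rescale=1.0 / 255)        # scale pixels to [0, 1]
    train_batches = datagen.flow_from_directory(
        "data/train",                                      # one subfolder per class (placeholder path)
        target_size=(224, 224),
        batch_size=32,
        class_mode="categorical",
    )

    # The generator yields (images, labels) one batch at a time, e.g.:
    # model.fit(train_batches, epochs=10)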
I am running in the following scenario:
Single Node Kubernetes Cluster (1x i7-8700K, 1x RTX 2070, 32GB RAM)
1 Tensorflow Serving Pod
4 Inference Client Pods
What the inference clients do is get images from 4 separate cameras (1 each) and pass them to TF-Serving for inference, in order to understand what is seen on the video feeds.
I had previously been doing inference inside the Inference Client Pods individually, calling TensorFlow directly, but that wasn't good for the graphics card's RAM. TensorFlow Serving was introduced to the mix quite recently to optimize RAM usage, since we no longer load duplicate models onto the graphics card.
And the performance is not looking good; for 1080p images it looks like this:
Direct TF: 20ms for input tensor creation, 70ms for inference.
TF-Serving: 80ms for gRPC serialization, 700-800ms for inference.
The TF-Serving pod is the only one that has access to the GPU, and it is bound to it exclusively. Everything else operates on the CPU.
Are there any performance tweaks I could do?
The model I'm running is Faster R-CNN Inception V2 from the TF Model Zoo.
Many thanks in advance!
This is from the TF Serving documentation:
Please note, while the average latency of performing inference with TensorFlow Serving is usually not lower than using TensorFlow directly, where TensorFlow Serving shines is keeping the tail latency down for many clients querying many different models, all while efficiently utilizing the underlying hardware to maximize throughput.
From my own experience, I've found TF Serving useful in providing a consistent abstraction over model serving that does not require implementing custom serving functionality. Model versioning and multi-model serving, which come out of the box, save you lots of time and are great additions.
Additionally, I would recommend batching your requests if you haven't already. I would also suggest playing around with the TENSORFLOW_INTER_OP_PARALLELISM, TENSORFLOW_INTRA_OP_PARALLELISM and OMP_NUM_THREADS arguments to TF Serving. Here is an explanation of what they are.
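As a rough illustration of what I mean by batching requests (the host, model name and input tensor name below are placeholders for your setup): stack several frames into a single gRPC Predict request instead of sending one request per image.

    import grpc
    import numpy as np
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

    channel = grpc.insecure_channel("tf-serving:8500")
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    # Four camera frames batched along the first dimension.
    frames = np.random.randint(0, 255, size=(4, 1080, 1920, 3), dtype=np.uint8)

    request = predict_pb2.PredictRequest()
    request.model_spec.name = "faster_rcnn"                 # placeholder model name
    request.inputs["inputs"].CopyFrom(                      # placeholder input name
        tf.make_tensor_proto(frames, shape=frames.shape))

    response = stub.Predict(request, 10.0)                  # one round trip for all four frames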
Maybe you could try OpenVINO? It's a heavily optimized toolkit for inference. You could utilize your i7-8700K and run some frames in parallel. Here are some performance benchmarks for the very similar i7-8700T.
There is even OpenVINO Model Server which is very similar to Tensorflow Serving.
Disclaimer: I work on OpenVINO.
I am trying to train a very deep model on Cloud ML; however, I am having serious memory issues that I am not managing to get around. The model is a very deep convolutional neural network to auto-tag music.
The model for this can be found in the image below. A batch of 20 tensors of shape 12x38832x1 is fed into the network.
The music was originally 465894x1 samples, which was then split into 12 windows; hence, 12x38832x1. When using the map_fn function, each loop processes a separate 38832x1 window (conv1d).
Processing windows one at a time yields better results than processing the whole track with one CNN. The data was split prior to being stored in TFRecords in order to minimise the processing needed during training. It is loaded into a queue with a maximum queue size of 200 samples (i.e. 10 batches).
Once dequeued, the batch is transposed so that the window dimension (12) comes first, which can then be used in the map_fn function for processing of the windows. It is not transposed prior to being queued because the first dimension needs to match the batch dimension of the output, which is [20, 50], where 20 is the batch size and 50 is the number of tags.
For each window, the data is processed and the results of each map_fn are superpooled using a smaller network (a sketch of this flow is below). The processing of the windows is done by a very deep neural network, which is what is giving me trouble: every configuration I try results in out-of-memory errors.
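For reference, a minimal sketch of the transpose + map_fn + superpooling flow described above (written with tf.keras for brevity and assuming eager execution; the per-window network and head here are tiny placeholders, not my actual architecture):

    import tensorflow as tf

    BATCH, WINDOWS, SAMPLES, TAGS = 20, 12, 38832, 50

    # Shared per-window network; the same weights are reused for every window.
    conv = tf.keras.layers.Conv1D(32, kernel_size=8, strides=4, activation="relu")
    pool = tf.keras.layers.GlobalAveragePooling1D()

    def per_window(window_batch):
        # window_batch: [BATCH, SAMPLES, 1] -> [BATCH, 32]
        return pool(conv(window_batch))

    # Stand-in for the dequeued batch: [BATCH, WINDOWS, SAMPLES, 1]
    audio = tf.random.normal([BATCH, WINDOWS, SAMPLES, 1])

    # Put the window dimension first so map_fn iterates over windows.
    windows_first = tf.transpose(audio, [1, 0, 2, 3])        # [WINDOWS, BATCH, SAMPLES, 1]
    per_window_feats = tf.map_fn(per_window, windows_first)  # [WINDOWS, BATCH, 32]

    # "Superpool" the window features with a small head network.
    features = tf.transpose(per_window_feats, [1, 0, 2])     # [BATCH, WINDOWS, 32]
    logits = tf.keras.layers.Dense(TAGS)(
        tf.keras.layers.Flatten()(features))                 # [BATCH, TAGS]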
As a model template I am using one similar to the Census TensorFlow model.
First and foremost, I am not sure if this is the best option, since for evaluation a separate graph is built rather than sharing variables. This would require double the amount of parameters.
Secondly, as a cluster setup, I have been using one complex_l master, 3 complex_l workers and 3 large_model parameter servers. I do not know if I am underestimating the amount of memory needed here.
My model previously worked with a much smaller network; however, increasing its size started giving me severe out-of-memory errors.
My questions are:
The memory requirement is big, but I am sure it can be handled on Cloud ML. Am I underestimating the amount of memory needed? What are your suggestions for the cluster for such a network?
When using a tf.train.Server in the dispatch function, do you need to pass the cluster_spec on so it is used in the replica_device_setter? Or does it allocate on its own? When not passing it, and setting log_device_placement in the tf.ConfigProto, all the variables seem to end up on the master worker. In the Census example's task.py this is not passed on. Can I assume this is correct?
How does one calculate how much memory is needed for a model (a rough estimate to select the cluster; see the sketch after these questions)?
Is there any other TensorFlow core tutorial on how to set up such big jobs (other than Census)?
When training a big model with distributed between-graph replication, does the whole model need to fit on each worker, or does the worker only run ops and then transmit the results to the PS? Does that mean the workers can get by with little memory, just enough for individual ops?
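To make the memory question above concrete, the kind of rough estimate I have in mind is back-of-the-envelope arithmetic like the following (a sketch in tf.keras style, assuming eager execution and a hypothetical stand-in model): parameter bytes plus optimizer slots only, with activations, gradients and input queues coming on top of that.

    import tensorflow as tf

    def rough_model_memory_mb(model, bytes_per_param=4, optimizer_slots=2):
        # Lower bound only: float32 parameters plus Adam-style optimizer slot variables.
        # Activations, gradients and input queues typically multiply this several times.
        n_params = sum(int(tf.size(w)) for w in model.trainable_weights)
        return n_params * bytes_per_param * (1 + optimizer_slots) / 1024 ** 2

    # Hypothetical stand-in model just to exercise the helper.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv1D(32, 8, strides=4, input_shape=(38832, 1)),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(50),
    ])
    print("~%.2f MB for parameters + optimizer slots" % rough_model_memory_mb(model))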
PS: With smaller models the network trained successfully. I am trying to deepen the network for better ROC.
Questions coming up from on-going troubleshooting:
When using the replica_device_setter with the cluster parameter, I noticed that the master has very little memory and CPU usage, and checking the log placement, there are very few ops on the master. I checked the TF_CONFIG that is loaded and it shows the following for the cluster field:
u'cluster': {u'ps': [u'ps-4da746af4e-0:2222'], u'worker': [u'worker-4da746af4e-0:2222'], u'master': [u'master-4da746af4e-0:2222']}
On the other hand, the tf.train.ClusterSpec documentation only shows workers. Does that mean that the master is not considered a worker? What happens in that case?
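For context, the wiring I have in mind is a TF 1.x-style sketch like the one below: build a ClusterSpec from TF_CONFIG and pass it to replica_device_setter so variables land on the parameter servers. It assumes TF_CONFIG also carries the usual task field with type/index, and the variable at the end is just a placeholder.

    import json
    import os

    import tensorflow as tf  # TF 1.x APIs (tf.train.replica_device_setter, tf.get_variable)

    tf_config = json.loads(os.environ["TF_CONFIG"])
    cluster_spec = tf.train.ClusterSpec(tf_config["cluster"])
    task = tf_config["task"]                      # e.g. {"type": "master", "index": 0}

    server = tf.train.Server(cluster_spec,
                             job_name=task["type"],
                             task_index=task["index"])

    if task["type"] == "ps":
        server.join()
    else:
        # Variables are placed round-robin on the ps job; ops stay on this worker/master.
        with tf.device(tf.train.replica_device_setter(
                cluster=cluster_spec,
                worker_device="/job:%s/task:%d" % (task["type"], task["index"]))):
            weights = tf.get_variable("w", shape=[38832, 50])   # placeholder variable
            # ... build the rest of the graph here ...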
Is the error actually memory-related, or something else? It shows up as an EOF error.
I have a small web server that receives sentences as input and needs to return a model prediction using TensorFlow Serving. It's all working fine using our single GPU, but now I'd like to enable batching so that TensorFlow Serving waits a bit to group incoming sentences before processing them together in one batch on the GPU.
I'm using the predesigned server framework with the predesigned batching framework, using the initial release of TensorFlow Serving. I'm enabling batching with the --batching flag and have set batch_timeout_micros = 10000 and max_batch_size = 1000. The logging does confirm that batching is enabled and that the GPU is being used.
However, when sending requests to the serving server, the batching has minimal effect: sending 50 requests at the same time scales almost linearly in time compared with sending 5 requests. Interestingly, the predict() function of the server is run once for each request (see here), which suggests to me that the batching is not being handled properly.
Am I missing something? How do I check what's wrong with the batching?
Note that this is different from How to do batching in Tensorflow Serving?, as that question only examines how to send multiple requests from a single client, not how to enable TensorFlow Serving's behind-the-scenes batching across multiple separate requests.
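For completeness, the kind of concurrency test I'm running looks roughly like the following (a sketch; the model name and input name are placeholders): send N identical requests at the same time and see whether total wall time grows sub-linearly (batching kicking in) or roughly linearly (each request handled on its own).

    import concurrent.futures as cf
    import time

    import grpc
    import numpy as np
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

    stub = prediction_service_pb2_grpc.PredictionServiceStub(
        grpc.insecure_channel("localhost:8500"))

    def one_request(_):
        request = predict_pb2.PredictRequest()
        request.model_spec.name = "sentence_model"            # placeholder model name
        request.inputs["sentences"].CopyFrom(                 # placeholder input name
            tf.make_tensor_proto(np.array([b"an example sentence"])))
        return stub.Predict(request, 5.0)

    for n in (5, 50):
        start = time.time()
        with cf.ThreadPoolExecutor(max_workers=n) as pool:
            list(pool.map(one_request, range(n)))
        print("%d concurrent requests took %.3fs" % (n, time.time() - start))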
(I am not familiar with the server framework, but I'm quite familiar with HPC and with cuBLAS and cuDNN, the libraries TF uses to do its dot products and convolutions on GPU)
There are several issues that could cause disappointing performance scaling with the batch size.
I/O overhead, by which I mean network transfers, disk access (for large data), serialization, deserialization and similar cruft. These things tend to be linear in the size of the data.
To look into this overhead, I suggest you deploy 2 models: one that you actually need, and one that's trivial but uses the same I/O, then subtract the time needed by one from the other.
This time difference should be similar to the time the complex model takes when you run it directly, without the I/O overhead.
If the bottleneck is in the I/O, speeding up the GPU work is inconsequential.
Note that even if increasing the batch size makes the GPU faster, it might make the whole thing slower, because the GPU now has to wait for the I/O of the whole batch to finish to even start working.
cuDNN scaling: things like matmul need large batch sizes to achieve their optimal throughput, but convolutions using cuDNN might not (at least that hasn't been my experience, though this might depend on the cuDNN version and the GPU architecture).
RAM, GPU RAM, or PCIe bandwidth-limited models: If your model's bottleneck is in any of these, it probably won't benefit from bigger batch sizes.
The way to check this is to run your model directly (perhaps with mock input), compare the timing to the aforementioned time difference and plot it as a function of the batch size.
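A rough sketch of that direct-timing experiment (the stand-in model and input shape are placeholders; swap in your own model restored from your SavedModel):

    import time

    import numpy as np
    import tensorflow as tf

    # Stand-in model just so the loop runs; replace with your own model.
    model = tf.keras.applications.MobileNetV2(weights=None)

    for batch_size in (1, 2, 4, 8, 16, 32):
        mock = np.random.rand(batch_size, 224, 224, 3).astype(np.float32)
        model.predict(mock, verbose=0)                        # warm-up run
        start = time.time()
        for _ in range(10):
            model.predict(mock, verbose=0)
        per_batch = (time.time() - start) / 10
        print("batch=%2d  %7.1f ms/batch  %6.2f ms/sample"
              % (batch_size, per_batch * 1e3, per_batch / batch_size * 1e3))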
By the way, as per the performance guide, one thing you could try is using the NCHW layout, if you are not already. There are other tips there.