What does "max_batch_size" mean in tensorflow-serving batching_config.txt?

I'm using tensorflow-serving on GPUs with --enable_batching=true.
However, I'm a little confused by max_batch_size in batching_config.txt.
My client sends an input tensor of shape [-1, 1000] in a single gRPC request, where dim0 ranges over (0, 200]. I set max_batch_size = 100 and receive errors such as:
"gRPC call return code: 3:Task size 158 is larger than maximum batch size 100"
"gRPC call return code: 3:Task size 162 is larger than maximum batch size 100"
It looks like max_batch_size limits dim0 of a single request. However, since TensorFlow Serving combines multiple requests into one batch, I thought it meant the total number of requests in a batch.

Here is a direct description from the docs.
max_batch_size: The maximum size of any batch. This parameter governs
the throughput/latency tradeoff, and also avoids having batches that
are so large they exceed some resource constraint (e.g. GPU memory to
hold a batch's data).
In ML, the first dimension usually represents the batch. As far as I understand, TensorFlow Serving treats the first dimension of each request as that request's batch size (the "task size" in the error message) and returns an error whenever it is larger than the allowed value. You can verify this by sending some requests in which you manually keep the first dimension at or below 100. I expect this to remove the error.
After that, you can modify your inputs so that they are sent in a proper format.
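If that diagnosis is right, one client-side workaround is to split any request whose first dimension exceeds max_batch_size into several smaller requests. The following is a minimal sketch using NumPy only. The helper name split_into_tasks is hypothetical (it is not part of TensorFlow Serving), and the actual gRPC call is left out.

```python
import numpy as np


def split_into_tasks(batch, max_batch_size):
    """Hypothetical helper: split rows into chunks of at most max_batch_size rows."""
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    return [batch[i:i + max_batch_size]
            for i in range(0, len(batch), max_batch_size)]


# A request like the one from the error message: 158 rows of 1000 features.
request = np.zeros((158, 1000), dtype=np.float32)
chunks = split_into_tasks(request, max_batch_size=100)
chunk_sizes = [chunk.shape[0] for chunk in chunks]
```

Each chunk would then be sent as its own request, and the responses concatenated along dim0 on the client.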

Related

How to run a bigger batch with AWS SageMaker Batch Transform

I created an XGBoost model with AWS SageMaker. Now I'm trying to use it through Batch Transform Job, and it's all going pretty well for small batches.
However, there's a slightly bigger batch of 600,000 rows in a ~16MB file and I can't manage to run it in one go. I tried two things:
1. Setting 'Max payload size' of the Transform job to its maximum (100 MB):
    transformer = sagemaker.transformer.Transformer(
        model_name = config.model_name,
        instance_count = config.inference_instance_count,
        instance_type = config.inference_instance_type,
        output_path = "s3://{}/{}".format(config.bucket, config.s3_inference_output_folder),
        sagemaker_session = sagemaker_session,
        base_transform_job_name = config.inference_job_prefix,
        max_payload = 100
    )
However, I still get an error (through console CloudWatch logs):
413 Request Entity Too Large
The data value transmitted exceeds the capacity limit.
2. Setting max_payload to 0, which, by specification, Amazon SageMaker should interpret as no limit on the payload size.
In that case the job finishes successfully, but the output file is empty (0 bytes).
Any ideas about what I'm doing wrong, or how to run a bigger batch?
Most SageMaker algorithms set their own default execution parameters, with MaxPayloadInMB set to 6 MB, so if you are getting a 413 from a SageMaker algorithm, you are likely exceeding the maximum payload it can support. Assuming each row in the file is smaller than 6 MB, you can fix this by leaving MaxPayloadInMB unset, so that it falls back to the algorithm's default size, and setting SplitType to "Line" instead, so that the data is split into smaller batches (https://docs.aws.amazon.com/sagemaker/latest/dg/API_TransformInput.html#SageMaker-Type-TransformInput-SplitType).
This helped me resolve the issue, by setting strategy='SingleRecord' in the transformer. You can also use a more powerful instance via instance_type and distribute the work via instance_count.
I have tried the above solutions, but unfortunately they didn't work for me.
Here is what worked for me: https://stackoverflow.com/a/55920737/7091978
Basically, I changed "max_payload" from 0 to 1.
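To make the SplitType "Line" suggestion above more concrete, here is a rough sketch of the idea of packing line-delimited records into mini-batches that stay under a payload limit. It is only an illustration in plain Python, not SageMaker's implementation; the function name group_lines_by_payload and the byte limit used in the example are hypothetical.

```python
def group_lines_by_payload(lines, max_payload_bytes):
    """Hypothetical helper: pack lines into batches no larger than max_payload_bytes."""
    batches, current, current_size = [], [], 0
    for line in lines:
        size = len(line.encode("utf-8")) + 1  # +1 for the trailing newline
        if size > max_payload_bytes:
            raise ValueError("a single record exceeds the payload limit")
        # Start a new batch when adding this line would exceed the limit.
        if current and current_size + size > max_payload_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Ten CSV rows of 12 bytes each (11 characters plus newline), limit of 50 bytes.
rows = ["1.0,2.0,3.0"] * 10
mini_batches = group_lines_by_payload(rows, max_payload_bytes=50)
batch_lengths = [len(batch) for batch in mini_batches]
```

With a limit of one record per batch this degenerates into the 'SingleRecord' behaviour mentioned above; a single row larger than the limit can never be sent, which is the situation the 413 error reports.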

Tensorflow batching without extra None dimension?

Is it possible to do batching in tensorflow without expanding the placeholder size by an extra dimension of None? Specifically I'd just like to feed multiple samples via the placeholders through feed_dict. The code base I'm working on would require a large amount of change to the code to account for adding an extra dimension for the batch size.
For example:
sess.run(feed_dict={var1:val1values, var2: val2values, ...})
Where val1values would represent a batch of size X instead of just one training sample.
The shape information including the number of dimensions is available to Python code to do arbitrary things with, and does affect the ops added to the graph (like which matmul kernel is used), so there's no general safe way to automatically add a batch dimension. Something like labeled_tensor may make code slightly less confusing to refactor.

Avoiding exhausting GPU resources in convNN Tensorflow

I'm trying to run a hyperparameter optimization script, for a convNN using Tensorflow.
As you may know, TF's handling of GPU memory isn't that fancy (I don't think it ever will be, thanks to the TPU). So my question is: how do I choose the filter dimensions and the batch size so that the GPU memory doesn't get exhausted?
Here's the equation that I'm thinking of:
image_shape = 128x128x3 (3 color channels)
batchSize = 20 (the smallest possible batch size, since I have 20 classes)
filter_shape = fw_fh_fd [filter_width=4, filter_height=4, filter_depth=32]
As far as I understand, using the tf.conv2d function will need the following amount of memory:
image_width * image_height * numberOfChannels * batchSize * filter_height * filter_width * filter_depth * 32 bits
since we're using the tf.float32 type for each pixel.
In the given example, the needed memory will be:
128x128x3x20x4x4x32x32 = 16106127360 bits, which is almost 16 Gbit (about 2 GB) of memory.
I'm not sure the formula is correct, so I hope to get a validation or a correction of what I'm missing.
Actually, this will take only about 44MB of memory, mostly taken by the output.
Your input is 20x128x128x3
The convolution kernel is 4x4x3x32
The output is 20x128x128x32
When you sum up the total, you get
(20*128*128*3 + 4*4*3*32 + 20*128*128*32) * 4 / 1024**2 ≈ 44MB
(In the above, 4 is for the size in bytes of float32 and 1024**2 is to get the result in MB).
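The same arithmetic can be checked with a few lines of Python. This is only a sketch of the estimate above: it counts the input, kernel and output tensors, and it assumes stride 1 with "same" padding so that the output keeps the 128x128 spatial size. The helper name num_elements is hypothetical.

```python
BYTES_PER_FLOAT32 = 4


def num_elements(shape):
    """Hypothetical helper: number of elements in a tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count


input_shape = (20, 128, 128, 3)    # batch, height, width, channels
kernel_shape = (4, 4, 3, 32)       # height, width, in_channels, out_channels
output_shape = (20, 128, 128, 32)  # assumes stride 1 and "same" padding

shapes = [input_shape, kernel_shape, output_shape]
total_bytes = sum(num_elements(s) for s in shapes) * BYTES_PER_FLOAT32
total_mib = total_bytes / 1024**2
output_mib = num_elements(output_shape) * BYTES_PER_FLOAT32 / 1024**2

print("total: %.1f MB, output only: %.1f MB" % (total_mib, output_mib))
```

The output tensor alone accounts for 40 MB of the total, which is why it dominates the estimate.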
Your batch size can be smaller than your number of classes. Think about ImageNet and its 1000 classes: people are training with batch sizes 10 times smaller.
EDIT
Here is a tensorboard screenshot of the net. It reports 40MB rather than 44MB, probably because it excludes the input, and it also shows all the tensor sizes I mentioned earlier.

Fitting Large Matrix Calculations into Memory when using Tensorflow

I am attempting to build a model which has two phases.
The first takes an input image and passes it through a conv-deconv network. The resulting Tensor has entries corresponding to pixels in a desired output image (same size as the input image).
To calculate the final output image, I want to take the value generated at each pixel location from the first phase and use it as an additional input to a reduction function that is applied over the entire input image. This second step has no trainable variables, but it does have computation/memory costs that grow quadratically with the number of input pixels (each output pixel is a function of all input pixels).
I'm currently using tf.map_fn to calculate the output image, mapping the output pixel calculation function onto the results from the first phase. I hoped that TensorFlow would allocate the memory to store the intermediate tensors needed for each pixel calculation and then free that memory before moving on to the next pixel calculation. Instead, it seems to never free the intermediate results, causing OOM errors.
Is there some way to tell TensorFlow (either explicitly or implicitly) that it should free the memory allocated to hold the data of a Tensor that is no longer needed in the calculation?
TensorFlow deallocates memory for the tensor as soon as the tensor is no longer needed for any future calculations. You can verify this by looking at memory deallocation messages as shown in this notebook.
It's possible you are running out of memory because TensorFlow executes nodes in a memory-inefficient order.
As an example, consider the following computation:
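    import tensorflow as tf

    k = 2000
    n = 10  # number of chained matmuls; not given in the original, any positive integer works
    a = tf.random_uniform(shape=(k, k))
    for i in range(n):
        a = tf.matmul(a, tf.random_uniform(shape=(k, k)))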
The order in which it is evaluated is shown below.
All the circle (tf.random_uniform) nodes are evaluated first, followed by the squares (tf.matmul). This order has an O(n) memory requirement, compared to O(1) for the optimal order.
You can use control dependencies to force a specific execution order, i.e., using a helper function like the one below:
    import tensorflow.contrib.graph_editor as ge

    def run_after(a_tensor, b_tensor):
        """Force a to run after b"""
        ge.reroute.add_control_inputs(a_tensor.op, [b_tensor.op])
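To see why the execution order matters for the chained-matmul example above, here is a toy accounting of how many k x k matrices are alive at once under the two orders. It is a simplified model written in plain Python, not a description of TensorFlow's actual scheduler; the function name peak_live_matrices is hypothetical, and it assumes each matmul's inputs are freed as soon as its output exists.

```python
def peak_live_matrices(n, interleaved):
    """Hypothetical toy model: peak number of k x k matrices held in memory."""
    live = 1  # the initial matrix a
    peak = live
    if not interleaved:
        # All n random matrices are materialized before any matmul runs.
        live += n
        peak = max(peak, live)
    for _ in range(n):
        if interleaved:
            live += 1  # create this step's random matrix just in time
        live += 1      # the matmul output
        peak = max(peak, live)
        live -= 2      # free the previous a and the consumed random matrix
    return peak


eager_peak = peak_live_matrices(10, interleaved=False)
lazy_peak = peak_live_matrices(10, interleaved=True)
```

In this model the "all random_uniform nodes first" order peaks at n + 2 matrices, while the interleaved order, which is what control dependencies such as run_after are meant to enforce, stays at 3 regardless of n.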

Getting each example exactly once

For monitoring my model's performance on my evaluation dataset, I'm using tf.train.string_input_producer for the filename queue over .tfr files. I then feed the parsed examples to the tf.train.batch function, which produces batches of a fixed size.
Assume my evaluation dataset contains exactly 761 examples (a prime number). To read all the examples exactly once, I have to have a batch size that divides 761, but there is no such number except 1, which would be too slow, and 761, which would not fit on my GPU. Is there a standard way of reading each example exactly once?
Actually, my dataset size is not 761, but there is no number in the reasonable range of 50-300 that divides it exactly. Also I'm working with many different datasets, and finding a number that approximately divides the number of examples in each dataset can be a hassle.
Note that using the num_epochs parameter to tf.train.string_input_producer does not solve the issue.
Thanks!
You can use reader.read_up_to as in this example. Your last batch will be smaller, so you need to make sure your network doesn't hard-wire the batch size anywhere.
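The "smaller last batch" behaviour can be illustrated without TensorFlow. The following is a minimal sketch in plain Python of reading every example exactly once when the dataset size is not a multiple of the batch size; the generator name batches_exactly_once is hypothetical, and it only mimics the idea behind read_up_to rather than calling it.

```python
def batches_exactly_once(examples, batch_size):
    """Hypothetical helper: yield full batches, then one smaller final batch."""
    batch = []
    for example in examples:
        batch.append(example)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # leftover examples form the last, smaller batch
        yield batch


# 761 examples (the prime from the question) with a batch size of 100.
batch_sizes = [len(b) for b in batches_exactly_once(range(761), 100)]
```

Here the first seven batches hold 100 examples and the last one holds 61, so any per-batch metric should be weighted by the batch length when it is averaged.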