Migrating TensorFlow Serving 1 to 2, RESOURCE_EXHAUSTED from gRPC calls - tensorflow

We are trying to migrate from /tensorflow/serving:1.12 to /tensorflow/serving:2.8.2. We use gRPC calls to get recommendations, and we have two types of calls: a per-member call, where the gRPC request asks for recommendations for a single member, and a batch call, where we send batches of members. With TF1 everything worked fine. With TF2 the per-member calls still work fine, but for the batch calls we get an error on the gRPC side:
[2022-10-19 16:22:01.454] [ Thread-53] [batch-slave:e4ccd2a7-e0d7-4cc0-9481-b3d98fec3df8] ERROR AgoraRecSysBatchService - Failed while processing : e4ccd2a7-e0d7-4cc0-9481-b3d98fec3df8
io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED
    at io.grpc.Status.asRuntimeException(Status.java:535)
    at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:534)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
    at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
    at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Here is our base image:
FROM tensorflow/serving:2.8.2
COPY entry.py /usr/bin/entry.py
COPY requirements.txt /requirements.txt
RUN apt update && \
    apt install -y python3 python3-pip wget && \
    mkdir /bucket && \
    chmod +x /usr/bin/entry.py && \
    apt autoremove -y && \
    rm -rf /var/lib/apt/lists/* && \
    pip3 install -r /requirements.txt
RUN wget -P /bin https://github.com/grpc-ecosystem/grpc-health-probe/releases/download/v0.2.0/grpc_health_probe-linux-amd64 && \
    mv /bin/grpc_health_probe-linux-amd64 /bin/grpc_health_probe && \
    chmod +x /bin/grpc_health_probe
# ENTRYPOINT ["/bin/bash"]
ENTRYPOINT ["/usr/bin/entry.py"]
In production we have a configured batch size of 100 members per request, and we can have up to 30K members in one call.
The client asks for recommendations for 30K members over HTTP -> we receive the request in one pod and issue mini-batch gRPC requests to TF Serving until we have all of them, then we reply with an HTTP response covering the whole 30K.
Do you have any idea what could be the issue, especially regarding the exhausted resources? There is nothing that can serve as a lead: we have sufficient pods and sufficient resources, we are using GKE, and there is no spike in the pods.
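For context, the mini-batch flow looks roughly like the sketch below (our actual client is Java, but this is an equivalent Python sketch; the endpoint, model name, signature, and input key are placeholders, not our real service). The gRPC channel options are included because oversized request/response messages, beyond the ~4 MB default receive limit, are one common trigger for a client-side RESOURCE_EXHAUSTED:
# Rough sketch of the mini-batch client; all names are placeholders.
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

MEMBERS_PER_REQUEST = 100

channel = grpc.insecure_channel(
    "tf-serving:8500",  # placeholder endpoint
    options=[
        # Raise the message-size limits; the gRPC default receive limit is ~4 MB.
        ("grpc.max_send_message_length", 32 * 1024 * 1024),
        ("grpc.max_receive_message_length", 32 * 1024 * 1024),
    ],
)
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

def recommend_all(member_ids):
    """Split a large member list into mini-batches and call TF Serving for each."""
    responses = []
    for start in range(0, len(member_ids), MEMBERS_PER_REQUEST):
        chunk = member_ids[start:start + MEMBERS_PER_REQUEST]
        request = predict_pb2.PredictRequest()
        request.model_spec.name = "recommender"               # placeholder model name
        request.model_spec.signature_name = "serving_default"
        request.inputs["member_id"].CopyFrom(                 # placeholder input key
            tf.make_tensor_proto(chunk, dtype=tf.int64))
        responses.append(stub.Predict(request, timeout=10.0))
    return responses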
Edit:
The code we use to train with TF2:
def train_step(self, batch):
    with tf.GradientTape() as tape:
        scores = self(batch, training=True)['similarity']
        scores = tf.keras.activations.sigmoid(scores)
        pos_scores, neg_scores = self.split_pos_neg(scores)
        labels = self.get_classification_labels()
        loss = self.compiled_loss(labels, scores)
    trainable_vars = self.trainable_variables
    gradients = tape.gradient(loss, trainable_vars)
    self.optimizer.apply_gradients(zip(gradients, trainable_vars))
    self.histogram(pos_scores, neg_scores)
    self.compiled_metrics.update_state(labels, scores)
    results = {m.name: m.result() for m in self.metrics}
    results['pos_neg_score'] = pos_better_neg(pos_scores, neg_scores)
    return results
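For context, this train_step is presumably an override on a tf.keras.Model subclass and is driven by the standard compile/fit loop; a minimal sketch of that wiring (the class name, optimizer, loss, metrics, and dataset below are placeholders, not our actual setup):
# Sketch only: RecommenderModel and train_dataset stand in for the real objects.
model = RecommenderModel()
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.AUC()],
)
model.fit(train_dataset, epochs=10)  # every batch is routed through train_step above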

TF1 runs in graph mode by default (a session fed through placeholders or tf.Variable), while TF2 runs in eager mode by default and uses tf.GradientTape, which records operations on watched variables and accumulates what is needed to compute gradients.
A TF1 session required a placeholder or tf.Variable:
X = tf.compat.v1.placeholder(tf.float32, shape=(1, 28, 28))
y = tf.compat.v1.placeholder(tf.float32, shape=(1,))
loss = tf.reduce_mean(input_tensor=tf.square(X - y))
optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)

with tf.compat.v1.Session() as sess:
    saver = tf.compat.v1.train.Saver()
    train_loss, _ = sess.run([loss, training_op], feed_dict={X: list_image, y: list_label})
GradientTape watches variables and accumulates the values needed to compute gradients:
with tf.GradientTape() as tape:
    # Keep the forward pass and the loss inside the tape so gradients can flow.
    result = self.model(tf.constant(content_batch[0], shape=(1, WIDTH, HEIGHT, CHANNEL)))
    loss_value = self.loss(current_label, result)
gradients = tape.gradient(loss_value, self.model.trainable_weights)
self.optimizer.apply_gradients(zip(gradients, self.model.trainable_weights))

Related

TensorBoard visualization doesn't appear in Google Colab

I am implementing simple linear regression code in Google Colab and trying to visualize the results with TensorBoard using the following command:
%tensorboard --logdir=/tmp/lr-train
However, when I run this command, TensorBoard simply does not show up. Instead I just see the following message: Reusing TensorBoard on port 6012 (pid 3219), started 0:07:06 ago. (Use '!kill 3219' to kill it.)
How do I launch TensorBoard in my case? Here is the code I am trying to run:
%tensorflow_version 1.x
import tensorflow as tf
import numpy as np

N = 100
x_zeros = np.random.multivariate_normal(
    mean=np.array((-1, -1)), cov=0.1 * np.eye(2), size=(N//2))
y_zeros = np.zeros((N//2,))
x_ones = np.random.multivariate_normal(mean=np.array((1, 1)), cov=0.1 * np.eye(2), size=(N//2))
y_ones = np.zeros((N//2))
x_np = np.vstack([x_zeros, x_ones])
y_np = np.concatenate([y_zeros, y_ones])

with tf.name_scope("placeholders"):
    x = tf.placeholder(tf.float32, (N, 2))
    y = tf.placeholder(tf.float32, (N, 1))
with tf.name_scope("weights"):
    W = tf.Variable(tf.random_normal((2, 1)))
    b = tf.Variable(tf.random_normal((1,)))
with tf.name_scope("prediction"):
    y_pred = tf.matmul(x, W) + b
with tf.name_scope("loss"):
    l = tf.reduce_sum((y - y_pred)**2)
with tf.name_scope("optim"):
    train_op = tf.train.AdamOptimizer(0.05).minimize(l)
with tf.name_scope("summaries"):
    tf.summary.scalar("loss", l)
    merged = tf.summary.merge_all()

train_writer = tf.summary.FileWriter('/tmp/lr-train', tf.get_default_graph())
n_steps = 100
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Train model
    for i in range(n_steps):
        feed_dict = {x: x_np, y: y_np.reshape(-1, 1)}
        _, summary, loss = sess.run([train_op, merged, l], feed_dict=feed_dict)
        if i % 10 == 0:
            print("step %d, loss: %f" % (i, loss))
I tried an example in TF2, and TensorBoard launched without any issues with the same command.
I tried your code in Colab and was able to reproduce what you mentioned and found a solution that worked as described below.
Use a space between --logdir and /tmp/lr-train instead of an =.
What did not work as mentioned in the question:
%load_ext tensorboard
%tensorboard --logdir=/tmp/lr-train
What did work:
%load_ext tensorboard
%tensorboard --logdir /tmp/lr-train
You can also terminate your active session and then rerun all cells afterwards. Or you may find a way of killing the TensorBoard instance from within the session.
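If a stale instance keeps being reused, you can also inspect and kill it from the notebook itself. A small sketch using the tensorboard.notebook helper that ships with the tensorboard package (the pid is just the one reported in your message):
from tensorboard import notebook

notebook.list()                        # list running TensorBoard instances (ports and pids)
!kill 3219                             # kill the stale instance reported by the reuse message
%tensorboard --logdir /tmp/lr-train    # relaunch, with a space instead of '='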

Why is the deeplab v3+ model confused about pixels outside image boundary?

I'm using the google research github repository to run deeplab v3+ on my dataset to segment parts of a car. The crop size I've used is 513,513 (default) and the code adds a boundary to images smaller than that size (correct me if I'm wrong).
[example image]
The model seems to be performing poorly on the added boundary. Is there something I'm supposed to correct, or will the model do fine with more training?
Update: Here are the TensorBoard graphs for training. Why is the regularization loss shooting up like that? The output seems to be improving; can someone help me make inferences from these graphs?
Is there something I'm supposed to correct or will the model do fine with more training ?
It's OK, don't mind the boundary.
To run inference you can use this code:
import cv2
import tensorflow as tf
import numpy as np
from PIL import Image
from skimage.transform import resize


class DeepLabModel():
    """Class to load deeplab model and run inference."""
    INPUT_TENSOR_NAME = 'ImageTensor:0'
    OUTPUT_TENSOR_NAME = 'SemanticPredictions:0'
    INPUT_SIZE = 513

    def __init__(self, path):
        """Creates and loads pretrained deeplab model."""
        self.graph = tf.Graph()
        graph_def = None
        # Extract frozen graph from tar archive.
        with tf.gfile.GFile(path, 'rb') as file_handle:
            graph_def = tf.GraphDef.FromString(file_handle.read())
        if graph_def is None:
            raise RuntimeError('Cannot find inference graph')
        with self.graph.as_default():
            tf.import_graph_def(graph_def, name='')
        self.sess = tf.Session(graph=self.graph)

    def run(self, image):
        """Runs inference on a single image.

        Args:
            image: A PIL.Image object, raw input image.

        Returns:
            seg_map: np.array. values of pixels are classes
        """
        width, height = image.size
        resize_ratio = 1.0 * self.INPUT_SIZE / max(width, height)
        target_size = (int(resize_ratio * width), int(resize_ratio * height))
        resized_image = image.convert('RGB').resize(target_size, Image.ANTIALIAS)
        batch_seg_map = self.sess.run(
            self.OUTPUT_TENSOR_NAME,
            feed_dict={self.INPUT_TENSOR_NAME: [np.asarray(resized_image)]})
        seg_map = batch_seg_map[0]
        seg_map = resize(seg_map.astype(np.uint8), (height, width), preserve_range=True, order=0, anti_aliasing=False)
        return seg_map
The code is based on this file: https://github.com/tensorflow/models/blob/master/research/deeplab/deeplab_demo.ipynb
model = DeepLabModel(your_model_pb_path)
img = Image.open(img_path)
seg_map = model.run(img)
To get your_model_pb_path you need to export your model to a .pb file.
You can do it using the export_model.py file in the Deeplab repo:
https://github.com/tensorflow/models/blob/master/research/deeplab/export_model.py
If you were training the xception_65 variant:
python3 <path to your deeplab folder>/export_model.py \
    --logtostderr \
    --checkpoint_path=<your ckpt> \
    --export_path="./my_model.pb" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --num_classes=<NUMBER OF YOUR CLASSES> \
    --crop_size=513 \
    --crop_size=513 \
    --inference_scales=1.0
<your ckpt> is the path to your trained model checkpoint; you can find the checkpoints in the folder that you passed as the --train_logdir argument when training.
You need to include only the model name and the number of iterations in the path. In other words, your training folder will contain, for example, the files
model-1500.meta, model-1500.index and model-1500.data-00000-of-00001; discard everything after the ., so the ckpt path will be model-1500.
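If you prefer to pick the prefix programmatically, here is a small sketch (the directory path is a placeholder and the file naming follows the example above):
import glob
import os

train_logdir = "/path/to/train_logdir"   # the folder you passed as --train_logdir
# Take the .index file with the highest iteration number and strip the extension.
index_files = glob.glob(os.path.join(train_logdir, "model-*.index"))
newest = max(index_files, key=lambda p: int(p.rsplit("-", 1)[1].split(".")[0]))
ckpt_prefix = os.path.splitext(newest)[0]   # e.g. .../model-1500
print(ckpt_prefix)                          # pass this value as --checkpoint_path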
Please make sure that the atrous_rates are the same as the ones you used to train the model.
If you were training the mobilenet_v2 variant:
python3 <path to your deeplab folder>/export_model.py \
    --logtostderr \
    --checkpoint_path=<your ckpt> \
    --export_path="./my_model.pb" \
    --model_variant="mobilenet_v2" \
    --num_classes=<NUMBER OF YOUR CLASSES> \
    --crop_size=513 \
    --crop_size=513 \
    --inference_scales=1.0
You can find more here:
https://github.com/tensorflow/models/blob/master/research/deeplab/local_test_mobilenetv2.sh
https://github.com/tensorflow/models/blob/master/research/deeplab/local_test.sh
You can visualize results using this code
img_arr = np.array(img)
# as many colors as you have classes
colors = [(255, 0, 0), (0, 255, 0), ...]
for c in range(0, N_CLASSES):
    img_arr[seg_map == c] = (0.5 * img_arr[seg_map == c] + 0.5 * np.array(colors[c])).astype(np.uint8)
cv2.imshow('segmentation', img_arr)
cv2.waitKey(0)

Colaboratory VM restarts like clockwork every 45 minutes

I'm running Python 2 on a GPU-enabled instance. I am training an LSTM and saving it every 10 cycles. Without fail, the VM restarts every 45 minutes (just before 50 cycles are completed). This has been happening for several days, both on my home wifi (Comcast) and work wifi. I suspect the problem is something native to Google's settings or the Notebook settings, but I can't find anything to tweak this.
My question is: has anyone encountered this? How did you resolve it?
I've included my code here, but I don't think this is code related. It's failing in the last if epoch % epoch_saving_period ... block.
pickle.dump((seq_length, save_dir), open('params.p', 'wb'))

batches = get_batches(corpus_int, batch_size, seq_length)
num_batches = len(batches)
start_time = time.time()
print "Process started"

last_checkpoint_prefix = '/tmp/pretrained.ckpt-' + str(last_epoch)

tf.reset_default_graph()
with tf.Session(graph=train_graph) as sess:
    session_config = config
    saver = tf.train.Saver(tf.global_variables())
    #tf.add_to_collection('train_op', train_op)

    # If you're loading in a saved model, use the following
    if (last_epoch > 0):
        #saver = tf.train.import_meta_graph(last_checkpoint_prefix + '.meta')
        saver.restore(sess, tf.train.latest_checkpoint('/tmp/'))
        sess.run(tf.local_variables_initializer())
    else:
        # If you're running a fresh session, use the following
        sess.run(tf.global_variables_initializer())

    input_text = train_graph.get_tensor_by_name('input:0')
    initial_state = train_graph.get_tensor_by_name('initial_state:0')
    final_state = train_graph.get_tensor_by_name('final_state:0')
    probs = train_graph.get_tensor_by_name('probs:0')
    targets = train_graph.get_tensor_by_name('targets:0')
    lr = train_graph.get_tensor_by_name('learning_rate:0')
    #init_from_checkpoint('/tmp/pretrained.ckpt', {'input': 'input',
    #                                              'final_state': 'initial_state',
    #                                              'targets': 'targets',
    #                                              'learning_rate': 'learning_rate'})

    epochList = []
    lossList = []
    epoch_saving_period = 10
    epoch = 0
    for epoch in range(last_epoch, (last_epoch + num_epochs)):
        state = sess.run(initial_state, {input_text: batches[0][0]})

        for batch_index, (x, y) in enumerate(batches):
            feed_dict = {
                input_text: x,
                targets: y,
                initial_state: state,
                lr: learning_rate * math.exp(-1 * epoch / 1000)  # decayed learning rate fed to the graph
            }
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed_dict)

            time_elapsed = time.time() - start_time
            print('Epoch {:>3} Batch {:>4}/{} train_loss = {:.3f} time_elapsed = {:.3f}'.format(
                epoch + 1,
                batch_index + 1,
                len(batches),
                train_loss,
                time_elapsed
                #((num_batches * num_epochs)/((epoch + 1) * (batch_index + 1))) * time_elapsed - time_elapsed),
            ))

        epochList.append(epoch)
        lossList.append(train_loss)

        # save model every 10 epochs
        if epoch % epoch_saving_period == 0:
            last_epoch = epoch - epoch_saving_period
            #saver = tf.train.Saver()
            #saver.save(sess, save_dir)
            savePath = saver.save(sess, "/tmp/pretrained.ckpt", global_step=epoch, write_meta_graph=True)

            # Copy the file to our new bucket.
            # Full reference: https://cloud.google.com/storage/docs/gsutil/commands/cp
            !gsutil cp /tmp/checkpoint gs://{bucket_name}/
            !gsutil cp /tmp/pretrained.ckpt-{epoch}.index gs://{bucket_name}/
            !gsutil cp /tmp/pretrained.ckpt-{epoch}.meta gs://{bucket_name}/
            !gsutil cp /tmp/pretrained.ckpt-{epoch}.data-00000-of-00001 gs://{bucket_name}/
            !gsutil rm gs://{bucket_name}/pretrained.ckpt-{last_epoch}.index
            !gsutil rm gs://{bucket_name}/pretrained.ckpt-{last_epoch}.meta
            !gsutil rm gs://{bucket_name}/pretrained.ckpt-{last_epoch}.data-00000-of-00001

            print('Model Trained and Saved')

        if epoch % 5 == 0:
            plt.plot(epochList, lossList)
            plt.title('Train Loss')
            plt.show()

Launching TensorBoard from Google Cloud Datalab

I need help in launching TensorBoard from TensorFlow running on Datalab.
My code is the following (everything is on Datalab):
import tensorflow as tf

with tf.name_scope('input'):
    print("X_np")
    X_np = tf.placeholder(tf.float32, shape=[None, num_of_features], name="input")
with tf.name_scope('weights'):
    print("W is for weights & - 15 number of diseases")
    W = tf.Variable(tf.zeros([num_of_features, 15]), name="W")
with tf.name_scope('biases'):
    print("b")
    # todo: automate for more diseases
    b = tf.Variable(tf.zeros([15]), name="biases")
with tf.name_scope('layer'):
    print("y_train_np")
    y_train_np = tf.nn.softmax(tf.matmul(X_np, W) + b)
with tf.name_scope('correct'):
    print("y_ - placeholder for correct answer")
    y_ = tf.placeholder(tf.float32, shape=[None, 15], name="correct_answer")
with tf.name_scope('loss'):
    print("cross entropy")
    cross_entropy = -tf.reduce_sum(y_ * tf.log(y_train_np))

# % of correct answers found in batch
print("is correct")
is_correct = tf.equal(tf.argmax(y_train_np, 1), tf.argmax(y_, 1))
print("accuracy")
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))
print("train step")
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

# train data and get results for batches
print("initialize all variables")
init = tf.global_variables_initializer()
print("session")
sess = tf.Session()
writer = tf.summary.FileWriter("logs/", sess.graph)
init = tf.global_variables_initializer()
sess.run(init)
!tensorboard --logdir=/logs
the output is:
Starting TensorBoard 41 on port 6006
(You can navigate to http://172.17.0.2:6006)
However, when I click on the link, the webpage is empty
Please let me know what I am missing. I am expecting to see the graph; later I would like to generate more data. Any suggestion is appreciated.
Many thanks!
If you are using Datalab, you can use TensorBoard as below:
from google.datalab.ml import TensorBoard as tb
tb.start('./logs')
http://googledatalab.github.io/pydatalab/google.datalab.ml.html
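The same TensorBoard class also exposes list() and stop(pid) (see the pydatalab docs above), which helps if an old instance is still holding the port; for example:
from google.datalab.ml import TensorBoard as tb

tb.start('./logs')    # start TensorBoard against the log directory
tb.list()             # show running instances with their pids and ports
# tb.stop(pid)        # stop an instance using a pid from the listing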
You can also create a Cloud AI Platform Notebook instance with TensorBoard support by entering the following command into the Cloud Shell. Afterwards you can simply launch TensorBoard whenever you want from the launcher (File -> New Launcher -> Tensorboard).
export IMAGE_FAMILY="tf-1-14-cpu"
export ZONE="us-west1-b"
export INSTANCE_NAME="tf-tensorboard-1"
export INSTANCE_TYPE="n1-standard-4"
gcloud compute instances create "${INSTANCE_NAME}" \
    --zone="${ZONE}" \
    --image-family="${IMAGE_FAMILY}" \
    --image-project=deeplearning-platform-release \
    --machine-type="${INSTANCE_TYPE}" \
    --boot-disk-size=200GB \
    --scopes=https://www.googleapis.com/auth/cloud-platform \
    --metadata="proxy-mode=project_editors"

I got an error when running a GitHub project in TensorFlow

DCGAN
When I run the project, I get the following error.
ValueError: Variable d_h0_conv/w/Adam/ does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?
The relevant part of the code is below.
The optimizer:
d_optim = tf.train.AdamOptimizer(config.learning_rate, beta1=config.beta1) \
    .minimize(self.d_loss, var_list=self.d_vars)
g_optim = tf.train.AdamOptimizer(config.learning_rate, beta1=config.beta1) \
    .minimize(self.g_loss, var_list=self.g_vars)
The variables:
self.d_vars = [var for var in t_vars if 'd_' in var.name]
self.g_vars = [var for var in t_vars if 'g_' in var.name]
The operation:
def conv2d(input_, output_dim,
           k_h=5, k_w=5, d_h=2, d_w=2, stddev=0.02,
           name="conv2d"):
    with tf.variable_scope(name):
        w = tf.get_variable('w', [k_h, k_w, input_.get_shape()[-1], output_dim],
                            initializer=tf.truncated_normal_initializer(stddev=stddev))
        conv = tf.nn.conv2d(input_, w, strides=[1, d_h, d_w, 1], padding='SAME')
        biases = tf.get_variable('biases', [output_dim], initializer=tf.constant_initializer(0.0))
        conv = tf.reshape(tf.nn.bias_add(conv, biases), conv.get_shape())
        return conv
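For context, the error message is about variable-scope reuse with tf.get_variable; a tiny standalone illustration of that behavior (not code from the DCGAN project):
import tensorflow as tf

with tf.variable_scope("d_h0_conv"):
    w = tf.get_variable("w", shape=[5, 5, 3, 64])          # created while reuse is off

with tf.variable_scope("d_h0_conv", reuse=True):
    w_again = tf.get_variable("w", shape=[5, 5, 3, 64])    # OK: reuses the existing 'w'
    # Creating a brand-new variable here (for example an optimizer slot such as
    # 'd_h0_conv/w/Adam') raises the same ValueError, because reuse=True forbids
    # creating variables that do not already exist.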
Environment:
Ubuntu 14.04, Python 2.7, TensorFlow 0.12
Thank you for your help.
I assume you were running the command to train the network after pulling the data.
I was able to clone the project, pull the image data sets, and run the training command using Python 3.5 on Ubuntu with TensorFlow 0.12. The commands are only slightly different
(e.g. python3 main.py --dataset mnist --is_train True vs python...).
I know this project supports Python 2.7, but are you able to run the project using Python 3?