Is the TensorFlow Serving example expected to have a 10% error rate? - tensorflow-serving

I followed the docs "Building an optimized serving binary" and then "Testing the development environment", and I got:
Inference error rate: 10.4%
Is it expected that a fresh install of a release build of TensorFlow Serving would give a 10% error rate on the provided example model?
My environment:
AWS EC2
OS: Amazon Linux AMI release 2018.03
Instance Type: r5.large
Steps to reproduce:
# download tensorflow serving code
git clone https://github.com/tensorflow/serving
cd serving
# build optimized serving binary
docker build --pull -t $USER/tensorflow-serving-devel -f tensorflow_serving/tools/docker/Dockerfile.devel .
# run & open shell for generated docker image
docker run -it -p 8600:8600 ec2-user/tensorflow-serving-devel:latest
# train the mnist model
python tensorflow_serving/example/mnist_saved_model.py /tmp/mnist_model
# serve the model
tensorflow_model_server --port=8500 --model_name=mnist --model_base_path=/tmp/mnist_model/ &
# test the client
python tensorflow_serving/example/mnist_client.py --num_tests=1000 --server=localhost:8500

The example mnist_saved_model.py that ships with tensorflow_serving is focused on being quick to train and on demonstrating how to save a model, rather than on accuracy.
The tutorial at https://www.tensorflow.org/tfx/serving/serving_advanced shows that when the above code is trained with 100 iterations it has a 13.1% error rate, and when trained with 2000 iterations it has a 9.5% error rate.
The default when --training_iteration is not specified is 1000, so your 10.4% error rate is in line with these results.
You will find that this MNIST model provides better accuracy (and takes much longer to train): https://github.com/tensorflow/models/tree/master/official/mnist
This model will work with the slight changes to the mnist_client.py example shown below.
try this:
train the mnist model
git clone https://github.com/tensorflow/models
export PYTHONPATH="$PYTHONPATH:$PWD/models"
pushd models/official/mnist/
python mnist.py --export_dir /tmp/mnist_model
serve the model
tensorflow_model_server --port=8500 --model_name=mnist --model_base_path=/tmp/mnist_model/ &
switch back to the original directory
popd
make the following changes to mnist_client.py to work with the new model
diff --git a/tensorflow_serving/example/mnist_client.py b/tensorflow_serving/example/mnist_client.py
index 592969d..85ef0bf 100644
--- a/tensorflow_serving/example/mnist_client.py
+++ b/tensorflow_serving/example/mnist_client.py
@@ -112,8 +112,8 @@ def _create_rpc_callback(label, result_counter):
sys.stdout.write('.')
sys.stdout.flush()
response = numpy.array(
- result_future.result().outputs['scores'].float_val)
- prediction = numpy.argmax(response)
+ result_future.result().outputs['classes'].int64_val)
+ prediction = response[0]
if label != prediction:
result_counter.inc_error()
result_counter.inc_done()
@@ -143,9 +143,9 @@ def do_inference(hostport, work_dir, concurrency, num_tests):
for _ in range(num_tests):
request = predict_pb2.PredictRequest()
request.model_spec.name = 'mnist'
- request.model_spec.signature_name = 'predict_images'
+ request.model_spec.signature_name = 'classify'
image, label = test_data_set.next_batch(1)
- request.inputs['images'].CopyFrom(
+ request.inputs['image'].CopyFrom(
tf.contrib.util.make_tensor_proto(image[0], shape=[1, image[0].size]))
result_counter.throttle()
result_future = stub.Predict.future(request, 5.0) # 5 seconds
test the client
python tensorflow_serving/example/mnist_client.py --num_tests=1000 --server=localhost:8500
Inference error rate: 0.8%

Is the TensorFlow Serving example expected to have a 10% error rate?
Yes, this particular example is expected to have a roughly 10% error rate, as the accuracy of this model on the training and testing data is almost the same (around 90%); it is a very basic neural network, as shown here.
If you want good prediction accuracy, you might have to use resnet_client.py, or you can add more layers and tune the hyper-parameters to get higher prediction accuracy and a lower inference error rate; a sketch of a slightly deeper model is shown below.
A tutorial on how to serve the ResNet model is given here. This should give you a much lower inference error rate.
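To illustrate the "add more layers" suggestion, here is a minimal sketch of a slightly deeper Keras MNIST classifier. The layer sizes, training settings, and export path are assumptions for illustration, not part of the original example:
import tensorflow as tf

# Load MNIST and scale pixel values to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A small fully connected network; deeper than the softmax-only example.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
print(model.evaluate(x_test, y_test))  # typically around 97-98% test accuracy

# Export a SavedModel that tensorflow_model_server can load.
tf.saved_model.save(model, "/tmp/mnist_model/1")
Note that serving this model would also require adjusting mnist_client.py to match its signature, much like the diff above.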

Related

Converting tf.keras model to TFLite: Model is slow and doesn't work with XNN Pack

Until recently I had been training a model using TF 1.15 based on MobileNetV2.
After training I had always been able to run these commands to generate a TFLite version:
tf.keras.backend.set_learning_phase(0)
converter = tf.lite.TFLiteConverter.from_keras_model_file(tf_keras_path)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.lite.constants.FLOAT16]
tflite_model = converter.convert()
The resulting model was fast enough for our needs and when our Android developer used XNN Pack, we got an extra 30% reduction in inference time.
More recently I've developed a replacement model using TF2.4.1, based on the built-in keras implementation of efficientnet-b2.
This new model has larger input image size ((260,260) vs (224,224)) and its keras inference time is about 1.5x that of the older model.
However, when I convert to TFLite using these commands:
converter = tf.lite.TFLiteConverter.from_keras_model(newest_v3)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()
there are a number of problems:
inference time is 5x slower than the old model
on Android our developer sees this error: "Failed to apply XNNPACK delegate: Attempting to use a delegate that only supports static-sized tensors with a graph that has dynamic-sized tensors."
When I run the conversion I can see a message that says: "function_optimizer: function_optimizer did nothing. time = 9.854ms."
I have also tried saving as SavedModel and converting the saved model.
Another attempt I made was to use the command line tool with these arguments (and as far as I recall, pretty much every permutation of arguments possible):
tflite_convert --saved_model_dir newest_v3/ \
--enable_v1_converter \
--experimental_new_converter True \
--input-shape=1,260,260,3 \
--input-array=input_1:0 \
--post_training_quantize \
--quantize_to_float16 \
--output_file newest_v3d.tflite \
--allow-custom-ops
If anyone can shed some light onto what's going on here I'd be very grateful.
TensorFlow Lite does currently support tensors with dynamic shapes (enabled by default, and explicitly via the "experimental_new_converter True" option in your conversion), but the issue below points out that XNNPACK does not:
https://github.com/tensorflow/tensorflow/issues/42491
As XNNPACK is not able to optimize the graph of the EfficientNet model, you are not getting the performance boost, which makes inference 5 times slower than before instead of only around 1.5 times.
Personally, I would just recommend moving to EfficientNet-Lite, as it is the mobile/TPU counterpart of EfficientNet and was designed with the restricted set of operations available in TensorFlow Lite in mind:
https://blog.tensorflow.org/2020/03/higher-accuracy-on-vision-models-with-efficientnet-lite.html
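If you want to stay with EfficientNet-B2 for now, one possible workaround (my assumption, not something verified in this thread) is to convert from a concrete function with a fully static input shape, so the converted graph contains no dynamic-sized tensors. A minimal sketch, assuming newest_v3 is the Keras model from the question:
import tensorflow as tf

# Assumed: newest_v3 is the tf.keras EfficientNet-B2 model from the question.
# Wrapping it in a tf.function with a fixed [1, 260, 260, 3] input signature
# pins the input shape before conversion.
run_model = tf.function(lambda x: newest_v3(x))
concrete_func = run_model.get_concrete_function(
    tf.TensorSpec([1, 260, 260, 3], tf.float32))

converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func])
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()

with open("newest_v3_static.tflite", "wb") as f:
    f.write(tflite_model)
Whether XNNPACK then delegates the whole EfficientNet graph is a separate question; the sketch only removes the dynamic-shape complaint.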

Difference in Performance between Cloud Compute VM and AI Platform

I have a GCP cloud compute VM, which is an n1-standard-16, with 4 P100 GPUs attached, and a solid state drive for storing data. I'll refer to this as "the VM".
I've previously used the VM to train a tensorflow based CNN. I want to move away from this to using AI Platform so I can run multiple jobs simultaneously. However I've run into some problems.
Problems
When the training is run on the VM I can set a batch size of 400, and the standard time for an epoch to complete is around 25 minutes.
When the training is running on a complex_model_m_p100 AI platform machine, which I believe to be equivalent to the VM, I can set a maximum batch size of 128, and the standard time for an epoch to complete is 1 hour 40 minutes.
Differences: the VM vs AI Platform
The VM uses TF1.12 and AI Platform uses TF1.15. Consequently there is a difference in GPU drivers (CUDA 9 vs CUDA 10).
The VM is equipped with a solid state drive, which I don't think is the case for AI platform machines.
I want to understand the cause of the reduced batch size, and decrease the epoch times on AI Platform to levels comparable to the VM. Has anyone else run into this issue? Am I running on the correct kind of AI Platform machine? Any advice would be welcome!
Could be a bunch of things. There are two ways to go about it. Making the VM look more like AI Platform:
export IMAGE_FAMILY="tf-latest-gpu" # 1.15 instead of 1.12
export ZONE=...
export INSTANCE_NAME=...
gcloud compute instances create $INSTANCE_NAME \
--zone=$ZONE \
--image-family=$IMAGE_FAMILY \
--image-project=deeplearning-platform-release \
--maintenance-policy=TERMINATE \
--metadata="install-nvidia-driver=True"
and then attach 4 GPUs after that.
...or making AI Platform look more like the VM:
https://cloud.google.com/ai-platform/training/docs/machine-types#gpus-and-tpus,
because you are using a Legacy Machine right now.
After following the advice of @Frederik Bode and creating a replica VM with TF 1.15 and the associated drivers installed, I've managed to solve my problem.
Rather than using the multi_gpu_model function call within tf.keras, it's actually best to use a distributed strategy and run the model within that scope.
There is a guide describing how to do it here.
Essentially now my code looks like this:
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    training_dataset, validation_dataset = get_datasets()
    model = setup_model()

    # Don't do this, it's not necessary!
    # NOT NEEDED: model = tf.keras.utils.multi_gpu_model(model, 4)

    opt = tf.keras.optimizers.Adam(learning_rate=args.learning_rate)
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=opt,
                  metrics=['accuracy'])

    steps_per_epoch = args.steps_per_epoch
    validation_steps = args.validation_steps

    model.fit(training_dataset, steps_per_epoch=steps_per_epoch, epochs=args.num_epochs,
              validation_data=validation_dataset, validation_steps=validation_steps)
I set up a small dataset so I could rapidly prototype this.
With a single P100 GPU, the epoch time averaged 66 seconds.
With 4 GPUs, using the code above, the average epoch time was 19 seconds.
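On the reduced batch size: with MirroredStrategy, the batch passed to model.fit is the global batch, split across replicas, so it is usually scaled by the number of GPUs. A minimal sketch, assuming get_datasets() returns unbatched tf.data.Dataset objects and using an illustrative per-replica batch size:
import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()

# The global batch is divided across replicas, so scale it by the replica count.
per_replica_batch_size = 100
global_batch_size = per_replica_batch_size * mirrored_strategy.num_replicas_in_sync

training_dataset, validation_dataset = get_datasets()
training_dataset = training_dataset.batch(global_batch_size)
validation_dataset = validation_dataset.batch(global_batch_size)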

Why would TensorFlow Serving run a model slower than TensorFlow on a stock Keras model?

I've been attempting to deploy a machine learning solution with TensorFlow Serving on an embedded device (a Jetson Xavier [ARMv8]).
One model used by the solution is a stock Xception network, generated by:
xception = tf.keras.applications.Xception(include_top=False, input_shape=(299, 299, 3), pooling=None)
xception.save("./saved_xception_model/1", save_format="tf")
Running the Xception model on the device gives reasonable performance - about 0.1s per prediction, ignoring all processing:
xception = tf.keras.models.load_model("saved_xception_model/1")  # load_model takes no save_format argument
image = get_some_image()  # image is a numpy.ndarray
image = image.astype("float32")  # astype returns a new array, so reassign
image /= 255
image = cv2.resize(image, (299, 299))
# Tensorflow predict takes ~0.1s
xception.predict([image])
However, once the model is running in a Tensorflow Serving GPU container, via Nvidia-Docker, the model is much slower - about 3s to predict.
I've been trying to isolate the cause of the poor performance, and I've run out of ideas.
So far I've tested:
Tweaking TF Serving's batching parameters to go all out on latency (batch_timeout_micros: 0, max_batch_size: 1), which produced a modest 0.5s gain in performance.
Optimizing the model with TensorRT via saved_model_cli.
Running the Xception model in isolation, as the only model being served by TF Serving.
Experimenting with doubling the memory allocated per TF process.
Experimenting with enabling and disabling batching altogether.
Experimenting with enabling and disabling model warmup.
I would expect TF Serving to provide more or less the same prediction time as TF (allowing for gRPC encoding and decoding), and it does for other models I'm running. None of my efforts have got Xception up to the ~0.1s performance I would expect.
My install of Tensorflow is built by Nvidia, from TF version 2.0.
My TF Serving container is self-built from the TF-Serving 2.0 source, with GPU support.
I start a Tensorflow Serving container as follows:
tf_serving_cmd = "docker run --runtime=nvidia -d"
tf_serving_cmd += " --name my-container"
tf_serving_cmd += " -p=8500:8500 -p=8501:8501"
tf_serving_cmd += " --mount=type=bind,source=/home/xception_model,target=/models/xception_model"
tf_serving_cmd += " --mount=type=bind,source=/home/model_config.pb,target=/models/model_config.pb"
tf_serving_cmd += " --mount=type=bind,source=/home/batching_config.pb,target=/models/batching_config.pb"
# Self built TF serving image for Jetson Xavier, ARMv8.
tf_serving_cmd += " ${MY_ORG}/serving"
# I have tried 0.5 with no performance difference.
# TF-Serving does not complain it wants more memory in either case.
tf_serving_cmd += " --per_process_gpu_memory_fraction:0.25"
tf_serving_cmd += " --model_config_file=/models/model_config.pb"
tf_serving_cmd += " --flush_filesystem_caches=true"
tf_serving_cmd += " --enable_model_warmup=true"
tf_serving_cmd += " --enable_batching=true"
tf_serving_cmd += " --batching_parameters_file=/models/batching_config.pb"
I'm at the point where I'm starting to wonder if this is a bug in TF-Serving, although I have no idea where (Yeah, I know it's never a bug, it's always the user...)
Can anyone suggest why TF-Serving might underperform compared to TF?
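One way to narrow down where the time goes is to time a single gRPC Predict call against the container directly; a minimal sketch (the model name, signature name, and input tensor key are assumptions for illustration):
import time

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "xception_model"            # assumed model name
request.model_spec.signature_name = "serving_default"  # assumed signature
image = np.random.rand(1, 299, 299, 3).astype(np.float32)
request.inputs["input_1"].CopyFrom(                    # assumed input tensor key
    tf.make_tensor_proto(image, shape=image.shape))

start = time.time()
stub.Predict(request, 10.0)  # 10-second deadline
print("round-trip latency: %.3fs" % (time.time() - start))
If the round trip is slow even with batching disabled, the time is being spent server-side rather than in transport.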

Can't use Keras CSVLogger callbacks in SageMaker script mode. It fails to write the log file to S3 (error: No such file or directory)

I have this script where I want to write the callbacks to a separate CSV file in a SageMaker custom-script Docker container, but when I try to run it in local mode, it fails with the following error. I have a hyper-parameter tuning job (HPO) to run, and this keeps giving me errors. I need to get this local-mode run working correctly before doing the HPO.
In the notebook I use the following code.
from sagemaker.tensorflow import TensorFlow
tf_estimator = TensorFlow(entry_point='lstm_model.py',
                          role=role,
                          code_location=custom_code_upload_location,
                          output_path=model_artifact_location + '/',
                          train_instance_count=1,
                          train_instance_type='local',
                          framework_version='1.12',
                          py_version='py3',
                          script_mode=True,
                          hyperparameters={'epochs': 1},
                          base_job_name='hpo-lstm-local-test'
                          )
tf_estimator.fit({'training': training_input_path, 'validation': validation_input_path})
In my lstm_model.py script the following code is used.
lgdir = os.path.join(model_dir, 'callbacks_log.csv')
csv_logger = CSVLogger(lgdir, append=True)
regressor.fit(x_train, y_train, batch_size=batch_size,
              validation_data=(x_val, y_val),
              epochs=epochs,
              verbose=2,
              callbacks=[csv_logger])
I tried creating the file beforehand, as shown below, using the TensorFlow backend, but it doesn't create a file. (K: tensorflow backend, tf: tensorflow)
filename = tf.Variable(lgdir , tf.string)
content = tf.Variable("", tf.string)
sess = K.get_session()
tf.io.write_file(filename, content)
I can't use any other packages like pandas to create the file, as the TensorFlow Docker container SageMaker provides for custom scripts doesn't include them; only a limited set of packages is available.
Is there a way I can write the CSV file to the S3 bucket location before the fit method tries to write the callback output? Or is that even the right solution? I am not sure.
If you can suggest another way to get the callbacks out, I would accept that answer as well, but it should be worth the effort.
This Docker image really narrows the scope.
Well, for starters, you can always make your own Docker image using the TensorFlow image as a base. I work in TensorFlow 2.0, so this will be slightly different for you, but here is an example of my image pattern:
# Downloads the TensorFlow library used to run the Python script
# (use the equivalent tag for your TF version)
FROM tensorflow/tensorflow:2.0.0a0

# Contains the common functionality necessary to create a container compatible with Amazon SageMaker
RUN pip install sagemaker-containers -q

# Wandb allows us to customize and centralize logging while maintaining open-source agility
# (here you would install pandas instead)
RUN pip install wandb -q

# Copies the training code inside the container to the design pattern created by the Tensorflow estimator
# (here you could copy over a callbacks csv)
COPY mnist-2.py /opt/ml/code/mnist-2.py
COPY callbacks.py /opt/ml/code/callbacks.py
COPY wandb_setup.sh /opt/ml/code/wandb_setup.sh

# Set the login script as the entry point
# (here you would instead launch lstm_model.py)
ENV SAGEMAKER_PROGRAM wandb_setup.sh
I believe you are looking for a pattern similar to this, but I prefer to log all of my model data using Weights and Biases. They're a little out of date on their SageMaker integration, but I'm actually in the midst of writing an updated tutorial for them. It should be finished this month and will include logging and comparing runs from hyperparameter tuning jobs.
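Separately from the custom-image route above, here is a minimal sketch of pointing CSVLogger at a local directory that SageMaker uploads to S3 when the job finishes. It relies on the standard script-mode environment variables and is an assumption on my part, not something verified in this thread:
import os
from tensorflow.keras.callbacks import CSVLogger

# SM_OUTPUT_DATA_DIR is a local path (/opt/ml/output/data by default); SageMaker
# copies its contents to the job's S3 output location at the end of training.
log_dir = os.environ.get('SM_OUTPUT_DATA_DIR', '/opt/ml/output/data')
csv_logger = CSVLogger(os.path.join(log_dir, 'callbacks_log.csv'), append=True)

# Then pass it to fit() as before:
# regressor.fit(..., callbacks=[csv_logger])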

How can I train with my own dataset with darkflow?

I'm a beginner with some programming experience. I'm trying to train darkflow with my own dataset, following these instructions:
https://github.com/thtrieu/darkflow
So far I have done the following steps.
installed darkflow and the relevant modules
created test images and made annotations (Pascal VOC).
https://ibb.co/y4HmtGz
https://ibb.co/GkxLshK
If I have understood correctly, darkflow training requires Pascal VOC annotations?
My problem is that I don't know how to start the training. How can I start the training process, and how can I test whether the neural net is working? Am I supposed to get weights as a result of training?
You can choose to use pre-trained weights from here. Download cfg and weights.
Assuming you have darkflow installed, you can train your network like this:
flow --model cfg/<your-config-filename>.cfg --load bin/<filename>.weights --train --annotation train/Annotations --dataset train/Images --epoch 100 --gpu 1.0
If you want to train your network from scratch w/o using any pre-trained weights,
you can do this:
flow --model cfg/<your-config-filename>.cfg --train --annotation train/Annotations --dataset train/Images --epoch 100 --gpu 1.0
After the training starts, model checkpoints are saved inside the ckpt directory. You can load the latest checkpoint and test on sample images, as in the sketch below.
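A minimal sketch using darkflow's Python API to load the latest checkpoint and run a prediction; the config name, threshold, and image path are placeholders for illustration:
import cv2
from darkflow.net.build import TFNet

# "load": -1 loads the most recent checkpoint from the ckpt/ directory.
options = {
    "model": "cfg/<your-config-filename>.cfg",
    "load": -1,
    "threshold": 0.3,
    "gpu": 1.0,
}
tfnet = TFNet(options)

# Run a prediction on a single sample image; the result is a list of dicts
# with label, confidence, and bounding-box corners.
img = cv2.imread("train/Images/sample.jpg")
print(tfnet.return_predict(img))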