I deployed the LaBSE model to AI Platform over the past few days.
The issue I am running into is that the response to a prediction request exceeds the 2 MB limit.
I had several ideas to improve the situation:
1) make AI Platform return minified JSON (not pretty-printed, i.e. without spaces and newlines everywhere)
2) make AI Platform return the results in a binary format
3) since the response is composed of ~13 outputs, change the model so that it returns only one output
Do you know of any way to do 1) or 2)?
I spent a lot of effort on 3), and I am sure it is possible, for example by editing the network before uploading it. Here is what I have tried so far:
VERSION = 'v1'
MODEL = 'labse_2_b'
MODEL_DIR = BUCKET + '/' + MODEL
# Download the model
! wget 'https://tfhub.dev/google/LaBSE/2?tf-hub-format=compressed' \
--output-document='{MODEL}.tar.gz'
! mkdir {MODEL}
! tar -xzvf '{MODEL}.tar.gz' -C {MODEL}
# Attempts to load the model, edit it, and save it:
model.save(export_path, save_format='tf') # ValueError: Model <keras.engine.sequential.Sequential object at 0x7f87e833c650>
# cannot be saved because the input shapes have not been set.
# Usually, input shapes are automatically determined from calling
# `.fit()` or `.predict()`.
# To manually set the shapes, call `model.build(input_shape)`.
model.build(input_shape=(None,)) # cannot find a proper shape
# Create an AI Platform model version:
! gsutil -m cp -r '{MODEL}' {MODEL_DIR} # upload model to Google Cloud Storage
! gcloud ai-platform versions create $VERSION \
--model {MODEL} \
--origin {MODEL_DIR} \
--runtime-version=2.1 \
--framework='tensorflow' \
--python-version=3.7 \
--region="{REGION}"
Could someone please help with that?
Thanks a lot in advance,
EDIT:
For those wondering about this limitation (see the comments below), here is some additional information.
A short sentence such as
"I wish you a pleasant flight and a good meal aboard this plane."
which is only 16 word-piece tokens long:
[101, 146, 34450, 15100, 170, 147508, 48088, 14999, 170, 17072, 66369, 351617, 15272, 69746, 119, 102]
cannot be processed:
Response size too large. Received at least 3220082 bytes; max is 2000000.". Details: "Response size too large. Received at least 3220082 bytes; max is 2000000.
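To see why even a 16-token sentence can overflow the limit, here is a rough back-of-the-envelope estimate. It is only a sketch under assumptions that are not stated above: that the encoder has a hidden size of 768 (BERT-base sized), that about 13 of the returned outputs are per-token tensors of shape [tokens, 768], and that each float costs roughly 20 bytes in pretty-printed JSON. The function name and the bytes-per-float figure are illustrative guesses, not measured values.

```python
LIMIT_BYTES = 2_000_000  # limit reported in the error message above


def estimate_response_bytes(num_tokens, hidden_size=768, num_outputs=13,
                            bytes_per_float=20):
    """Rough JSON payload size: one float per (token, hidden unit, output).

    hidden_size, num_outputs and bytes_per_float are assumptions made for
    this estimate, not values read from the deployed model.
    """
    return num_tokens * hidden_size * num_outputs * bytes_per_float


all_outputs = estimate_response_bytes(16)
one_output = estimate_response_bytes(16, num_outputs=1)

print("all outputs:", all_outputs, "bytes, over limit:", all_outputs > LIMIT_BYTES)
print("one output: ", one_output, "bytes, over limit:", one_output > LIMIT_BYTES)
```

Under these assumptions the estimate for all outputs lands in the same range as the 3220082 bytes reported in the error, while a single per-token output stays well under the limit. That is consistent with idea 3) being the most effective of the three.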
As far as I know, when I run a TensorFlow model from a Python script, I can use the following code snippet to profile the timeline of each block in the model.
import tensorflow as tf
from tensorflow.python.client import timeline

# Ask the session to record a full trace for this run (TF 1.x API).
options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
batch_positive_score = sess.run([positive_score], feed_dict, options=options, run_metadata=run_metadata)
# Convert the collected step stats to a Chrome trace (viewable at chrome://tracing).
fetched_timeline = timeline.Timeline(run_metadata.step_stats)
chrome_trace = fetched_timeline.generate_chrome_trace_format()
with open('./result/timeline.json', 'w') as f:
    f.write(chrome_trace)
But how can I profile a model that is loaded in TensorFlow Serving?
I think you can use tf.profiler even with Serving, because the served model is ultimately a TensorFlow graph and, as far as I understand, what you set up during training (including profiling) carries over to serving as well.
Here is the TensorFlow code:
# User can control the tracing steps and
# dumping steps. User can also run online profiling during training.
#
# Create options to profile time/memory as well as parameters.
builder = tf.profiler.ProfileOptionBuilder
opts = builder(builder.time_and_memory()).order_by('micros').build()
opts2 = tf.profiler.ProfileOptionBuilder.trainable_variables_parameter()
# Collect traces of steps 10~20, dump the whole profile (with traces of
# step 10~20) at step 20. The dumped profile can be used for further profiling
# with command line interface or Web UI.
with tf.contrib.tfprof.ProfileContext('/tmp/train_dir',
                                      trace_steps=range(10, 20),
                                      dump_steps=[20]) as pctx:
    # Run online profiling with 'op' view and 'opts' options at step 15, 18, 20.
    pctx.add_auto_profiling('op', opts, [15, 18, 20])
    # Run online profiling with 'scope' view and 'opts2' options at step 20.
    pctx.add_auto_profiling('scope', opts2, [20])
    # High level API, such as slim, Estimator, etc.
    train_loop()
After that, we can run the following commands from the command line:
bazel-bin/tensorflow/core/profiler/profiler \
--profile_path=/tmp/train_dir/profile_xx
tfprof> op -select micros,bytes,occurrence -order_by micros
# Profiler ui available at: https://github.com/tensorflow/profiler-ui
python ui.py --profile_context_path=/tmp/train_dir/profile_xx
Code for Visualizing Time and Memory:
# The following example generates a timeline.
tfprof> graph -step -1 -max_depth 100000 -output timeline:outfile=<filename>
generating trace file.
******************************************************
Timeline file is written to <filename>.
Open a Chrome browser, enter URL chrome://tracing and load the timeline file.
******************************************************
Attribute TensorFlow graph running time to your Python code:
tfprof> code -max_depth 1000 -show_name_regexes .*model_analyzer.*py.* -select micros -account_type_regexes .* -order_by micros
Show your model variables and the number of parameters:
tfprof> scope -account_type_regexes VariableV2 -max_depth 4 -select params
Show the most expensive operation types:
tfprof> op -select micros,bytes,occurrence -order_by micros
Auto-profile:
tfprof> advise
For more detailed information, you can refer to the links below.
The classes mentioned above are documented here:
https://www.tensorflow.org/api_docs/python/tf/profiler
The code is explained in detail here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/profiler/README.md
In the flowers tutorial by Google here: https://cloud.google.com/ml-engine/docs/tensorflow/flowers-tutorial
the following command is used to preprocess the data:
python trainer/preprocess.py \
--input_dict "$DICT_FILE" \
--input_path "gs://cloud-ml-data/img/flower_photos/train_set.csv" \
--output_path "${GCS_PATH}/preproc/train" \
--cloud
I understand that we could replace the CSV file with our own list and hence train on a different set of images. However, creating a CSV file for over 100 types of images would be cumbersome. Is there a way to overcome this?
The train_set.csv is a list of file paths in Google Cloud Storage and the prediction label.
This is a part of the file:
gs://cloud-ml-data/img/flower_photos/daisy/754296579_30a9ae018c_n.jpg,daisy
gs://cloud-ml-data/img/flower_photos/dandelion/18089878729_907ed2c7cd_m.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/dandelion/284497199_93a01f48f6.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/dandelion/3554992110_81d8c9b0bd_m.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/daisy/4065883015_4bb6010cb7_n.jpg,daisy
gs://cloud-ml-data/img/flower_photos/roses/7420699022_60fa574524_m.jpg,roses
gs://cloud-ml-data/img/flower_photos/dandelion/4558536575_d43a611bd4_n.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/daisy/7568630428_8cf0fc16ff_n.jpg,daisy
gs://cloud-ml-data/img/flower_photos/tulips/7064813645_f7f48fb527.jpg,tulips
gs://cloud-ml-data/img/flower_photos/sunflowers/4933229095_f7e4218b28.jpg,sunflowers
gs://cloud-ml-data/img/flower_photos/daisy/14523675369_97c31d0b5b.jpg,daisy
gs://cloud-ml-data/img/flower_photos/sunflowers/21518663809_3d69f5b995_n.jpg,sunflowers
gs://cloud-ml-data/img/flower_photos/dandelion/15782158700_3b9bf7d33e_m.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/tulips/8713398906_28e59a225a_n.jpg,tulips
gs://cloud-ml-data/img/flower_photos/tulips/6770436217_281da51e49_n.jpg,tulips
gs://cloud-ml-data/img/flower_photos/dandelion/8754822932_948afc7cef.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/daisy/22873310415_3a5674ec10_m.jpg,daisy
gs://cloud-ml-data/img/flower_photos/sunflowers/5967283168_90dd4daf28_n.jpg,sunflowers
So you will have to collect a set of images for your own training set, upload it to GCS and classify the images. Then you just have to retrieve the list of paths (this is easily done with the gsutil ls command) and append the classification label to each path.
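If the images are uploaded with one folder per class, as in the excerpt above, the label can be taken from the folder name, so the CSV does not have to be written by hand. The following is a minimal sketch under that assumption. The function name, the bucket name and the extension filter are hypothetical, and the input is assumed to be the lines printed by a recursive gsutil ls listing.

```python
import posixpath

IMAGE_EXTENSIONS = (".jpg", ".jpeg")  # assumed filter, adjust as needed


def paths_to_csv_rows(gcs_paths):
    """Turn 'gs://bucket/.../<label>/<file>' paths into 'path,label' rows."""
    rows = []
    for path in gcs_paths:
        path = path.strip()
        if not path.lower().endswith(IMAGE_EXTENSIONS):
            continue  # skip blank lines, folder headers and non-image objects
        # The label is the name of the folder that directly contains the file.
        label = posixpath.basename(posixpath.dirname(path))
        rows.append(f"{path},{label}")
    return rows


# Hypothetical listing, in the shape a recursive gsutil ls would print.
listing = [
    "gs://my-bucket/flowers/daisy/:",
    "gs://my-bucket/flowers/daisy/001.jpg",
    "gs://my-bucket/flowers/roses/002.JPG",
    "",
]
rows = paths_to_csv_rows(listing)
print("\n".join(rows))
```

Writing the rows to a file and uploading it gives a replacement for train_set.csv, whatever the number of classes.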
I'm currently working with a pre-trained MobileNet network, which I would like to retrain on a dataset of PNG images.
I call the retrain script as follows:
python scripts/retrain.py \
  --bottleneck_dir=tf_files/bottlenecks \
  --how_many_training_steps=200 \
  --model_dir=tf_files/models/ \
  --summaries_dir=tf_files/training_summaries/"mobilenet_0.50_224" \
  --output_graph=tf_files/retrained_graph.pb \
  --output_labels=tf_files/retrained_labels.txt \
  --architecture mobilenet_0.50_224 \
  --image_dir=tf_files/data
It seems that the images need to be JPG. Is there any way to work with PNG images instead?
I can confirm that it doesn't work with PNG files. I have, however, written a bash script that converts the images to JPG when it is placed in the same directory as the class subdirectories of the dataset.
First you need to install the imagemagick package:
sudo apt-get install imagemagick
Then you can run this script:
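#!/bin/bash
# Run from the dataset root: each subdirectory holds the images of one class.
for d in */ ; do
  cd "$d" || continue
  for p in * ; do
    # "${p%.*}" drops only the last extension, so names with extra dots are kept intact.
    convert "$p" "${p%.*}".jpg
  done
  cd ..
done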
edit:
retrain.py does have a list with valid extensions (line 151):
extensions = ['jpg', 'jpeg', 'JPG', 'JPEG']
I haven't tried adding 'png' to the list, though.
I am new to Floydhub. I am trying to run the code from this github repository and the corresponding tutorial.
For the training, I successfully used this command:
floyd run --gpu --env tensorflow-1.2 --data janinanu/dataset/data/2:tut_train 'python udc_train.py'
I adjusted this line in the training file to work in Floydhub:
tf.flags.DEFINE_string("input_dir", "/tut_train", "Directory containing input data files 'train.tfrecords' and 'validation.tfrecords'")
As said, this worked without problems for the training.
Now for the testing, I cannot find any details on how to specify the model directory in which the output of the training is stored. I mounted the output from training with model_dir as the mount point. I assumed that the correct command should look something like this:
floyd run --cpu --env tensorflow-1.2 --data janinanu/datasets/data/2:tut_test --data janinanu/projects/retrieval-based-dialogue-system-on-ubuntu-corpus/18/output:model_dir 'python udc_test.py --model_dir=?'
I have no idea what to put in the --model_dir=?
Correspondingly, I assumed that I have to adjust some lines in the test file:
tf.flags.DEFINE_string("test_file", "/tut_test/test.tfrecords", "Path of test data in TFRecords format")
tf.flags.DEFINE_string("model_dir", "/model_dir", "Directory to load model checkpoints from")
...as well as in the train file (not sure about that though...):
tf.flags.DEFINE_string("input_dir", "/tut_train", "Directory containing input data files 'train.tfrecords' and 'validation.tfrecords'")
tf.flags.DEFINE_string("model_dir", "/model_dir", "Directory to store model checkpoints (defaults to /model_dir)")
When I use e.g. --model_dir=/model_dir and the code with the above adjustments, I get this error:
2017-12-22 12:17:49,048 INFO - return func(*args, **kwargs)
2017-12-22 12:17:49,048 INFO - File "/usr/local/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 543, in evaluate
2017-12-22 12:17:49,048 INFO - log_progress=log_progress)
2017-12-22 12:17:49,049 INFO - File "/usr/local/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 816, in _evaluate_model
2017-12-22 12:17:49,049 INFO - % self._model_dir)
2017-12-22 12:17:49,049 INFO - tensorflow.contrib.learn.python.learn.estimators._sklearn.NotFittedError: Couldn't find trained model at /model_dir
Which doesn't come as a surprise.
Can anyone give me some clarification on how to feed the training output into the test run?
I will also post this question in the Floydhub Forum.
Thanks!!
You can mount the output of any job just like you mount a dataset. In your example:
--data janinanu/projects/retrieval-based-dialogue-system-on-ubuntu-corpus/18/output:model_dir
should mount the entire output directory from run 18 at /model_dir in the new job.
You can confirm this by viewing the job page (select the "data" tab to see which datasets are mounted at which paths).
In your case, can you confirm whether the test is looking for the correct model filename?
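One way to check this from inside the job is to look for the checkpoint index file. TensorFlow's tf.train.latest_checkpoint reads a file literally named "checkpoint" in the directory it is given, and the NotFittedError above is what the estimator raises when no checkpoint is found there. The sketch below, with a hypothetical function name, lists every directory under the mount point that contains such a file. If the training script wrote its checkpoints into a subdirectory of the output, that subdirectory is the value to pass as --model_dir.

```python
import os


def find_checkpoint_dirs(root):
    """Return every directory under `root` that holds a `checkpoint` index file."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "checkpoint" in filenames:
            found.append(dirpath)
    return sorted(found)


# "/model_dir" is the mount point used in the question; nothing is printed
# if the directory is missing or contains no checkpoint index file.
for directory in find_checkpoint_dirs("/model_dir"):
    print(directory)
```

An empty result would mean the mounted output does not contain a checkpoint at all, which points to the training run rather than to the mount.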
I will also respond to this in the FloydHub forum.
I am trying to use DenseNet for a regression problem with TF-Slim. My data contains 60000 JPEG images with 37 float labels for each image. I divided my data into three different tfrecords files: a train set (60%), a validation set (20%) and a test set (20%).
I need to evaluate the validation set during the training loop and make a plot like the one in the image.
The TF-Slim documentation only explains the training loop and the evaluation loop separately, so I can only evaluate the validation or test set after the training loop has finished. As I said, I need to evaluate during training.
I tried to use the slim.evaluation.evaluation_loop function instead of slim.evaluation.evaluate_once, but it doesn't help.
slim.evaluation.evaluation_loop(
    master=FLAGS.master,
    checkpoint_dir=checkpoint_path,
    logdir=FLAGS.eval_dir,
    num_evals=num_batches,
    eval_op=list(names_to_updates.values()) + print_ops,
    variables_to_restore=variables_to_restore,
    summary_op=tf.summary.merge(summary_ops),
    eval_interval_secs=eval_interval_secs)
I tried evaluation.evaluate_repeatedly as well.
from tensorflow.contrib.training.python.training import evaluation
evaluation.evaluate_repeatedly(
    master=FLAGS.master,
    checkpoint_dir=checkpoint_path,
    eval_ops=list(names_to_updates.values()) + print_ops,
    eval_interval_secs=eval_interval_secs)
Both of these functions just read the latest available checkpoint from checkpoint_dir and then apparently wait for the next one. However, when new checkpoints are generated, they do not run an evaluation at all.
I use Python 2.7.13 and Tensorflow 1.3.0 on CPU.
Any help will be highly appreciated.
Using evaluate_once works just fine with a bash script that uses sleep. It appears that TensorBoard is capable of plotting multiple single runs from a given eval_dir.
So I use something like:
#!/bin/bash
set -e
# Paths to model and evaluation results
TRAIN_DIR=~/pDL/tensorflow/model/mobilenet_v1_1_224_rp-v1/run0004
TEST_DIR=${TRAIN_DIR}/eval
# Where the dataset is saved to.
DATASET_DIR=/mnt/data/tensorflow/data
# Run evaluation (using slim.evaluation.evaluate_once)
CONTINUE=1
while [ "$CONTINUE" -ne 0 ]
do
  python eval_image_classifier.py \
    --checkpoint_path=${TRAIN_DIR} \
    --eval_dir=${TEST_DIR} \
    --dataset_name=master_db \
    --preprocessing_name=preprocess224 \
    --dataset_split_name=valid \
    --dataset_dir=${DATASET_DIR} \
    --model_name=mobilenet_v1 \
    --patch_size=64
  echo "sleeping for next run"
  sleep 600
done
This appears to be an issue of setting checkpoint_path properly, as addressed here:
https://github.com/tensorflow/tensorflow/issues/13769
The answer there, by Ellie68, is to set:
if tf.gfile.IsDirectory(FLAGS.checkpoint_path):
    if tf.train.latest_checkpoint(FLAGS.checkpoint_path):
        checkpoint_path = tf.train.latest_checkpoint(FLAGS.checkpoint_path)
else:
    checkpoint_path = FLAGS.checkpoint_path
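The sleep-based loop can also be written in Python so that each checkpoint is evaluated only once, instead of re-running the evaluation every 10 minutes regardless of whether training has produced anything new. The sketch below is framework-agnostic and all names in it are hypothetical. In a TF 1.x setup, get_latest could wrap tf.train.latest_checkpoint on the training directory and evaluate could wrap a call to slim.evaluation.evaluate_once, but neither is wired in here.

```python
import time


def evaluate_new_checkpoints(get_latest, evaluate, interval_secs=600,
                             max_polls=None, sleep=time.sleep):
    """Call `evaluate(path)` once for each new path returned by `get_latest()`.

    Polls every `interval_secs` seconds; stops after `max_polls` polls,
    or runs forever when `max_polls` is None.
    """
    last_seen = None
    evaluated = []
    polls = 0
    while max_polls is None or polls < max_polls:
        path = get_latest()
        if path is not None and path != last_seen:
            evaluate(path)
            evaluated.append(path)
            last_seen = path
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval_secs)
    return evaluated


# Demo with stand-in callables: no checkpoint yet, then step 100 (seen twice),
# then step 200. The sleep is stubbed out so the demo returns immediately.
fake_checkpoints = iter([None, "model.ckpt-100", "model.ckpt-100", "model.ckpt-200"])
seen = evaluate_new_checkpoints(
    get_latest=lambda: next(fake_checkpoints),
    evaluate=lambda path: print("evaluating", path),
    max_polls=4,
    sleep=lambda secs: None,
)
```

In the demo, the repeated "model.ckpt-100" is evaluated only once, so every evaluation written to eval_dir corresponds to a distinct training step.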