How to specify model directory in Floydhub? - tensorflow

I am new to Floydhub. I am trying to run the code from this github repository and the corresponding tutorial.
For the training, I successfully used this command:
floyd run --gpu --env tensorflow-1.2 --data janinanu/dataset
/data/2:tut_train 'python udc_train.py'
I adjusted this line in the training file to work in Floydhub:
tf.flags.DEFINE_string("input_dir", "/tut_train", "Directory containing
input data files 'train.tfrecords' and 'validation.tfrecords'")
As said, this worked without problems for the training.
Now for the testing, I do not really find any details on how to specify the model directory in which the output of the training gets stored. I mounted the output from training with model_dir as mount point. I assumed that the correct command should look something like this:
floyd run --cpu --env tensorflow-1.2 --data janinanu/datasets
/data/2:tut_test --data janinanu/projects/retrieval-based-dialogue-system-
on-ubuntu-corpus/18/output:model_dir 'python udc_test.py
--model_dir=?'
I have no idea what to put in the --model_dir=?
Correspondingly, I assumed that I have to adjust some lines in the test file:
tf.flags.DEFINE_string("test_file", "/tut_test/test.tfrecords", "Path of
test data in TFRecords format")
tf.flags.DEFINE_string("model_dir", "/model_dir", "Directory to load model
checkpoints from")
...as well as in the train file (not sure about that though...):
tf.flags.DEFINE_string("input_dir", "/tut_train", "Directory containing
input data files 'train.tfrecords' and 'validation.tfrecords'")
tf.flags.DEFINE_string("model_dir", "/model_dir", "Directory to store
model checkpoints (defaults to /model_dir)")
When I use e.g. --model_dir=/model_dir and the code with the above adjustments, I get this error:
2017-12-22 12:17:49,048 INFO - return func(*args, **kwargs)
2017-12-22 12:17:49,048 INFO - File "/usr/local/lib/python3.5/site-
packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py",
line 543, in evaluate
2017-12-22 12:17:49,048 INFO - log_progress=log_progress)
2017-12-22 12:17:49,049 INFO - File "/usr/local/lib/python3.5/site-
packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py",
line 816, in _evaluate_model
2017-12-22 12:17:49,049 INFO - % self._model_dir)
2017-12-22 12:17:49,049 INFO -
tensorflow.contrib.learn.python.learn.estimators._sklearn.NotFittedError:
Couldn't find trained model at /model_dir
Which doesn't come as a surprise.
Can anyone give me some clarification on how to feed the training output into the test run?
I will also post this question in the Floydhub Forum.
Thanks!!
.

You can mount the output of any job just like you mount a data. In your example:
--data janinanu/projects/retrieval-based-dialogue-system-
on-ubuntu-corpus/18/output:model_dir
should mount the entire output directory from run 18 to /mount_dir of the new job.
You can confirm this by viewing the job page (select the "data" tab to see what datasets are mounted at which paths).
In you case, can you confirm if the test is looking for the correct model filename?
I will also respond to this in the FloydHub forum.

Related

Using LaBSE deployed to Google Cloud AI Platform

I deployed LaBSE model to AI Platform in the last past few days.
The issue I encounter is the answer of the request is above the limit of 2MB.
Several ideas I had to improve the situation:
make AI Platform return minified (not beautifully formatted) JSON (without spaces and newlines everywhere
make AI Plateform return the results in a binary format
since the answer is composed of ~13 outputs : change it to only one output
Do you know any ways of doing 1) or 2) ?
I spent lost of efforts on 3). I'm sure this one is possible. For example by editing the network before uploading it. Here are stuff I tried so far for that:
VERSION = 'v1'
MODEL = 'labse_2_b'
MODEL_DIR = BUCKET + '/' + MODEL
# Download the model
! wget 'https://tfhub.dev/google/LaBSE/2?tf-hub-format=compressed' \
--output-document='{MODEL}.tar.gz'
! mkdir {MODEL}
! tar -xzvf '{MODEL}.tar.gz' -C {MODEL}
# Attempts to load the model, edit it, and save it:
model.save(export_path, save_format='tf') # ValueError: Model <keras.engine.sequential.Sequential object at 0x7f87e833c650>
# cannot be saved because the input shapes have not been set.
# Usually, input shapes are automatically determined from calling
# `.fit()` or `.predict()`.
# To manually set the shapes, call `model.build(input_shape)`.
model.build(input_shape=(None,)) # cannot find a proper shape
# create a AI Plateform model version:
! gsutil -m cp -r '{MODEL}' {MODEL_DIR} # upload model to Google Cloud Storage
! gcloud ai-platform versions create $VERSION \
--model {MODEL} \
--origin {MODEL_DIR} \
--runtime-version=2.1 \
--framework='tensorflow' \
--python-version=3.7 \
--region="{REGION}"
Could some please help with with that?
Thanks a lot in advance,
EDIT :
For those wondering about this limitation, as in the comments below : Here are some complementary pieces of information:
A short sentence as
"I wish you a pleasant flight and a good meal aboard this plane."
which is just 16 parts of words long:
[101, 146, 34450, 15100, 170, 147508, 48088, 14999, 170, 17072, 66369, 351617, 15272, 69746, 119, 102]
cannot be processed:
Response size too large. Received at least 3220082 bytes; max is 2000000.". Details: "Response size too large. Received at least 3220082 bytes; max is 2000000.

ValueError: Input 0 of node Variable/Assign was passed int32 from Variable:0 incompatible with expected int32_ref

I am currently trying to get a trained TF seq2seq model working with Tensorflow.js. I need to get the json files for this. My input is a few sentences and the output is "embeddings". This model is working when I read in the checkpoint however I can't get it converted for tf.js. Part of the process for conversion is to get my latest checkpoint frozen as a protobuf (pb) file and then convert that to the json formats expected by tensorflow.js.
The above is my understanding and being that I haven't done this before, it may be wrong so please feel free to correct if I'm wrong in what I have deduced from reading.
When I try to convert to the tensorflow.js format I use the following command:
sudo tensorflowjs_converter --input_format=tf_frozen_model
--output_node_names='embeddings'
--saved_model_tags=serve
./saved_model/model.pb /web_model
This then displays the error listed in this post:
ValueError: Input 0 of node Variable/Assign was passed int32 from
Variable:0 incompatible with expected int32_ref.
One of the problems I'm running into is that I'm really not even sure how to troubleshoot this. So I was hoping that perhaps one of you maybe had some guidance or maybe you know what my issue may be.
I have upped the code I used to convert the checkpoint file to protobuf at the link below. I then added to the bottom of the notebook an import of that file that is then providing the same error I get when trying to convert to tensorflowjs format. (Just scroll to the bottom of the notebook)
https://github.com/xtr33me/textsumToTfjs/blob/master/convert_ckpt_to_pb.ipynb
Any help would be greatly appreciated!
Still unsure as to why I was getting the above error, however in the end I was able to resolve this issue by just switching over to using TF's SavedModel via tf.saved_model. A rough example of what worked for me can be found below should anyone in the future run into something similar. After saving out the below model, I was then able to perform the tensorflowjs_convert call on it and export the correct files.
if first_iter == True: #first time through
first_iter = False
#Lets try saving this badboy
cwd = os.getcwd()
path = os.path.join(cwd, 'simple')
shutil.rmtree(path, ignore_errors=True)
inputs_dict = {
"batch_decoder_input": tf.convert_to_tensor(batch_decoder_input)
}
outputs_dict = {
"batch_decoder_output": tf.convert_to_tensor(batch_decoder_output)
}
tf.saved_model.simple_save(
sess, path, inputs_dict, outputs_dict
)
print('Model Saved')
#End save model code

Using your own evaluation and training set in cloud-ml-engine sample

in the flowers tutorial by google here: https://cloud.google.com/ml-engine/docs/tensorflow/flowers-tutorial
For preproccessing of data we used the dollwoing command:
python trainer/preprocess.py \
--input_dict "$DICT_FILE" \
--input_path "gs://cloud-ml-data/img/flower_photos/train_set.csv" \
--output_path "${GCS_PATH}/preproc/train" \
--cloud
I understand we could replace the csv file with our own list and hence train with a different set of images, however creating a csv files for over a 100 types of images will be cumbersome, is there a way to overcome this?
The train_set.csv is a list of file paths in Google Cloud Storage and the prediction label.
This is a part of the file:
gs://cloud-ml-data/img/flower_photos/daisy/754296579_30a9ae018c_n.jpg,daisy
gs://cloud-ml-data/img/flower_photos/dandelion/18089878729_907ed2c7cd_m.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/dandelion/284497199_93a01f48f6.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/dandelion/3554992110_81d8c9b0bd_m.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/daisy/4065883015_4bb6010cb7_n.jpg,daisy
gs://cloud-ml-data/img/flower_photos/roses/7420699022_60fa574524_m.jpg,roses
gs://cloud-ml-data/img/flower_photos/dandelion/4558536575_d43a611bd4_n.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/daisy/7568630428_8cf0fc16ff_n.jpg,daisy
gs://cloud-ml-data/img/flower_photos/tulips/7064813645_f7f48fb527.jpg,tulips
gs://cloud-ml-data/img/flower_photos/sunflowers/4933229095_f7e4218b28.jpg,sunflowers
gs://cloud-ml-data/img/flower_photos/daisy/14523675369_97c31d0b5b.jpg,daisy
gs://cloud-ml-data/img/flower_photos/sunflowers/21518663809_3d69f5b995_n.jpg,sunflowers
gs://cloud-ml-data/img/flower_photos/dandelion/15782158700_3b9bf7d33e_m.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/tulips/8713398906_28e59a225a_n.jpg,tulips
gs://cloud-ml-data/img/flower_photos/tulips/6770436217_281da51e49_n.jpg,tulips
gs://cloud-ml-data/img/flower_photos/dandelion/8754822932_948afc7cef.jpg,dandelion
gs://cloud-ml-data/img/flower_photos/daisy/22873310415_3a5674ec10_m.jpg,daisy
gs://cloud-ml-data/img/flower_photos/sunflowers/5967283168_90dd4daf28_n.jpg,sunflowers
So you will have to collect a set of images for your own train set, upload it to the GCS and clasify them. Then you just have to retrieve the list of path (it could be easily achieve using gsutil ls command) and concatenate with the classification label.

Tensorflow - which is the real checkpoint file?

I'm trying to used pretrained models with checkpoints but I can't figure out which file should be put into config as checkpoint. There are files:
model.ckpt.meta
model.ckpt.index
model.ckpt.data0000-of-0001
I tried all of them but I see errors. In the different articles I saw just "model.ckpt" but there is no such files.
I tried to work with ssd_mobilenet from here: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md
model.ckpt.meta : File to store graph information
model.ckpt.index : File to store index of variables
model.ckpt.data0000-of-0001 : File to store value of variables
All three files are real checkpoint files. When you restore the model using tf.train.Saver.restore, parameter "save_path" should be "~model.ckpt". Your error probably occurred because of the invalid save_path. Check if three files(.meta, .index, .data~) in save_path(relative path or absolute path).

TensorFlow distributed master worker save fails silently; the checkpoint file isn't created but no exception is raised

In distribution tensorflow environment. the master worker saves checkpoint fail.
saver.save has return ok*(not raise exception and return the store checkpoint file path) but, the return checkpoint file is not exist.
this is not same as the description of the tensorflow api
Why? How to Fix it?
=============
the related code is below:
def def_ps(self):
self.saver = tf.train.Saver(max_to_keep=100,keep_checkpoint_every_n_hours=3)
def save(self,idx):
ret = self.saver.save(self.sess,self.save_model_path,global_step=None,write_meta_graph=False)
if not os.path.exists(ret):
msg = "save model for %u path %s not exists."%(idx,ret)
lg.error(msg)
raise Exception(msg);
=============
the log is below:
2016-06-02 21:33:52,323 root ERROR save model for 2 path model_path/rl_model_2 not exists.
2016-06-02 21:33:52,323 root ERROR has error:save model for 2 path model_path/rl_model_2 not exists.
Traceback (most recent call last):
File "d_rl_main_model_dist_0.py", line 755, in run_worker
model_a.save(next_model_idx)
File "d_rl_main_model_dist_0.py", line 360, in save
Trainer.save(self,save_idx)
File "d_rl_main_model_dist_0.py", line 289, in save
raise Exception(msg);
Exception: save model for 2 path model_path/rl_model_2 not exists.
===========
not meets the tensorflow api which define Saver.save as below:
https://www.tensorflow.org/versions/master/api_docs/python/state_ops.html#Saver
tf.train.Saver.save(sess, save_path, global_step=None, latest_filename=None, meta_graph_suffix='meta', write_meta_graph=True)
Returns:
A string: path at which the variables were saved. If the saver is sharded, this string ends with: '-?????-of-nnnnn' where 'nnnnn' is the number of shards created.
Raises:
TypeError: If sess is not a Session.
ValueError: If latest_filename contains path components.
The tf.train.Saver.save() method is a little... surprising when you run in distributed mode. The actual file is written by the process that holds the tf.Variable op, which is typically a process in "/job:ps" if you've used the example code to set things up. This means that you need to look in save_path on each of the remote machines that have variables to find the checkpoint files.
Why is this the case? The Saver API implicitly assumes that all processes have the same view of a shared file system, like an NFS mount, because that is the typical setup we use at Google. We've added support for Google Cloud Storage in the latest nightly versions of TensorFlow, and are investigating HDFS support as well.