How to run tensorflow retrain.py from other script? - tensorflow

I am writing a script to automate a training using the main() function in the tensorflow retrain.py. This script is normally called from the shell with parsed arguments. In retrain.py:
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--image_dir',
        type=str,
        default='',
        help='Path to folders of labeled images.'
    )
    parser.add_argument(
        '--output_graph',
        type=str,
        default='/tmp/output_graph.pb',
        help='Where to save the trained graph.'
    )
    ...
    FLAGS, unparsed = parser.parse_known_args()
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
I understand that tensorflow usually handles the FLAGS argument as global variable, but I don't understand how this variable is set as global, since in the code snippet, FLAGS should be an argparse.Namespace object.
However, I've tried to define the FLAGS variable manually in my own script:
from scripts.retrain import main
...
if __name__ == '__main__':
    tf.app.flags.DEFINE_string('summaries_dir', summaries_dir, 'Help summaries_dir.')
    tf.app.flags.DEFINE_string('image_dir', image_dir, 'Help image_dir.')
    ...
    FLAGS = tf.app.flags.FLAGS
    tf.app.run(main=main, argv=[sys.argv[0]] + ['python -m scripts.retrain.py'])
I always get the error AttributeError: 'NoneType' object has no attribute 'summaries_dir'. How should I run retrain.py from my script?
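One way to read the error, sketched here under the assumption that retrain.py declares FLAGS as a module-level global (initially None) and that main() reads that global: the `if __name__ == '__main__'` block only runs when the file is executed directly, so after an import FLAGS is still None. A FLAGS variable defined in the calling script is a different name in a different module and does not change that. The stand-in module below is hypothetical and trimmed to this pattern; it is not the real retrain.py.

```python
import argparse
import types

# Hypothetical stand-in for scripts/retrain.py, reduced to the FLAGS pattern.
RETRAIN_SOURCE = '''
FLAGS = None  # module-level global, normally filled in by the __main__ block

def main(_):
    # main() reads the module-level FLAGS rather than taking it as a parameter
    return "summaries -> " + FLAGS.summaries_dir
'''

retrain = types.ModuleType("retrain")
exec(RETRAIN_SOURCE, retrain.__dict__)

# After an import the __main__ block has not run, so FLAGS is still None.
try:
    retrain.main([])
except AttributeError as err:
    error_message = str(err)

# Assign the namespace on the module itself, then call main().
retrain.FLAGS = argparse.Namespace(
    summaries_dir="/tmp/retrain_logs",  # example values, not defaults from retrain.py
    image_dir="/tmp/images",
)
result = retrain.main([])

print(error_message)
print(result)
```

With the real script this would presumably become `from scripts import retrain`, followed by `retrain.FLAGS = ...` and a call to `retrain.main(...)`. The namespace needs an attribute for every flag that main() reads, which this sketch does not enumerate.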

Related

!gcloud dataproc jobs submit pyspark - ERROR AttributeError: 'str' object has no attribute 'batch'

How can I pass a dataset-typed object as an input to a Dataproc job?
My code is below:
%%writefile spark_job.py
import sys
import time
import pyspark
import argparse
import pickle
#def time_configs_rdd(test_set, batch_sizes,batch_numbers,repetitions):
def time_configs_rdd(argv):
    print(argv)
    parser = argparse.ArgumentParser() # get a parser object
    parser.add_argument('--out_bucket', metavar='out_bucket', required=True,
                        help='The bucket URL for the result.') # add a required argument
    parser.add_argument('--out_file', metavar='out_file', required=True,
                        help='The filename for the result.') # add a required argument
    parser.add_argument('--batch_size', metavar='batch_size', required=True,
                        help='The batch sizes to time.') # add a required argument
    parser.add_argument('--batch_number', metavar='batch_number', required=True,
                        help='The numbers of batches to time.') # add a required argument
    parser.add_argument('--repetitions', metavar='repetitions', required=True,
                        help='The repetitions per configuration.') # add a required argument
    parser.add_argument('--test_set', metavar='test_set', required=True,
                        help='The dataset to read from.') # add a required argument
    args = parser.parse_args(argv) # read the value
    # the value provided with --out_bucket is now in args.out_bucket
    time_configs_results = []
    for s in args.batch_size:
        for n in args.batch_number:
            dataset = args.test_set.batch(s).take(n) # <-- this is the line that fails
            for r in args.repetitions:
                tt0 = time.time()
                for i in enumerate(dataset):
                    totaltime = str(time.time()-tt0)
                batchtime = totaltime
                #imgpersec = s*n/totaltime
                time_configs_results.append((s,n,r,float(batchtime)))
                #time_configs_results.append((s,n,r,batchtime,imgpersec))
    time_configs_results_rdd = sc.parallelize(time_configs_results) #create an RDD with all results for each parameter
    time_configs_results_rdd_avg = time_configs_results_rdd.map(lambda x: (x, x[0]*x[1]/x[3])) #RDD with the average reading speeds (RDD.map)
    #mapping = time_configs_results_rdd_avg.collect()
    #print(mapping)
    return (time_configs_results_rdd_avg)
if 'google.colab' not in sys.modules: # Don't use system arguments when run in Colab
    time_configs_rdd(sys.argv[1:])
elif __name__ == "__main__" : # but define them manually
    time_configs_rdd(["--out_bucket", BUCKET, "--out_file", "time_configs_rdd_out.pkl",
                      "--batch_size", batch_size, "--batch_number", batch_number,
                      "--test_set", test_set ] )
And this is the code to execute it:
FILENAME = 'file_RDD_OUT.pkl'
batch_size = [1]
batch_number = [1]
repetitions = [1]
# test_set = 1 gives a string error
test_set = dataset2  # <ParallelMapDataset shapes: ((192, 192, None), ()), types: (tf.float32, tf.string)> cannot be inserted
!gcloud dataproc jobs submit pyspark --cluster $CLUSTER --region $REGION \
./spark_job.py \
-- --out_bucket $BUCKET --out_file $FILENAME --batch_size $batch_size --batch_number $batch_number --repetitions $repetitions --test_set $test_set
Unfortunately, it keeps failing with this error:
AttributeError: 'str' object has no attribute 'batch'
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [c2048c422f334b08a628af5a1aa492eb] failed with error:
Job failed with message [AttributeError: 'str' object has no attribute 'batch'].
The problem is with test_set: how should I convert dataset2 (a ParallelMapDataset) so that it can be read by the job?
So you are trying to parse a string from the command line argument to a ParallelMapDataset type. You want to use the type param in your add_argument calls.
From https://docs.python.org/3/library/argparse.html#type and I quote:
By default, ArgumentParser objects read command-line arguments in as simple strings. However, quite often the command-line string should instead be interpreted as another type, like a float or int. The type keyword argument of add_argument() allows any necessary type-checking and type conversions to be performed.
and
type= can take any callable that takes a single string argument and returns the converted value
So you probably want something like:
def parse_parallel_map_dataset(string):
    # your logic to parse the string into your desired data structure
    ...
parser.add_argument('--test_set', metavar='test_set', required=True,
                    type=parse_parallel_map_dataset)
Or better yet, read your test_set from a file and pass the file name as an argument.
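As a minimal sketch of both suggestions, the example below uses one `type=` callable that turns a comma-separated string into a list of integers and another that treats the argument as a file name and loads its contents. The helper names `int_list` and `json_file`, and the JSON file layout, are assumptions made for illustration; they are not part of the original job. Note that without such a converter, a loop like `for s in args.batch_size` iterates over the characters of the string.

```python
import argparse
import json
import os
import tempfile


def int_list(text):
    """Convert '1,2,4' into [1, 2, 4]; argparse passes in the raw string."""
    try:
        return [int(item) for item in text.split(",")]
    except ValueError:
        raise argparse.ArgumentTypeError(
            "expected comma-separated integers, got %r" % text)


def json_file(path):
    """Treat the argument as a file name and return the parsed JSON content."""
    with open(path) as handle:
        return json.load(handle)


parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int_list, required=True)
parser.add_argument("--test_set", type=json_file, required=True)

# Demonstration with a temporary file standing in for the stored test set.
with tempfile.TemporaryDirectory() as tmp_dir:
    test_set_path = os.path.join(tmp_dir, "test_set.json")
    with open(test_set_path, "w") as handle:
        json.dump({"files": ["a.tfrec", "b.tfrec"]}, handle)
    args = parser.parse_args(
        ["--batch_size", "1,2,4", "--test_set", test_set_path])

print(args.batch_size)
print(args.test_set)
```

For a real dataset the file would more likely hold the data itself (or the paths needed to rebuild it inside the job), since only strings can cross the command line.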

How to use tf.train.Saver in SessionRunHook?

I have trained many sub-models, each of which is a part of the final model. I now want to use those pretrained sub-models to initialize the final model's parameters. I tried to use a SessionRunHook to load the model parameters from the other ckpt files into the final model.
I tried the following code, but it failed. Any advice would be appreciated. Thanks!
The error info is:
Traceback (most recent call last):
File "train_high_api_local.py", line 282, in <module>
tf.app.run()
File "/Users/zhouliaoming/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 124, in run
_sys.exit(main(argv))
File "train_high_api_local.py", line 266, in main
clf_.train(input_fn=lambda: read_file([tables[0]], epochs_per_eval), steps=None, hooks=[hook_test]) # input yield: x, y
File "/Users/zhouliaoming/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 314, in train
.......
File "/Users/zhouliaoming/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 674, in create_session
hook.after_create_session(self.tf_sess, self.coord)
File "train_high_api_local.py", line 102, in after_create_session
saver = tf.train.Saver([ti]) # TODO: ERROR INFO: Graph is finalized and cannot be modified.
.......
File "/Users/zhouliaoming/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3135, in create_op
self._check_not_finalized()
File "/Users/zhouliaoming/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2788, in _check_not_finalized
raise RuntimeError("Graph is finalized and cannot be modified.")
RuntimeError: Graph is finalized and cannot be modified.
And the code is:
class SetTensor(session_run_hook.SessionRunHook):
    """ like tf.train.LoggingTensorHook """
    def after_create_session(self, session, coord):
        """ Called when new TensorFlow session is created: graph is finalized and ops can no longer be added. """
        graph = tf.get_default_graph()
        ti = graph.get_tensor_by_name("h_1_15/bias:0")
        with session.as_default():
            with tf.name_scope("rewrite"):
                saver = tf.train.Saver([ti]) # TODO: ERROR INFO: Graph is finalized and cannot be modified.
                saver.restore(session, "/Users/zhouliaoming/data/credit_dnn/model_retrain/rm_gene_v2_sall/model.ckpt-2102")
        pass
def main(unused_argv):
    """ train """
    norm_all_func = lambda x: tf.cond(x>1, lambda: tf.log(x), lambda: tf.identity(x))
    feature_columns=[[tf.feature_column.numeric_column(COLUMNS[i], shape=fi, normalizer_fn=lambda x: tf.py_func(weight_norm2, [x], tf.float32) )] for i, fi in enumerate(FEA_DIM)] # normalized: running OK!
    ## use self-defined model
    param = {"learning_rate": 0.0001, "feature_columns": feature_columns, "isanalysis": FLAGS.isanalysis, "isall": False}
    clf_ = tf.estimator.Estimator(model_fn=model_fn_wide2deep, params=param, model_dir=ckpt_dir)
    hook_test = SetTensor(["h_1_15/bias", "h_1_15/kernel"])
    epochs_per_eval = 1
    for n in range(int(FLAGS.num_epochs/epochs_per_eval)):
        # train num_epochs
        clf_.train(input_fn=lambda: read_file([tables[0]], epochs_per_eval), steps=None, hooks=[hook_test]) # input yield: x, y
SessionRunHook is not meant for this use case. As the error says, you cannot change the graph once it has been finalized, and that happens before after_create_session() is called.
You can assign variables using saver.restore() in your "normal code". You don't have to be inside any hooks.
Also, if you want to restore many variables and can match them to their names and shapes in a checkpoint, you might want to take a look at https://gist.github.com/iganichev/d2d8a0b1abc6b15d4a07de83171163d4. It shows some example code to restore a subset of variables.
You can do this:
class SaveAtEnd(tf.train.SessionRunHook):
    def begin(self):
        # create your saver here, while the graph can still be modified
        self._saver = tf.train.Saver()
    def end(self, session):
        self._saver.save(session, ...)
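The reason the saver belongs in begin() can be shown with a toy model of the hook lifecycle: begin() is called while the graph is still mutable, the graph is then finalized, and after_create_session() runs afterwards. The classes below (`ToyGraph`, `EarlyHook`, `LateHook`, `run_session`) are plain-Python stand-ins invented for this illustration and do not use TensorFlow.

```python
class ToyGraph:
    """Stand-in for a graph: ops can be added until finalize() is called."""

    def __init__(self):
        self.ops = []
        self.finalized = False

    def add_op(self, name):
        if self.finalized:
            raise RuntimeError("Graph is finalized and cannot be modified.")
        self.ops.append(name)
        return name

    def finalize(self):
        self.finalized = True


class EarlyHook:
    """Adds its restore op in begin() and only uses it afterwards."""

    def __init__(self, graph):
        self._graph = graph
        self._restore_op = None
        self.restored = False

    def begin(self):
        self._restore_op = self._graph.add_op("save/restore_all")

    def after_create_session(self):
        # No new ops here: only use what begin() already created.
        self.restored = self._restore_op in self._graph.ops


class LateHook:
    """Tries to add its restore op after the graph has been finalized."""

    def __init__(self, graph):
        self._graph = graph

    def begin(self):
        pass

    def after_create_session(self):
        self._graph.add_op("save/restore_all")


def run_session(graph, hooks):
    """Mimic the call order: begin(), finalize, after_create_session()."""
    for hook in hooks:
        hook.begin()
    graph.finalize()
    for hook in hooks:
        hook.after_create_session()


early_graph = ToyGraph()
early_hook = EarlyHook(early_graph)
run_session(early_graph, [early_hook])

late_graph = ToyGraph()
try:
    run_session(late_graph, [LateHook(late_graph)])
    late_error = None
except RuntimeError as err:
    late_error = str(err)

print(early_hook.restored)
print(late_error)
```

Applied to the question, this suggests building the tf.train.Saver in begin() and calling saver.restore(session, ...) in after_create_session(), where a session exists but no new ops are needed.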

Tensorflow - saving the checkpoint files as .pb, but with no output node names

I have the following files:
model.ckpt-2400.data-00000-of-00001
model.ckpt-2400.index
model.ckpt-2400.meta
And I would like to save them in the form of a .pb with the following function:
def freeze_graph(model_dir, output_node_names):
    """Extract the sub graph defined by the output nodes and convert all its variables into constant
    Args:
        model_dir: the root folder containing the checkpoint state file
        output_node_names: a string, containing all the output node's names,
                           comma separated
    """
    if not tf.gfile.Exists(model_dir):
        raise AssertionError(
            "Export directory doesn't exist. Please specify an export "
            "directory: %s" % model_dir)
    if not output_node_names:
        print("You need to supply the name of a node to --output_node_names.")
        return -1
    # We retrieve our checkpoint fullpath
    checkpoint = tf.train.get_checkpoint_state(model_dir)
    input_checkpoint = checkpoint.model_checkpoint_path
    # We specify the full filename of our frozen graph
    absolute_model_dir = "/".join(input_checkpoint.split('/')[:-1])
    output_graph = absolute_model_dir + "/frozen_model.pb"
    # We clear devices to allow TensorFlow to control on which device it will load operations
    clear_devices = True
    # We start a session using a temporary fresh Graph
    with tf.Session(graph=tf.Graph()) as sess:
        # We import the meta graph in the current default Graph
        saver = tf.train.import_meta_graph(input_checkpoint + '.meta', clear_devices=clear_devices)
        # We restore the weights
        saver.restore(sess, input_checkpoint)
        # We use a built-in TF helper to export variables to constants
        output_graph_def = tf.graph_util.convert_variables_to_constants(
            sess, # The session is used to retrieve the weights
            tf.get_default_graph().as_graph_def(), # The graph_def is used to retrieve the nodes
            output_node_names.split(",") # The output node names are used to select the useful nodes
        )
        # Finally we serialize and dump the output graph to the filesystem
        with tf.gfile.GFile(output_graph, "wb") as f:
            f.write(output_graph_def.SerializeToString())
        print("%d ops in the final graph." % len(output_graph_def.node))
    return output_graph_def
The problem is that when I use tf.get_default_graph().as_graph_def().node, it returns []. An empty array. There are no output node names I can use for this.
So how else can I save them as .pb? Should I just refer to the tf.python.tools.freeze_graph.freeze_graph() function?
Turns out all I needed to do is to supply the name of the output node... that I, in another part of my code, designated as the node to log to check the results.
predictions = {
    # Generate predictions (for PREDICT and EVAL mode)
    "classes": tf.argmax(input=logits, axis=1),
    # Add `softmax_tensor` to the graph. It is used for PREDICT and by the
    # `logging_hook`.
    "probabilities": tf.nn.softmax(logits, name="softmax_tensor") #This one
}
In my case it's softmax_tensor.

how to convert .ckpt file to .pb

I used ssd_mobilenets in the Object Detection API to train my own model and got .ckpt files. It works well on my computer, but now I want to use the model on my phone, so I need to convert it to a .pb file. I do not know how to do this; can anyone help? Also, the graph of ssd_mobilenets is complex and I cannot find which node is the output of the model. Does anyone know the name of the output?
Use export_inference_graph.py to convert model checkpoint file into a .pb file.
python tensorflow_models/object_detection/export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path architecture_used_while_training.config \
    --trained_checkpoint_prefix path_to_saved_ckpt/model.ckpt-NUMBER \
    --output_directory model/
This is the 4th code cell in object_detection_tutorial.ipynb at this link: https://github.com/tensorflow/models/blob/master/research/object_detection/object_detection_tutorial.ipynb
# What model to download.
MODEL_NAME = 'ssd_mobilenet_v1_coco_2017_11_17'
MODEL_FILE = MODEL_NAME + '.tar.gz'
DOWNLOAD_BASE = 'http://download.tensorflow.org/models/object_detection/'
# Path to frozen detection graph. This is the actual model that is used for the object detection.
PATH_TO_CKPT = MODEL_NAME + '/frozen_inference_graph.pb'
# List of the strings that is used to add correct label for each box.
PATH_TO_LABELS = os.path.join('data', 'mscoco_label_map.pbtxt')
NUM_CLASSES = 90
The cell clearly gives the .pb filename, which is frozen_inference_graph.pb.
So you already have the .pb file; why do you want to convert it?
In any case, you can refer to this link for freezing the graph: https://github.com/jayshah19949596/Tensorboard-Visualization-Freezing-Graph
You need to use the freeze_graph() function from the tensorflow.python.tools.freeze_graph module to convert your .ckpt file to a .pb file.
The code below shows how to call it:
freeze_graph.freeze_graph(input_graph_path,
                          input_saver_def_path,
                          input_binary,
                          input_checkpoint_path,
                          output_node_names,
                          restore_op_name,
                          filename_tensor_name,
                          output_graph_path,
                          clear_devices,
                          initializer_nodes)
input_graph_path : path to the .pb file that holds your graph definition; this file is not frozen yet. Use tf.train.write_graph() to write it
input_saver_def_path : you can keep it an empty string
input_binary : boolean value; keep it False if the graph file was written as human-readable text rather than binary
input_checkpoint_path : path to the .ckpt file
output_graph_path : path where you want to write your frozen .pb file
clear_devices : boolean value; keep it False
output_node_names : explicit names of the output nodes that you want to save
restore_op_name : string value that should be "save/restore_all"
filename_tensor_name : "save/Const:0"
initializer_nodes : "" (empty string)

Proper way to optimize the input in TensorFlow for visualization

I have trained a model in TensorFlow and now I would like to visualize which inputs maximally activate an output. I'd like to know what the cleanest way to do this is.
I had thought to do this by creating a trainable input variable which I can assign once per run. Then by using an appropriate loss function and using an optimizer with a var_list containing just this input variable I would update this input variable until convergence. i.e.
trainable_input = tf.get_variable(
    'trainable_input',
    shape=data_op.get_shape(),
    dtype=data_op.dtype,
    initializer=tf.zeros_initializer(),
    trainable=True,
    collections=[tf.GraphKeys.LOCAL_VARIABLES])
trainable_input_assign_op = tf.assign(trainable_input, data_op)
data_op = trainable_input
# ... run the rest of the graph building code here, now with a trainable input
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
# loss_op is defined on one of the outputs
train_op = optimizer.minimize(loss_op, var_list=[trainable_input])
However, when I do this I run into issues. If I try to restore the pre-trained graph using a Supervisor, then it naturally complains that the new variables created by the AdamOptimizer do not exist in the graph I'm trying to restore. I can remedy this by using get_slots to get the variables the AdamOptimizer creates and manually adding those variables to the tf.GraphKeys.LOCAL_VARIABLES collection, but it feels pretty hacky and I'm not sure what the consequences of this would be. I can also exclude those variables explicitly from the Saver that is passed to the Supervisor without adding them to the tf.GraphKeys.LOCAL_VARIABLES collection, but then I get an exception that they do not get properly initialized by the Supervisor:
File "/usr/local/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 973, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/local/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 801, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/local/lib/python3.5/site-packages/tensorflow/python/training/coordinator.py", line 386, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python3.5/site-packages/six.py", line 686, in reraise
raise value
File "/usr/local/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 962, in managed_session
start_standard_services=start_standard_services)
File "/usr/local/lib/python3.5/site-packages/tensorflow/python/training/supervisor.py", line 719, in prepare_or_wait_for_session
init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
File "/usr/local/lib/python3.5/site-packages/tensorflow/python/training/session_manager.py", line 280, in prepare_session
self._local_init_op, msg))
RuntimeError: Init operations did not make model ready. Init op: init, init fn: None, local_init_op: name: "group_deps_5"
op: "NoOp"
input: "^init_1"
input: "^init_all_tables"
, error: Variables not initialized: trainable_input/trainable_input/Adam, trainable_input/trainable_input/Adam_1
I'm not really sure why these variables are not getting initialized since I have used that technique before to exclude some variables from the restore process (GLOBAL and LOCAL) and they seem to get initialized as expected.
In short, my question is whether there is a simple way to add an optimizer to the graph and do a checkpoint restore (where the checkpoint does not contain the optimizer variables) without having to muck around with the internals of the optimizer. If that's not possible, then is there any downside to just adding the optimizer variables to the LOCAL_VARIABLES collection?
The same error occurs when I use slim library. In fact, the slim.learning.train() uses tf.train.Supervisor inside. I hope my answer on this GitHub issue may help your Supervisor problem.
I had the same problem as you. I solved it with the following two steps.
1. pass the parameter saver to slim.learning.train()
ckpt = tf.train.get_checkpoint_state(FLAGS.train_dir)
saver = tf.train.Saver(var_list=optimistic_restore_vars(ckpt.model_checkpoint_path) if ckpt else None)
where function optimistic_restore_vars is defined as
def optimistic_restore_vars(model_checkpoint_path):
    reader = tf.train.NewCheckpointReader(model_checkpoint_path)
    saved_shapes = reader.get_variable_to_shape_map()
    # (graph variable name, checkpoint variable name) for variables present in the checkpoint
    var_names = sorted([(var.name, var.name.split(':')[0]) for var in tf.global_variables() if var.name.split(':')[0] in saved_shapes])
    restore_vars = []
    name2var = dict(zip(map(lambda x:x.name.split(':')[0], tf.global_variables()), tf.global_variables()))
    with tf.variable_scope('', reuse=True):
        for var_name, saved_var_name in var_names:
            curr_var = name2var[saved_var_name]
            var_shape = curr_var.get_shape().as_list()
            # only restore variables whose shape matches the saved one
            if var_shape == saved_shapes[saved_var_name]:
                restore_vars.append(curr_var)
    return restore_vars
2. pass the parameter local_init_op to slim.learning.train() to initialize the added new variables
local_init_op = tf.global_variables_initializer()
Finally, the code should look like this:
ckpt = tf.train.get_checkpoint_state(FLAGS.train_dir)
saver = tf.train.Saver(var_list=optimistic_restore_vars(ckpt.model_checkpoint_path) if ckpt else None)
local_init_op = tf.global_variables_initializer()
###########################
# Kicks off the training. #
###########################
learning.train(
    train_tensor,
    saver=saver,
    local_init_op=local_init_op,
    logdir=FLAGS.train_dir,
    master=FLAGS.master,
    is_chief=(FLAGS.task == 0),
    init_fn=_get_init_fn(),
    summary_op=summary_op,
    number_of_steps=FLAGS.max_number_of_steps,
    log_every_n_steps=FLAGS.log_every_n_steps,
    save_summaries_secs=FLAGS.save_summaries_secs,
    save_interval_secs=FLAGS.save_interval_secs,
    sync_optimizer=optimizer if FLAGS.sync_replicas else None
)
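The selection rule inside optimistic_restore_vars can be illustrated without TensorFlow: a variable is restored only if its name (without the ":0" suffix) appears in the checkpoint and its shape equals the saved shape. The sketch below uses plain dictionaries and tuples in place of the checkpoint reader and tf.Variable objects; `select_restorable` and the sample names and shapes are made up for this example.

```python
def select_restorable(saved_shapes, current_vars):
    """Return the names of variables that can be restored from a checkpoint.

    saved_shapes: dict mapping checkpoint variable name -> shape (list of ints)
    current_vars: list of (graph variable name such as "w:0", shape) pairs
    """
    restorable = []
    for full_name, shape in sorted(current_vars):
        base_name = full_name.split(":")[0]  # drop the ":0" output suffix
        # Missing names yield None, which never equals a shape list.
        if saved_shapes.get(base_name) == list(shape):
            restorable.append(full_name)
    return restorable


# Stand-in for reader.get_variable_to_shape_map()
saved_shapes = {
    "conv1/weights": [3, 3, 3, 32],
    "fc/weights": [1024, 10],
}

# Stand-in for tf.global_variables()
current_vars = [
    ("conv1/weights:0", [3, 3, 3, 32]),         # same name and shape: restored
    ("fc/weights:0", [1024, 5]),                # shape changed: skipped
    ("trainable_input/Adam:0", [192, 192, 3]),  # not in the checkpoint: skipped
]

restorable = select_restorable(saved_shapes, current_vars)
print(restorable)
```

Variables that are skipped this way, such as the optimizer slots, are the ones that still need initializing, which is what the local_init_op in step 2 takes care of.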