keras with tf backend: how to identify variables (tensors) which are in a graph - tensorflow

I've built (in jupyter notebook with Python 3.6) a long ML proof of concept, which, in essence, has 3 parts: load & prepare data; train network; use network.
I would like to be able to re-run it from "train network" without the "cost" of preparing the data again & again (even loading the prepared data from a save file takes a noticeable amount of time).
When I run all cells from the start of the network training (the first cell of which includes a K.clear_session() call to wipe out any previous network - needed if the architecture changes), it fails because, part way through, there are still variables stored (with the same names) that belong to the old graph.
I can see two simple solutions (but you may be able to advise a better method to tidy up):
loop through all the defined variables (Tensors) in globals() and del any which are Tensors (implicitly all part of the old session and graph), as sketched at the end of this question,
or (better)
loop through all the tensors defined in the (old) graph del'ing them before del'ing the (old) graph.
I can see K.get_uid but can't see how I can use this info to accomplish what I need.
In the meantime I have to reset and rerun the whole workbook every time I make adjustments to the network.
Is there a better way?
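For illustration, a minimal sketch of option 1 above, assuming the stale tensors live in the notebook's global namespace (names here are hypothetical):

import tensorflow as tf
from keras import backend as K

# Drop every global name bound to a tensor or variable from the old graph,
# then reset the Keras session so a fresh graph can be built.
stale_names = [name for name, obj in list(globals().items())
               if isinstance(obj, (tf.Tensor, tf.Variable))]
for name in stale_names:
    del globals()[name]

K.clear_session()  # the old graph is gone; the training cells can rebuild the model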

Related

Gensim word2vec saves numpy arrays?

I am running the Word2Vec implementation from gensim twice, and I have a problem with the save function:
import multiprocessing
import gensim

model_ = gensim.models.Word2Vec(all_doc, size=int(config['MODEL']['embed_size']),
                                window=int(config['MODEL']['window']),
                                workers=multiprocessing.cpu_count(),
                                sg=1, iter=int(config['MODEL']['iteration']),
                                negative=int(config['MODEL']['negative']),
                                min_count=int(config['MODEL']['min_count']),
                                seed=int(config['MODEL']['seed']))
model_.save(config['BASIC']['embedding_dir'])
I obtain different outputs each time I run it. The first time it gives an "output_embedding", an "output_embedding.trainables.syn1neg.npy" and an "output_embedding.wv.vectors.npy". But the second time it does not give the two .npy files; it just generates "output_embedding".
The only thing I change from the first to the second time is the sentences I use as input (all_doc).
Why does it not generate the 3 files?
Gensim only creates the separate files when the size of the internal numpy arrays is over a certain threshold – so I suspect your all_doc corpus has a very small vocabulary in one case, and a more typically large vocabulary in the other.
When it does generate multiple files, be sure to keep them all together for later loads to work.
(If for some urgent reason you needed to change that behavior, the inherited .save() method takes an optional sep_limit argument to change the threshold - but I'd recommend against mucking with this.)
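For illustration only, a hedged sketch of what changing that threshold could look like (the value here is arbitrary; the default is on the order of 10 MB):

# Not recommended: raise the array-size threshold so that larger arrays stay
# inside the single main file instead of being split into separate .npy files.
model_.save(config['BASIC']['embedding_dir'], sep_limit=100 * 1024 * 1024)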
Separately: that your file names have .trainables. in them suggests you're using a pre-4.0.0 version of Gensim. There've been some improvements to Word2Vec & related algorithms in the latest Gensim, and some older code will need small changes to keep working, so you may want to upgrade to the latest version before building any more functionality on an older base.

Problem when predicting via multiprocess with Tensorflow

I have 4 (or more) models (same structure but different training data). Now I want to ensemble them to make a prediction. I want to pre-load the models and then predict one input message (one message at a time) in parallel via multiprocessing. However, the program always stops at the "session.run" step. I could not figure out why.
I tried passing all arguments to the function in each process, as shown in the code below. I also tried using a Queue object and putting all the data (except the model object) in the queue. I also tried setting the number of processes to 1. It made no difference.
from multiprocessing import Manager, Process

with Manager() as manager:
    first_level_test_features = manager.list()
    procs = []
    for id in range(4):
        p = Process(target=predict,
                    args=(id, (message, models, configs, vocabs, emoji_dict,
                               first_level_test_features)))
        procs.append(p)
        p.start()
    for p in procs:
        p.join()
I did not get any error message since it just gets stuck there. I would expect the program to start multiple processes, with each process using the model passed to it to make the prediction.
I am unsure how session sharing across different Processes would work, and this is probably where your issue comes from. Given the way TensorFlow works, I would advise implementing the ensemble call as a graph operation, so that it can be run through a single session.run call, with TF handling the parallelization of computations wherever possible.
In practice, if you have symbolic tensors representing the models' predictions, you could use a TF operation to aggregate them (tf.concat, tf.reduce_mean, tf.add_n... whichever suits your design) and end up with a single symbolic tensor representing the ensemble prediction.
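As a minimal sketch of that idea, assuming each pre-loaded model exposes a symbolic prediction tensor in the same graph, fed from a shared placeholder (models, input_ph and encoded_message are hypothetical names):

import tensorflow as tf

# One symbolic prediction tensor per pre-loaded model, all in the same graph.
model_outputs = [m.prediction for m in models]

# Aggregate them into a single ensemble tensor (here a simple mean).
ensemble_prediction = tf.reduce_mean(tf.stack(model_outputs, axis=0), axis=0)

# A single session.run evaluates every model plus the aggregation; TensorFlow
# schedules the independent branches in parallel where possible.
result = sess.run(ensemble_prediction, feed_dict={input_ph: encoded_message})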
I hope this helps; if not, please provide some more details as to what your setting is, notably what form your models take.

How to assign values from one graph to another graph with the same structure in tensorflow?

I'm trying to implement DQN in tensorflow. Here I have one target network and one training network which have the same structure as each other. At the beginning of every 10000 training steps, I want to load the values from a checkpoint into both the target network and the training network, then stop gradients on the target network. However, I tried these approaches, and none of them worked:
1. Put the two networks in one graph. However, every time I load them, I don't know how to assign the values of the training-network part to the target-network part. (They are saved as different variables, since one has its gradients stopped.)
2. Define two graphs using tf.Graph() and run two sessions respectively. However, I can't load the checkpoint of one graph into another, even though they have the same structure. After all, they are two different graphs.
So, can anyone give me some advice? Much appreciated!
The typical approach would be to put everything in one graph, put your two networks in two name scopes, and then create tf.assign ops that copy each variable in one scope to its counterpart in the other, using tf.group to construct a final "copying" operation. Let's assume that the function create_net() builds a single network:
with tf.name_scope('main_network'):
    main_net = create_net()
with tf.name_scope('target_network'):
    target_network = create_net()

main_variables = tf.get_collection(tf.GraphKeys.VARIABLES, scope='main_network')
target_variables = tf.get_collection(tf.GraphKeys.VARIABLES, scope='target_network')

# I am assuming get_collection returns variables in the same order, please double
# check this is actually happening

assign_ops = []
for main_var, target_var in zip(main_variables, target_variables):
    assign_ops.append(tf.assign(target_var, tf.identity(main_var)))

copy_operation = tf.group(*assign_ops)
Now executing copy_operation in session.run should copy your main-network parameters to the target network. The above code should be considered pseudo-code rather than something you can copy & paste.
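As a hedged usage sketch (train_op and num_steps stand in for the questioner's own training setup):

# Refresh the target network from the main network every 10000 steps.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(copy_operation)          # start with both networks identical
    for step in range(num_steps):
        sess.run(train_op)            # updates the main-network variables only
        if step % 10000 == 0:
            sess.run(copy_operation)  # target_network <- main_network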

When should we use the place_pruned_graph config?

Question: As the title says, I wonder when we should use the place_pruned_graph config in GraphOptions. What's the purpose of this config?
I'm not clear on the comment about this config:
// Only place the subgraphs that are run, rather than the entire graph.
//
// This is useful for interactive graph building, where one might
// produce graphs that cannot be placed during the debugging
// process. In particular, it allows the client to continue work in
// a session after adding a node to a graph whose placement
// constraints are unsatisfiable.
We know that TensorFlow will normally partition an entire graph into several subgraphs. And the following code from CreateGraphs in direct_session.cc normally takes the else branch. (As far as I can see, I never found a case that takes the if branch, so I don't know when we should trigger it.)
if (options_.config.graph_options().place_pruned_graph()) {
  // Because we are placing pruned graphs, we need to create a
  // new SimpleGraphExecutionState for every new unseen graph,
  // and then place it.
  SimpleGraphExecutionStateOptions prune_options;
  prune_options.device_set = &device_set_;
  prune_options.session_options = &options_;
  prune_options.stateful_placements = stateful_placements_;
  TF_RETURN_IF_ERROR(SimpleGraphExecutionState::MakeForPrunedGraph(
      execution_state_->original_graph_def().library(), prune_options,
      execution_state_->original_graph_def(), subgraph_options,
      &temp_exec_state_holder, &client_graph));
  execution_state = temp_exec_state_holder.get();
} else {
  execution_state = execution_state_.get();
  TF_RETURN_IF_ERROR(
      execution_state->BuildGraph(subgraph_options, &client_graph));
}
The short answer? Never. The longer answer requires me to explain why this option exists at all.
So why does TensorFlow include this convoluted configuration option and logic to handle it? It's a historical accident that came about when tensorflow::DirectSession and tensorflow::GrpcSession had different internal implementations:
The tensorflow::GrpcSession used a single SimpleGraphExecutionState for the entire graph in a session. The net effect of this was that the placer—which is responsible for assigning devices to each node in the graph—would run before the graph was pruned.
The tensorflow::DirectSession originally used one SimpleGraphExecutionState for each pruned subgraph, with some special logic for sharing the placements of stateful nodes between invocations. Therefore the placer would run after the graph was pruned, and could make different decisions about where to place stateful nodes.
The benefit of the tensorflow::GrpcSession approach (place_pruned_graph = false) is that it takes into account all of the colocation constraints in the graph when running the placement algorithm, even if they don't occur in the subgraph being executed. For example, if you had an embedding matrix, and wanted to optimize it using the SparseApplyAdagrad op (which only has a CPU implementation), TensorFlow would figure out that the embedding matrix should be placed on CPU.
By contrast, if you specified no device for the embedding matrix and set place_pruned_graph = true, the matrix would (most likely) be placed on GPU when you ran its initializer, because all of the ops in the initialization subgraph would be runnable on GPU. And, since variables cannot move between devices, TensorFlow would not be able to issue the subgraph that ran SparseApplyAdagrad on the matrix. This was a real issue in the earliest version of TensorFlow.
So why support place_pruned_graph = true at all? It turns out that it is useful when using TensorFlow interactively. The place_pruned_graph = false option is unforgiving: once the graph for a session contains a node that cannot be placed, that session is useless, because the placement algorithm runs on the whole graph, would fail every time it is invoked, and therefore no steps could run. When you use a tf.InteractiveSession, we assume that you are using a REPL (or Jupyter notebook) and that it's beneficial to allow you to continue after making such a mistake. Therefore in a tf.InteractiveSession we set place_pruned_graph = true so that you can continue to use the session after adding an unplaceable node (as long as you don't try to run that node in a pruned subgraph).
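For concreteness, a minimal sketch of setting this option explicitly on an ordinary session with the TF 1.x API (tf.InteractiveSession effectively does this for you):

import tensorflow as tf

# The placer will only consider the pruned subgraph of each session.run call.
config = tf.ConfigProto(
    graph_options=tf.GraphOptions(place_pruned_graph=True))
sess = tf.Session(config=config)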
There is probably a better approach than place_pruned_graph = true for interactive use, but we haven't investigated adding one. Suggestions are always welcome on the GitHub issues page.

Can I change Inv operation into Reciprocal in an existing graph in Tensorflow?

I am working on an image classification problem with tensorflow. I have 2 different CNNs trained separately (in fact 3 in total, but I will deal with the third later), for different tasks and on an AWS (Amazon) machine. One tells if there is text in the image and the other one tells if the image is safe for work or not. Now I want to use them in a single script on my computer, so that I can put an image as input and get the results of both networks as output.
I load the two graphs in a single tensorflow Session, using the import_meta_graph API and the import_scope argument and putting each subgraph in a separate scope. Then I just use the restore method of the created saver, giving it the common Session as argument.
Then, in order to run inference, I retrieve the placeholders and final output with graph=tf.get_default_graph() and my_var=graph.get_operation_by_name('name').outputs[0] before using it in sess.run (I think I could just have put 'name' in sess.run instead of fetching the output tensor and putting it in a variable, but this is not my problem).
My problem is that the text CNN works perfectly fine, but the nsfw detector always gives me the same output, no matter the input (even with np.zeros()). I have tried both separately and it's the same story: text works but not nsfw. So I don't think the problem comes from using two networks simultaneously.
I also tried on the original AWS machine I trained it on, and this time the nsfw CNN worked perfectly.
Both networks are very similar. I checked on Tensorboard if everything was fine and I think it is ok. The differences are in the number of hidden units and the fact that I use batch normalization in the nsfw model and not in the text one. Now why this title? I observed that I got a warning when running the nsfw model that I didn't get when using only the text model:
W tensorflow/core/framework/op_def_util.cc:332] Op Inv is deprecated. It will cease to work in GraphDef version 17. Use Reciprocal.
So I thought maybe this was the reason, everything else being equal. I checked my GraphDef version, which seems to be 11, so Inv should still work in theory. By the way, the AWS machine uses tensorflow version 0.10 and I use version 0.12.
I noticed that the text network only had one Inv operation (via a filter on the names of the operations given by graph.get_operations()), and that the nsfw model had the same operation plus multiple Inv operations due to the batch normalization layers. As noted in the release notes, tf.inv has simply been renamed to tf.reciprocal, so I tried to change the names of the operations to Reciprocal with tf.group(), as proposed here, but it didn't work. I have seen that using tf.identity() and changing the name could also work, but from what I understand, tensorflow graphs are an append-only structure, so we can't really modify their operations (which seem to be immutable anyway).
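For reference, a minimal sketch of the kind of filter mentioned above, assuming the restored graph is the default graph:

import tensorflow as tf

# List every Inv operation, e.g. to check which name scopes they live under.
graph = tf.get_default_graph()
inv_ops = [op for op in graph.get_operations() if op.type == 'Inv']
for op in inv_ops:
    print(op.name)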
The thing is:
as I said, the Inv operation should still work in my GraphDef version;
this is only a warning;
the Inv operations only appear under name scopes that begin with 'gradients', so, from my understanding, they shouldn't be used for inference;
the text model also has an Inv operation.
For these reasons, I have serious doubts about my diagnosis. So my final questions are:
do you have another diagnosis?
if mine is correct, is it possible to replace Inv operations with Reciprocal operations, or do you have any other solution?
After a thorough examination of the output of relevant nodes, with the help of Tensorboard, I am now pretty certain that the renaming of Inv to Reciprocal has nothing to do with my problem.
It appears that the last batch normalization layer eliminates almost all variance in its output when the input varies. I will ask why elsewhere.