I have 2 GPU machine available to use with ID 2 and 3, and would like to use them all to fit model. Here is my code,
os.environ['CUDA_VISIBLE_DEVICES'] = '2, 3'
with tf,device('/gpu:2'):
critic_model.fit(x,y,epochs =10)
with tf.device('/gpu:3'):
history = model.fit(x,y,epochs=19)
However, when I check nvidia-smi, I found only machine 2 is utilized, I wonder why ?
Any idea could be helpful !
Multiple problems here:
For one, Tensorflow has "its own" GPU numbering independent from the IDs on your machine. So when you pass CUDA_VISIBLE_DEVICES=2,3, Tensorflow will see those two GPUs, but they will be '/gpu:0' and '/gpu:1' in the program. Since neither '/gpu:2' nor '/gpu:3' exist, I suspect that all ops are simply put on '/gpu:0' or the CPU.
However, the main problem is that this is not how you use with tf.device at all. You need to wrap the model creation into the context manager. I.e. all the op calls such as tf.nn.conv2d, tf.matmul etc. need to be wrapped. At the point you call model.fit, the ops have already been created (and put on '/gpu:0' by default) and your with tf.device statement does nothing.
Related
I have 4 (or more) models (same structure but different training data). Now I want to ensemble them to make a prediction. I want to pre-load the models and then predict one input message (one message at a time) in parallel via multiprocess. However, the program always stops at "session.run" step. I could not figure it out why.
I tried passing all arguments to the function in each process, as shown in the code below. I also tried using a Queue object and put all the data (except the model object) in the queue. I also tried to set the number of process to 1. It made no difference.
with Manager() as manager:
first_level_test_features=manager.list()
procs =[]
for id in range(4):
p = Process(target=predict, args=(id, (message, models, configs, vocabs, emoji_dict,first_level_test_features)))
procs.append(p)
p.start()
for p in procs:
p.join()
I did not get any error message since it is just stuck there. I would expect the program can start multiple processes and each process uses the model pass to it to make the prediction.
I am unsure how session sharing along different Processes would work, and this is probably where your issue comes from. Given the way TensorFlow works, I would advise implementing the ensemble call as a graph operation, so that it can be run through a single session.run call, with TF handling the parallelization of computations wherever possible.
In practice, if you have symbolic tensors representing the models' predictions, you could use a TF operation to aggregate them (tf.concat, tf.reduce_mean, tf.add_n... whichever suits your design) and end up with a single symbolic tensor representing the ensemble prediction.
I hope this helps; if not, please provide some more details as to what your setting is, notably which form your models have.
When running tensorflow benchmarks from terminal, there are a couple of parameters we can specify. There is a parameter called gradient_repacking. What does it represent and how would one think about setting it?
python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 \
--model=resnet50 --optimizer=momentum --variable_update=replicated \
--nodistortions --gradient_repacking=8 --num_gpus=8 \
--num_epochs=90 --weight_decay=1e-4 --data_dir=${DATA_DIR} --use_fp16 \
--train_dir=${CKPT_DIR}
For those searching in the future, gradient_repacking affects all-reduce in replicated mode. From the flags definition:
flags.DEFINE_integer('gradient_repacking', 0, 'Use gradient repacking. It'
'currently only works with replicated mode. At the end of'
'of each step, it repacks the gradients for more efficient'
'cross-device transportation. A non-zero value specifies'
'the number of split packs that will be formed.',
lower_bound=0)
As for the optimal, I've seen gradient_repacking=8 as you have and gradient_repacking=2.
My best guess is the parameter refers to the number of shards the gradients get broken down into for sharing among other workers. Eight in this case would seem to mean each GPU shares with each other GPU (i.e. all-to-all) (for your num_gpus=8) while 2 would mean sharing only with neighbors in a ring fashion.
Given that Horovod uses its own all reduce algorithm, it makes sense that setting gradient_repacking has no effect when --variable_update=horovod.
Question:As the title saied, I wander When should we use the config place_pruned_graph in GraphOptions. What's the purpose of this config?
I'm not clear to the comment about this config:
// Only place the subgraphs that are run, rather than the entire graph.
//
// This is useful for interactive graph building, where one might
// produce graphs that cannot be placed during the debugging
// process. In particular, it allows the client to continue work in
// a session after adding a node to a graph whose placement
// constraints are unsatisfiable.
We know that Tensorflow will partition a entire graph into several subgraphs in normal. And the following code from CreateGraphs of direct_session.cc takes the else branch in normal.(as far as I can see, I never found the case taking the if branch(so I don't know when should we trigger it).
if (options_.config.graph_options().place_pruned_graph()) {
// Because we are placing pruned graphs, we need to create a
// new SimpleGraphExecutionState for every new unseen graph,
// and then place it.
SimpleGraphExecutionStateOptions prune_options;
prune_options.device_set = &device_set_;
prune_options.session_options = &options_;
prune_options.stateful_placements = stateful_placements_;
TF_RETURN_IF_ERROR(SimpleGraphExecutionState::MakeForPrunedGraph(
execution_state_->original_graph_def().library(), prune_options,
execution_state_->original_graph_def(), subgraph_options,
&temp_exec_state_holder, &client_graph));
execution_state = temp_exec_state_holder.get();
} else {
execution_state = execution_state_.get();
TF_RETURN_IF_ERROR(
execution_state->BuildGraph(subgraph_options, &client_graph));
}
The short answer? Never. The longer answer requires me to explain why this option exists at all.
So why does TensorFlow include this convoluted configuration option and logic to handle it? It's a historical accident that came about when tensorflow::DirectSession and tensorflow::GrpcSession had different internal implementations:
The tensorflow::GrpcSession used a single SimpleGraphExecutionState for the entire graph in a session. The net effect of this was that the placer—which is responsible for assigning devices to each node in the graph—would run before the graph was pruned.
The tensorflow::DirectSession originally used one SimpleGraphExecutionState for each pruned subgraph, with some special logic for sharing the placements of stateful nodes between invocations. Therefore the placer would run after the graph was pruned, and could make different decisions about where to place stateful nodes.
The benefit of the tensorflow::GrpcSession approach (place_pruned_graph = false) is that it takes into account all of the colocation constraints in the graph when running the placement algorithm, even if they don't occur in the subgraph being executed. For example, if you had an embedding matrix, and wanted to optimize it using the SparseApplyAdagrad op (which only has a CPU implementation), TensorFlow would figure out that the embedding matrix should be placed on CPU.
By contrast, if you specified no device for the embedding matrix and set placed_pruned_graph = true the matrix would (most likely) be placed on GPU when you ran its initializer, because all of the ops in the initialization subgraph would be runnable on GPU. And, since variables cannot move between devices, TensorFlow would not be able to issue the subgraph that ran SparseApplyAdagrad on the matrix. This was a real issue in the earliest version of TensorFlow.
So why support place_pruned_graph = true at all? It turns out that it is useful when using TensorFlow interactively. The placed_pruned_graph = false option is unforgiving: once the graph for a session contains a node that cannot be placed, that session is useless, because the placement algorithm runs on the whole graph, it would fail every time it is invoked, and therefore no steps could run. When you use a tf.InteractiveSession, we assume that you are using a REPL (or Jupyter notebook) and that it's beneficial to allow you to continue after making such a mistake. Therefore in a tf.InteractiveSession we set place_pruned_graph = true so that you can continue to use the session after adding an unplaceable node (as long as you don't try to run that node in a pruned subgraph).
There is probably a better approach than place_pruned_graph = true for interactive use, but we haven't investigated adding one. Suggestions are always welcome on the GitHub issues page.
I am working on an image classification problem with tensorflow. I have 2 different CNNs trained separately (in fact 3 in total but I will deal with the third later), for different tasks and on a AWS (Amazon) machine. One tells if there is text in the image and the other one tells if the image is safe for work or not. Now I want to use them in a single script on my computer, so that I can put an image as input and get the results of both networks as output.
I load the two graphs in a single tensorflow Session, using the import_meta_graph API and the import_scope argument and putting each subgraph in a separate scope. Then I just use the restore method of the created saver, giving it the common Session as argument.
Then, in order to run inference, I retrieve the placeholders and final output with graph=tf.get_default_graph() and my_var=graph.get_operation_by_name('name').outputs[0] before using it in sess.run (I think I could just have put 'name' in sess.run instead of fetching the output tensor and putting it in a variable, but this is not my problem).
My problem is the text CNN works perfectly fine, but the nsfw detector always gives me the same output, no matter the input (even with np.zeros()). I have tried both separately and same story: text works but not nsfw. So I don't think the problem comes from using two networks simultaneaously.
I also tried on the original AWS machine I trained it on, and this time the nsfw CNN worked perfectly.
Both networks are very similar. I checked on Tensorboard if everything was fine and I think it is ok. The differences are in the number of hidden units and the fact that I use batch normalization in the nsfw model and not in the text one. Now why this title ? I observed that I had a warning when running the nsfw model that I didn't have when using only the text model:
W tensorflow/core/framework/op_def_util.cc:332] Op Inv is deprecated. It will cease to work in GraphDef version 17. Use Reciprocal.
So I thougt maybe this was the reason, everything else being equal. I checked my GraphDef version, which seems to be 11, so Inv should still work in theory. By the way the AWS machine use tensroflow version 0.10 and I use version 0.12.
I noticed that the text network only had one Inv operation (via a filtering on the names of the operations given by graph.get_operations()), and that the nsfw model had the same operation plus multiple Inv operations due to the batch normalization layers. As precised in the release notes, tf.inv has simply been renamed to tf.reciprocal, so I tried to change the names of the operations to Reciprocal with tf.group(), as proposed here, but it didn't work. I have seen that using tf.identity() and changing the name could also work, but from what I understand, tensorflow graphs are an append-only structure, so we can't really modify its operations (which seems to be immutable anyway).
The thing is:
as I said, the Inv operation should still work in my GraphDef version;
this is only a warning;
the Inv operations only appear under name scopes that begin with 'gradients' so, from my understanding, this shouldn't be used for inference;
the text model also have an Inv operation.
For these reasons, I have a big doubt on my diagnosis. So my final questions are:
do you have another diagnosis?
if mine is correct, is it possible to replace Inv operations with Reciprocal operations, or do you have any other solution?
After a thorough examination of the output of relevant nodes, with the help of Tensorboard, I am now pretty certain that the renaming of Inv to Reciprocal has nothing to do with my problem.
It appears that the last batch normalization layer eliminates almost any variance of its output when the inputs varies. I will ask why elsewhere.
My desktop has two gpus which can run Tensorflow with specification /gpu:0 or /gpu:1. However, if I don't specify which gpu to run the code, Tensorflow will by default to call /gpu:0, as we all know.
Now I would like to setup the system such that it can assign gpu dynamically according to the free memory of each gpu. For example, if a script doesn't specify which gpu to run the code, the system first assigns /gpu:0 for it; then if another script runs now, it will check whether /gpu:0 has enough free memory. If yes, it will continue assign /gpu:0 to it, otherwise it will assign /gpu:1 to it. How can I achieve it?
Follow-ups:
I believe the question above may be related to the virtualization problem of GPU. That is to say, if I can virtualize multi-gpu in a desktop into one GPU, I can get what I want. So beside any setup methods for Tensorflow, any ideas about virtualization is also welcome.
TensorFlow generally assumes it's not sharing GPU with anyone, so I don't see a way of doing it from inside TensorFlow. However, you could do it from outside as follows -- shell script that calls nvidia-smi, parses out GPU k with more memory, then sets "CUDA_VISIBLE_DEVICES=k" and calls TensorFlow script
Inspired by:
How to set specific gpu in tensorflow?
def leave_gpu_with_most_free_ram():
try:
command = "nvidia-smi --query-gpu=memory.free --format=csv"
memory_free_info = _output_to_list(sp.check_output(command.split()))[1:]
memory_free_values = [int(x.split()[0]) for i, x in enumerate(memory_free_info)]
least_busy_idx = memory_free_values.index(max(memory_free_values))
# update CUDA variable
gpus =[least_busy_idx]
setting = ','.join(map(str, gpus))
os.environ["CUDA_VISIBLE_DEVICES"] = setting
print('Left next %d GPU(s) unmasked: [%s] (from %s available)'
% (leave_unmasked, setting, str(available_gpus)))
except FileNotFoundError as e:
print('"nvidia-smi" is probably not installed. GPUs are not masked')
print(e)
except sp.CalledProcessError as e:
print("Error on GPU masking:\n", e.output)
Add a call to this function before importing tensorflow