I have a network implemented in TensorFlow that takes a very long time to train, so I want to profile it to see which parts cause the long runtime.
To do that, I follow the instructions here to capture runtime and memory information. My code looks like this:
# define the network
loss = ...
train_op = tf.train.AdamOptimizer().minimize(loss, global_step=global_step)

# run forward and backward prop for one batch
run_metadata = tf.RunMetadata()
options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
_, loss, sum = sess.run([train_op, loss, sum], feed_dict=fd, options=options, run_metadata=run_metadata)
writer.add_run_metadata(run_metadata, 'step_%d' % step)
I can then see "session runs" in TensorBoard. However, as soon as I load a session run, most operations in my graph turn orange as shown below, and no runtime or memory information is available for them:
According to the legend, these operations are "unused". But that cannot be the case: almost everything except "loss" and "opt" is shown like that, and clearly the whole network has to be used to compute the loss. So I don't see why the graph is displayed this way.
I use TF 1.3 on a Tesla K40c.
I used to have the same problem as you, with TensorBoard not registering anything in my session run except the gradient and optimizer ops.
I fixed it by upgrading my version of TensorFlow to the 1.4 release.
I am not sure it is required, but also try adding this line
writer.add_summary(_, step)
after
writer.add_run_metadata(...)
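For reference, here is roughly how the two calls fit together in TF 1.x. This is only a sketch; summary_op, writer, fd, and step are assumed to be defined as in the question, and the fetched values get fresh names so the loss and summary tensors are not overwritten:
run_metadata = tf.RunMetadata()
options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
_, loss_value, summary = sess.run([train_op, loss, summary_op],
                                  feed_dict=fd, options=options,
                                  run_metadata=run_metadata)
writer.add_run_metadata(run_metadata, 'step_%d' % step)  # runtime/memory stats for this step
writer.add_summary(summary, step)                        # evaluated summaries for the same step
writer.flush()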
I am struggling with the following. I am creating a tf.data.Dataset using the from_generator method, and I perform these actions on the CPU because I don't want to overload my GPU memory.
The dataset consists of tuples containing a 1-D tf.bool mask (tf.Tensor) with fixed length and a 2-D tf.float32 matrix (tf.Tensor) with variable size. The loss function is decorated with the following decorator, so I would not assume the variable size is the problem:
@tf.function(experimental_relax_shapes=True)
Ideally, the dataset is kept on the CPU but prefetched onto the GPU.
def gen():
    for i, j in zip(mask_list, wmat_list):
        yield i, j

dataset = tf.data.Dataset.from_generator(gen, output_types=(tf.bool, tf.float32))
The main training loop currently relies on tf.identity to move the data to the GPU, which is inefficient. As shown in the TensorBoard screenshot below, roughly 70% of the time is spent loading the data and moving it to the GPU.
for b, (mask, wmat) in enumerate(dataset):
    with tf.GradientTape() as tape:
        mask = tf.identity(mask)
        wmat = tf.identity(wmat)
        mean_error, loss = self.model.loss(mask, wmat)
    epoch_loss += loss.numpy()
    epoch_mean_error += mean_error.numpy()
I have tried the prefetch_to_device function. However, it did not move the data onto the GPU, as verified by printing e.g. mask.device in the training loop.
gpu_transform = tf.data.experimental.prefetch_to_device('/gpu')
dataset = dataset.apply(gpu_transform)
To me this resembles the following bug: https://github.com/tensorflow/tensorflow/issues/30929 . However, that issue is marked as solved and is over a year old.
I am running TF 2.3 using the official Docker image.
I have found the solution to my own question.
The problem was that the tuples in the dataset contained numpy arrays rather than tf.Tensors. Therefore, the pipeline was probably limited by the functionality of py_func().
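A minimal sketch of the fix, assuming mask_list and wmat_list are lists of numpy arrays as before: convert each element to a tensor before yielding it.
def gen():
    for mask, wmat in zip(mask_list, wmat_list):
        # yield tf.Tensors instead of raw numpy arrays
        yield tf.convert_to_tensor(mask, dtype=tf.bool), tf.convert_to_tensor(wmat, dtype=tf.float32)

dataset = tf.data.Dataset.from_generator(gen, output_types=(tf.bool, tf.float32))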
The screenshot below shows that the pipeline no longer blocks on the CPU. However, there is still a considerable MemCpy, and prefetch_to_device() still does not do anything. This is likely due to a known issue which should be fixed in TF 2.4:
https://github.com/tensorflow/tensorflow/issues/35563
The (unconfirmed) suggested workaround also did not work for me (see edit):
with tf.device("/gpu:0"):
    ds = ds.prefetch(1)
EDIT:
I have investigated this issue further and filed a bug report. It now seems that the suggested workaround does do something, but I am not sure whether it prefetches completely in time:
https://github.com/tensorflow/tensorflow/issues/43905
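For completeness, another pattern that is sometimes suggested for this kind of problem (I have not verified that it helps here) is to copy the dataset to the GPU explicitly and then prefetch on that device:
# Sketch: copy batches to the GPU, then prefetch on the GPU device
dataset = dataset.apply(tf.data.experimental.copy_to_device("/gpu:0"))
with tf.device("/gpu:0"):
    dataset = dataset.prefetch(1)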
I have been struggling for days just to view layers' gradients in the debug mode of Keras 2. Needless to say, I have already tried code such as:
import keras.backend as K
gradients = K.gradients(model.output, model.input)
sess = tf.compat.v1.keras.backend.get_session()
evaluated_gradients = sess.run(gradients, feed_dict={model.input:images})
or
evaluated_gradients = sess.run(gradients, feed_dict={model.input.experimental_ref(): images})
or
with tf.compat.v1.Session(graph=tf.compat.v1.keras.backend.get_default_graph())
or similar approaches using tf.compat.v1, which all lead to the following error:
RuntimeError: The Session graph is empty. Add operations to the graph
before calling run().
I would assume this is the most basic tool any deep learning package should provide, so it is strange that there seems to be no easy way to do it in Keras 2. Any ideas?
You can try to do this on TF 2 with eager mode on.
Note that you need to use tf.keras for everything, including your model and layers. For this to work you can never use standalone keras; it must be tf.keras. This means, for instance, using tf.keras.layers.Dense, tf.keras.models.Sequential, etc.
input_images_tensor = tf.constant(input_images_numpy)
with tf.GradientTape() as g:
    g.watch(input_images_tensor)
    output_tensor = model(input_images_tensor)

gradients = g.gradient(output_tensor, input_images_tensor)
If you are going to calculate gradients more than once with the same tape, you need to create the tape with persistent=True and delete it manually after you get the gradients (see details in the link below).
You can get gradients with respect to any "trainable" weight without needing watch. If you are going to get gradients with respect to non-trainable tensors (such as the input images), then you must call g.watch as above for each of these variables.
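A minimal sketch of both cases, using the same model and input_images_tensor as above (the loss here is just an illustrative scalar):
with tf.GradientTape(persistent=True) as g:
    g.watch(input_images_tensor)              # needed: the inputs are not trainable variables
    output_tensor = model(input_images_tensor)
    loss = tf.reduce_mean(output_tensor)      # hypothetical scalar, only for illustration

input_grads = g.gradient(output_tensor, input_images_tensor)  # w.r.t. the watched inputs
weight_grads = g.gradient(loss, model.trainable_variables)    # w.r.t. trainable weights, no watch needed
del g  # free the persistent tape once you are done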
More details on GradientTape: https://www.tensorflow.org/api_docs/python/tf/GradientTape
I'm trying to run model scoring (the inference graph) from the TensorFlow Object Detection API on multiple GPUs. I tried specifying the GPU number in the main function, but it runs on only a single GPU. I have placed a GPU utilization snapshot here.
I am using tensorflow-gpu==1.13.1. Can you kindly point out what I'm missing here?
for i in range(2):
    with tf.device('/gpu:{}'.format(i)):
        tf_init()
        init = tf.global_variables_initializer()
        with detection_graph.as_default():
            with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
                # call to the run_inference_multiple_images function
The responses to this question should give you a few options for fixing this.
Usually TensorFlow will occupy all visible GPUs unless told otherwise, so if you haven't already tried it, you could simply remove the with tf.device line (assuming you only have the two GPUs) and TensorFlow should use them both.
Otherwise, I think the easiest fix is to set the environment variable with os.environ["CUDA_VISIBLE_DEVICES"] = "0,1".
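For example (a hypothetical snippet; the variable must be set before TensorFlow initializes the GPUs, ideally before importing it):
import os
# Expose both GPUs to TensorFlow; set this before the first import of tensorflow
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import tensorflow as tf  # TF will now see GPU 0 and GPU 1 and occupy both by default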
In order to use the contrib.learn.Estimator for multi-GPU training, I am attempting to specify GPU assignments in my model_fn.
In pseudo-code:
def model_fn(X, y):
    with tf.device('/gpu:1'):
        ... # various tensorflow ops for model ...
    return predictions, loss, train_op
Everything works fine without the tf.device('/gpu:1') call, but with it I encounter the following error:
InvalidArgumentError (see above for traceback): Cannot assign a device to
node 'save/ShardedFilename_1': Could not satisfy explicit device
specification '/device:GPU:1' because no supported kernel
for GPU devices is available.
I do not believe that I am adding the offending op to the graph myself, but rather that it is injected through the Estimator's snapshot functionality.
I believe the solution is to set allow_soft_placement=True so that ops without a GPU kernel fall back to the CPU, but it's not obvious to me how that is exposed when dealing with contrib.learn.Estimator.
I see that the option is usually set in a ConfigProto and passed to the session, but I've been using the Estimator's functionality to manage the session for me. Should I take control of session creation myself, or am I missing a parameter somewhere that accomplishes this?
Many thanks in advance for any advice.
This is fixed along with Estimator leaving contrib in TensorFlow 1.0.
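With the core tf.estimator API you can pass the session configuration through a RunConfig. A rough sketch, assuming a TF 1.x release whose tf.estimator.RunConfig accepts session_config and a model_fn like the one above:
session_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
run_config = tf.estimator.RunConfig(session_config=session_config)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)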
I'd like to investigate device placement in TensorBoard, using the following code to generate the graph in the summary:
# Build the summary operation based on the TF collection of Summaries.
summary_op = tf.merge_all_summaries()
saver = tf.train.Saver(tf.all_variables())
summary_writer = tf.train.SummaryWriter(log_directory, graph_def=sess.graph_def)
This works for displaying the graph and the summaries defined in the graph. But when selecting 'device placement' in TensorBoard, all nodes are assigned to 'unknown device'. Do I need to dump the device placement in some other way?
The TensorBoard graph visualizer only sees the explicit device assignments that you have made in your program (i.e. those made using with tf.device("..."): blocks).
The reason for this is that the nodes in a TensorFlow graph are assigned to devices in multiple stages. The first stage, in the client (e.g. your Python program), allows you to explicitly (and optionally) assign devices to each node, and it is the output of this stage that is written to the TensorBoard logs. A later placement stage runs inside the TensorFlow backend and assigns every node to a device.
I suspect you want to analyze the results of the later placement stage. Currently there is no support for this in TensorBoard, but you can extract some information by creating the tf.Session as follows:
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
…and then the device placement decisions will be logged to stderr.
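If you want placements to show up in the TensorBoard graph itself, the only option for now is to make the assignments explicit in your program. A hypothetical example, where images, weights, and biases stand in for your own tensors:
# Ops created inside this block carry an explicit device assignment,
# which the TensorBoard graph visualizer can display
with tf.device('/gpu:0'):
    logits = tf.matmul(images, weights) + biases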