I would like to use TensorBoard to visualize the evolution of the loss over a validation sample. But the validation set is too large to process in one minibatch. Therefore, to compute my validation loss, I have to call session.run several times over several minibatches covering the validation set. Then I sum the losses of the minibatches (in Python) to obtain the full validation loss.
My problem is that tf.scalar_summary seems to have to be attached to a TensorFlow node. But I would need to somehow "attach" it to the sum of the values of a node over several calls to session.run.
Is there a way to do that? Maybe by directly summarizing the Python float that contains the sum of the minibatch losses? But I have not seen in the docs a way to "summarize" for TensorBoard a Python value that is outside of a computation. The example in the "How-To" section of the docs is only concerned with losses that can be computed in a single call to session.run.

You could add a Variable that is updated on each sess.run call and have the summary track the value of the Variable.
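A minimal sketch of the bookkeeping behind that suggestion, written in plain Python rather than TensorFlow so it runs anywhere. The class RunningLoss, the helper minibatches, and the loss values are all hypothetical; in a real graph, total and count would presumably be Variables that each sess.run call adds to, and the summary would be attached to them.

```python
class RunningLoss:
    """Accumulates a summed loss across minibatches (stand-in for a Variable)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, batch_losses):
        # One call per minibatch, mirroring one session.run per minibatch.
        self.total += sum(batch_losses)
        self.count += len(batch_losses)

    def mean(self):
        return self.total / self.count

    def reset(self):
        # Call before each new pass over the validation set.
        self.total = 0.0
        self.count = 0


def minibatches(items, batch_size):
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical per-example validation losses; the last minibatch is smaller.
per_example_losses = [0.5, 1.5, 1.0, 2.0, 3.0]

running = RunningLoss()
for batch in minibatches(per_example_losses, batch_size=2):
    running.update(batch)

full_validation_loss = running.total
mean_validation_loss = running.mean()
```

Tracking the example count alongside the sum keeps the mean correct even when the last minibatch is smaller than the others.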


What does stateful mean in tensorflow metrics in my case?

I don't really understand the explanation of a stateful metric here: Keras metrics with TF backend vs tensorflow metrics
Now, if I split my evaluation data into batches and use tf.metrics.precision on each batch, does it mean that the variables from previous batches (the false-positive counter, etc.) are reused in the calculation for the next batch? That would be really bad, since I want a separate evaluation for each batch (that is why I do the split!)
If this is the case, how can I reset the variables for each batch?
I need the individual values from each batch to compute a mean afterwards.
The reason tf.metrics.Precision and the like (Recall, etc.) store true/false positives is that these metrics should not be estimated batch-wise (unlike Accuracy or Loss, etc.). The original implementation of Precision in keras (note: not tf.keras) did exactly what you described (a single evaluation for each batch, aggregated afterward) but was removed in version 2.0.0 because this way of computing a global metric is "more misleading than helpful" (https://github.com/keras-team/keras/issues/5794).
But you can still do what you want: subclass tf.metrics.Metric and implement the Precision logic in the update_state method. The Metric API doc on the TensorFlow site has an example of a custom metric: https://www.tensorflow.org/api_docs/python/tf/keras/metrics/Metric
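To make the "misleading" point concrete, here is a small pure-Python sketch with made-up counts (the batches and the helper precision are hypothetical, not part of any TensorFlow API). It compares the mean of per-batch precisions with the precision computed from counts accumulated across batches, which is what a stateful metric does.

```python
def precision(true_positives, false_positives):
    return true_positives / (true_positives + false_positives)


# Hypothetical (true_positives, false_positives) counts for two batches.
batches = [(1, 0), (10, 90)]

# Stateless: one precision per batch, averaged afterwards.
per_batch = [precision(tp, fp) for tp, fp in batches]
mean_of_batches = sum(per_batch) / len(per_batch)

# Stateful: accumulate the counts across batches, then compute precision once.
total_tp = sum(tp for tp, _ in batches)
total_fp = sum(fp for _, fp in batches)
global_precision = precision(total_tp, total_fp)
```

With these counts the batch average is 0.55 while the global precision is 11/101, about 0.11, because the first batch contains a single positive prediction yet gets the same weight as the second. If per-batch values are still what you need, Keras metric objects can be cleared between batches (reset_states() in older TF 2.x releases, reset_state() in newer ones).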
I hope this is helpful!

Is there a way to measure the backward pass of a model?

There is already a relevant question here: TensorFlow: Is there a way to measure FLOPS for a model?
However, the answer given by @Tobias Scheck only gives the forward-pass stats.
Is there a way to measure/estimate the backward pass as well?
If you just want to get a quick number, you can simply add
grads = tf.gradients(C, [A, B])
to @Tobias Scheck's code to construct the gradient computation nodes. Then subtract the original number (without gradient ops) from the new number (with gradient ops) to get the estimated flops of the backward pass.
A word of caution about using this method in larger projects. This method uses static analysis of the whole graph. This has a few problems including:
The flops from ops in a while loop will be added only once.
Ops that are never normally run (some TF functionalities can leave garbage ops in the graph) will be added.
This analysis heavily depends on shape inference. It might not be available for all ops.
This analysis depends on registering functions that can estimate the flops of a given op. There can be ops without such functions and such functions don't precisely model the flops done by the actual kernel your TF will pick to execute the op.
For more info see: https://github.com/tensorflow/tensorflow/blob/r1.8/tensorflow/core/profiler/g3doc/profile_model_architecture.md
It is better to use this in conjunction with an actual run record (RunMetadata), or to use a purely runtime-based approach (e.g. "Can I measure the execution time of individual operations with TensorFlow?") and do some filtering/aggregation on the results.
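As a rough cross-check on the number the subtraction gives, the backward cost of a single matrix multiplication can be estimated by hand. This sketch assumes the common convention of 2*m*k*n flops for an (m x k) by (k x n) product; the function names and the shapes are illustrative and are not taken from the profiler or from the linked answer.

```python
def matmul_flops(m, k, n):
    # One multiply and one add per term of each inner product.
    return 2 * m * k * n


def matmul_backward_flops(m, k, n):
    # C = A @ B with A: (m, k), B: (k, n), upstream gradient dC: (m, n).
    grad_a = matmul_flops(m, n, k)  # dA = dC @ B^T
    grad_b = matmul_flops(k, m, n)  # dB = A^T @ dC
    return grad_a + grad_b


# Illustrative shapes (an assumption for this sketch).
m, k, n = 25, 16, 9
forward = matmul_flops(m, k, n)
backward = matmul_backward_flops(m, k, n)
```

Under this convention the backward pass of a plain matmul costs about twice the forward pass, so a graph-based estimate that is far from that ratio for a dense layer is worth a second look.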

Tensorflow input pipeline

I have an input pipeline where samples are generated on the fly. I use keras with a custom ImageDataGenerator and a corresponding Iterator to get samples in memory.
Under the assumption that keras in my setup is using feed_dict (and that assumption is itself a question for me), I am thinking of speeding things up by switching to raw tensorflow + Dataset.from_generator().
Here I see that the suggested solution for input pipelines that generate data on the fly in the most recent Tensorflow is to use Dataset.from_generator().
Does keras with the Tensorflow backend use the feed_dict method?
If I switch to raw tensorflow + Dataset.from_generator(my_sample_generator), will that cut the feed_dict memory copy overhead and buy me performance?
During the predict (evaluation) phase, apart from batch_x and batch_y, I also have an opaque index vector in my generator output. That vector corresponds to the sample ids in batch_x. Does that mean that I'm stuck with the feed_dict approach for the predict phase because I need that extra batch_z output from the iterator?
The new tf.contrib.data.Dataset.from_generator() can potentially speed up your input pipeline by overlapping the data preparation with training. However, you will tend to get the best performance by switching over to TensorFlow ops in your input pipeline wherever possible.
To answer your specific questions:
The Keras TensorFlow backend uses tf.placeholder() to represent compiled function inputs, and feed_dict to pass arguments to a function.
With the recent optimizations to tf.py_func() and feed_dict copy overhead, I suspect the amount of time spent in memcpy() will be the same. However, you can more easily use Dataset.from_generator() with Dataset.prefetch() to overlap the training on one batch with preprocessing on the next batch.
It sounds like you can define a separate iterator for the prediction phase. The tf.estimator.Estimator class does something similar by instantiating different "input functions" with different signatures for training and evaluation, then building a separate graph for each role.
Alternatively, you could add a dummy output to your training iterator (for the batch_z values) and switch between training and evaluation iterators using a "feedable iterator".
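The overlap that prefetching buys can be illustrated without TensorFlow. The sketch below is not how tf.data is implemented; it swaps in a plain background thread and a bounded queue to show the idea of producing the next batch while the current one is being consumed. The names prefetch and sample_generator are hypothetical, and the generator yields (batch_x, batch_y, batch_z) tuples to mirror the extra sample-id output from the question.

```python
import queue
import threading


def prefetch(generator, buffer_size=1):
    """Yield items from `generator`, produced ahead of time in a background thread."""
    sentinel = object()
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for item in generator:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            # Sketch only: an exception in the generator ends the stream silently.
            buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


def sample_generator(num_batches):
    # Hypothetical generator; batch_z carries the sample ids for batch_x.
    for i in range(num_batches):
        yield ([i, i + 1], [i % 2], [100 + i])


batches = list(prefetch(sample_generator(3), buffer_size=2))
```

The consumer sees the same batches in the same order; the only difference is that up to buffer_size batches are prepared while earlier ones are still being used.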

Unaggregated gradients / gradients per example in tensorflow

Given a simple mini-batch gradient descent problem on mnist in tensorflow (like in this tutorial), how can I retrieve the gradients for each example in the batch individually?
tf.gradients() seems to return gradients averaged over all examples in the batch. Is there a way to retrieve gradients before aggregation?
Edit: A first step towards this answer is figuring out at which point tensorflow averages the gradients over the examples in the batch. I thought this happened in _AggregatedGrads, but that doesn't appear to be the case. Any ideas?
tf.gradients returns the gradient of the loss with respect to the tensors you pass in. This means that if your loss is a sum of per-example losses, then the gradient is also the sum of per-example loss gradients.
The summing up is implicit. For instance if you want to minimize the sum of squared norms of Wx-y errors, the gradient with respect to W is 2(WX-Y)X' where X is the batch of observations and Y is the batch of labels. You never explicitly form "per-example" gradients that you later sum up, so it's not a simple matter of removing some stage in the gradient pipeline.
A simple way to get k per-example loss gradients is to use batches of size 1 and do k passes. Ian Goodfellow wrote up how to get all k gradients in a single pass; for this you would need to specify the gradients explicitly rather than rely on tf.gradients.
To partly answer my own question after tinkering with this for a while: it appears that it is possible to manipulate gradients per example while still working in batch by doing the following:
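For the least-squares example above, the explicit per-example gradients are easy to write down: example i contributes 2(Wx_i - y_i)x_i'. The NumPy sketch below (shapes, seed, and variable names are assumptions made for illustration, and it is not a reproduction of the write-up mentioned above) computes all of them in one pass as outer products and checks that they sum to the batch gradient 2(WX-Y)X'.

```python
import numpy as np

rng = np.random.default_rng(0)
num_features, num_outputs, batch_size = 4, 3, 5

W = rng.normal(size=(num_outputs, num_features))
X = rng.normal(size=(num_features, batch_size))  # one observation per column
Y = rng.normal(size=(num_outputs, batch_size))   # one label per column

# Batch gradient of sum_i ||W x_i - y_i||^2 with respect to W.
residual = W @ X - Y                             # shape (num_outputs, batch_size)
batch_grad = 2.0 * residual @ X.T

# Per-example gradients 2 (W x_i - y_i) x_i^T, all in one pass:
# an outer product of each residual column with the matching input column.
per_example_grads = 2.0 * np.einsum('oi,fi->iof', residual, X)

# Summing over the example axis recovers the batch gradient.
summed = per_example_grads.sum(axis=0)
```

Here per_example_grads has shape (batch_size, num_outputs, num_features), so the memory cost grows with the batch size, which is the price of keeping the gradients unaggregated.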
Create a copy of tf.gradients() that accepts an extra tensor/placeholder with example-specific factors
Create a copy of _AggregatedGrads() and add a custom aggregation method that uses the example-specific factors
Call your custom tf.gradients function and give your loss as a list of slices:
ys=[cross_entropy[i] for i in xrange(batch_size)],
But this will probably have the same complexity as doing individual passes per example, and I need to check if the gradients are correct :-).
One way of retrieving gradients before aggregation is to use the grad_ys parameter of tf.gradients. A good discussion is found here:
Use of grads_ys parameter in tf.gradients - TensorFlow
I haven't been working with Tensorflow a lot lately, but here is an open issue tracking the best way to compute unaggregated gradients:
Users (including myself) have posted many sample-code solutions there that you can try based on your needs.

Caching Computations in TensorFlow

Is there a canonical way to reuse computations from a previously-supplied placeholder in TensorFlow? My specific use case:
supply many inputs (using one placeholder) simultaneously, all of which are fed through a network to obtain smaller representations
define a loss based on various combinations of these smaller representations
train on one batch at a time, where each batch uses some subset of the inputs, without recomputing the smaller representations
Here is the goal in code, which is defective because the same computations are carried out again and again:
X_in = some_fixed_data
combinations_in = large_set_of_combination_indices
for combination_batch_in in batches(combinations_in, batch_size=128):
    session.run(train_op, feed_dict={X: X_in, combinations: combination_batch_in})
The canonical way to share computed values across sess.run() calls is to use a Variable. In this case, you could set up your graph so that when the Placeholders are fed, they compute a new value of the representation that is saved into a Variable. A separate portion of the graph reads those Variables to compute the loss. This will not work if you need to compute gradients through the part of the graph that computes the representation. Computing those gradients will require recomputing every Op in the encoder.
This is the kind of thing that should be solved automatically with CSE (common subexpression elimination). I'm not sure what the support in TensorFlow is right now (it might be kind of spotty), but there's an optimizer_do_cse flag in the graph options, which defaults to false, and you can set it to true using GraphConstructorOptions. Here's a C++ example of using GraphConstructorOptions (sorry, couldn't find a Python one).
If that doesn't work, you could do "manual CSE", i.e., figure out which part is being needlessly recomputed, factor it out into a separate Tensor, and reference that tensor in all the calculations.
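The "compute once, reuse per batch" idea can be sketched in NumPy, under the assumption that the encoder is frozen (as noted above, caching does not help if gradients must flow back through the encoder). Everything here is hypothetical: the encode function, the pairwise squared-distance loss, and the data stand in for the real network and combinations.

```python
import numpy as np

rng = np.random.default_rng(0)


def encode(inputs, weights):
    # Stand-in for the expensive network that maps inputs to smaller representations.
    encode.calls += 1
    return np.tanh(inputs @ weights)


encode.calls = 0


def batches(items, batch_size):
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


X_in = rng.normal(size=(10, 8))            # hypothetical fixed data
encoder_weights = rng.normal(size=(8, 2))  # hypothetical frozen encoder
combinations_in = [(i, j) for i in range(10) for j in range(i + 1, 10)]

# Compute the representations once, outside the batch loop.
representations = encode(X_in, encoder_weights)

# Each batch only indexes into the cached array.
batch_losses = []
for combination_batch in batches(combinations_in, batch_size=16):
    left, right = np.array(combination_batch).T
    diff = representations[left] - representations[right]
    batch_losses.append(float(np.sum(diff ** 2)))
```

The encoder runs exactly once regardless of how many combination batches follow; in a TensorFlow graph the cached array would play the role of the Variable that the loss portion of the graph reads.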