Not passing a value for a placeholder in TensorFlow: why isn't it allowed?

I had 3 models that use the same input but produce 3 distinct outputs (1 classification, 2 regression). I combined 2 of the 3 into 1 model with 2 loss functions and saw a significant improvement in accuracy/RMSE.
I'm trying to combine the 3rd loss function into the model, so I have 1 model with 3 loss functions that share many parameters.
However, the 3rd loss function only applies to half the data. I tried standardizing the labels to zero mean and unit variance and using 0 for the labels where loss function C doesn't apply, but that biased results towards 0 in some cases.
I'm now experimenting with alternating optimization: loss functions A & B together on a batch from the full dataset, versus all 3 loss functions A, B, & C on a batch appropriate for loss C (and for A & B). In the context of my problem this makes sense.
My Question:
TensorFlow appears to require that every placeholder defined in the graph be fed, even though I'm not using that tensor in this particular optimization step. Is this expected behavior? Should I just pass in a dummy value to appease TF here? I wonder if I'm missing an important detail.
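For context: sess.run() only requires values for the placeholders that the fetched ops actually depend on. A minimal TF 1.x sketch with hypothetical tensors illustrating the expected behavior:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 3])
y_c = tf.placeholder(tf.float32, [None])

loss_ab = tf.reduce_mean(x)     # depends only on x
loss_c = tf.reduce_mean(y_c)    # depends only on y_c

with tf.Session() as sess:
    # Runs fine without feeding y_c; an error here usually means some
    # fetched op silently depends on the "unused" placeholder.
    print(sess.run(loss_ab, feed_dict={x: [[1., 2., 3.]]}))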

The dependency came from TensorBoard: I had a summary operation attached to all the loss functions, forcing them to be executed.
I split out my summary operations into groups using tf.add_to_collection() to gather different summary ops, then used a for loop to add them to the list of tensors to process as appropriate.
It was that, plus one other dependency that was simply a bug I found. @Sygi and @Fake are correct: you shouldn't need to pass in a value that isn't used in a particular computation just because it exists in the graph.
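A minimal sketch of that split, with hypothetical loss tensors standing in for the real ones:

import tensorflow as tf

loss_a = tf.constant(1.0)  # stand-ins for the real loss tensors
loss_b = tf.constant(2.0)
loss_c = tf.constant(3.0)

# Group the summary ops into collections so a step that only optimizes
# A and B never pulls in loss C's summary (and hence C's placeholders).
tf.add_to_collection('summaries_ab', tf.summary.scalar('loss_a', loss_a))
tf.add_to_collection('summaries_ab', tf.summary.scalar('loss_b', loss_b))
tf.add_to_collection('summaries_c', tf.summary.scalar('loss_c', loss_c))

# Merge each group separately; fetch only the merged op for the current step.
summary_ab = tf.summary.merge(tf.get_collection('summaries_ab'))
summary_abc = tf.summary.merge(
    tf.get_collection('summaries_ab') + tf.get_collection('summaries_c'))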

Related

How to calculate the KL divergence for two multivariate pandas dataframes

I am training a Gaussian-Process model iteratively. In each iteration, a new sample is added to the training dataset (Pandas DataFrame), and the model is re-trained and evaluated. Each row of the dataset comprises 5 independent variables + the dependent variable. The training ends after 150 iterations (150 samples), but I want to extend this behaviour so the training can automatically stop after a number of iterations for which no meaningful information is added to the model.
My first approach is to compare the distribution of the last 10 samples to that of the previous 10. If the distributions are very similar, I assume that no meaningful knowledge has been added in the last 10 iterations, so I abort the training.
I thought of using Kullback-Leibler divergence, but I am not sure if this can be used for multivariate distributions. Should I use it? If so, how?
Additionally, is there any other better/smarter way to proceed?
Thanks
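If each 10-sample window is approximated by a multivariate Gaussian, the KL divergence between the two fits has a closed form. A NumPy sketch under that assumption (df stands in for the asker's DataFrame):

import numpy as np

def gaussian_kl(X0, X1, eps=1e-6):
    # KL(N0 || N1) between multivariate Gaussians fitted to the two
    # sample windows X0 and X1 (rows = samples, columns = variables).
    k = X0.shape[1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Regularize the covariances so tiny windows stay invertible.
    S0 = np.cov(X0, rowvar=False) + eps * np.eye(k)
    S1 = np.cov(X1, rowvar=False) + eps * np.eye(k)
    S1_inv = np.linalg.inv(S1)
    d = mu1 - mu0
    logdet0 = np.linalg.slogdet(S0)[1]
    logdet1 = np.linalg.slogdet(S1)[1]
    return 0.5 * (np.trace(S1_inv @ S0) + d @ S1_inv @ d
                  - k + logdet1 - logdet0)

# e.g. stop once gaussian_kl(df.iloc[-10:].values, df.iloc[-20:-10].values)
# stays below a chosen threshold for several iterations.

Note that with only 10 samples in 6 dimensions the covariance estimates are noisy, which is why the regularization term matters.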

Is there a way to measure the backward pass of a model?

There is a relevant question here already: TensorFlow: Is there a way to measure FLOPS for a model?
However, the answer given by @Tobias Scheck covers only the forward-pass stats.
Is there a way to measure/estimate the backward pass as well?
If you just want to get a quick number, you can simply add
grads = tf.gradients(C, [A, B])
to @Tobias Scheck's code to construct the gradient computation nodes. Then, subtract the original count (without gradient ops) from the new one (with gradient ops) to get the estimated FLOPs of the backward pass.
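A sketch of that subtraction using the TF 1.x profiler API, with A, B, C as stand-in tensors:

import tensorflow as tf

def count_flops(graph):
    # Static FLOP count via the TF 1.x profiler.
    opts = tf.profiler.ProfileOptionBuilder.float_operation()
    return tf.profiler.profile(graph, options=opts).total_float_ops

g = tf.Graph()
with g.as_default():
    A = tf.Variable(tf.random_normal([25, 16]))
    B = tf.Variable(tf.random_normal([16, 9]))
    C = tf.matmul(A, B)
    forward_flops = count_flops(g)     # forward ops only

    grads = tf.gradients(C, [A, B])    # adds the backward (gradient) ops
    total_flops = count_flops(g)       # forward + backward

print('estimated backward-pass FLOPs:', total_flops - forward_flops)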
A word of caution about using this method in larger projects: it relies on static analysis of the whole graph, which has a few problems, including:
The FLOPs from ops inside a while loop are counted only once, no matter how many iterations actually run.
Ops that are never normally run (some TF functionalities can leave garbage ops in the graph) are still counted.
The analysis depends heavily on shape inference, which may not be available for all ops.
The analysis relies on registered functions that estimate the FLOPs of a given op. Some ops lack such a function, and the functions that do exist don't precisely model the FLOPs performed by the actual kernel TensorFlow picks to execute the op.
For more info see: https://github.com/tensorflow/tensorflow/blob/r1.8/tensorflow/core/profiler/g3doc/profile_model_architecture.md
It is better to use this in conjunction with an actual run record (RunMetadata), or to use a purely runtime-based approach, e.g. Can I measure the execution time of individual operations with TensorFlow?, and do some filtering/aggregation on the results.

Keras model returns different values

To play with data, I have trained a linear regression with Keras+TensorFlow, and compared the first prediction computed in 3 different ways:
I got the weights from the model, and just used the linear regression formula p = w*X0 + b
I got predictions using the model.predict(X) method of Keras for the whole data array X and then took only the first element of it
I got prediction using the same method only for the first row of features X0 (the first sample)
In theory, all those methods should produce exactly the same value. In practice, however, I get values that are slightly different.
The difference is not that big, but I still wonder why that is the case. Is it only due to floating-point precision in Python?
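A minimal sketch of the three computations, using a hypothetical untrained one-layer linear model (training doesn't affect the comparison):

import numpy as np
from tensorflow import keras

X = np.random.randn(256, 5).astype(np.float32)
model = keras.Sequential([keras.layers.Dense(1, input_shape=(5,))])

w, b = model.layers[0].get_weights()
p_formula = X[0] @ w + b            # (1) manual formula p = w*X0 + b
p_full = model.predict(X)[0]        # (2) predict on the whole array
p_row = model.predict(X[:1])[0]     # (3) predict on the first row only

print(p_formula, p_full, p_row)     # typically equal only up to float error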
This is most likely because matrix multiplications and convolutions are implemented in a non-deterministic way: if you change the batch size, you change the order in which the multiply-adds happen, and since floating-point addition is not associative, you get slightly different results.
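A quick NumPy illustration: summing the same float32 values in two different orders, as a batch-size change effectively does, gives slightly different results:

import numpy as np

x = np.random.RandomState(0).randn(10000).astype(np.float32)

s1 = x.sum()                                # one summation order
s2 = x.reshape(100, 100).sum(axis=0).sum()  # a different order
print(s1 - s2)                              # tiny, but usually non-zero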

How to compute a per-class parameter from a mini-batch in TensorFlow?

I am starting to learn TensorFlow and I have a seemingly simple modeling question. Suppose I have a C-class problem and data arrives into TensorFlow in mini-batches containing B samples each. Each sample x is a D-dimensional vector that comes with its label y (non-negative integer between 0 and C-1). I want to estimate a class-specific parameter (for example the sample mean) for each class. The estimation takes place after each sample independently undergoes a TensorFlow-defined transformation pipeline. The per-class parameter/sample-mean is then utilized in the computation of other tensors.
Intuitively, I would group the samples in each mini-batch by label, sum-combine them, and add the total of each label group to the corresponding class parameter, with appropriate normalization.
How can I implement such a simple procedure (group by label, perform a per-group operation, then use the labels as indices for writing into a tensor) or an equivalent one, using TensorFlow? What TensorFlow operations do I need to learn about to achieve this? Is it advisable to do it outside TensorFlow?
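One relevant op is tf.unsorted_segment_sum, which does exactly this grouping. A TF 1.x sketch, assuming B samples of dimension D with integer labels in [0, C):

import tensorflow as tf

num_classes = 10   # C
dim = 64           # D

x = tf.placeholder(tf.float32, [None, dim])  # transformed samples, [B, D]
y = tf.placeholder(tf.int32, [None])         # labels, [B]

# "Group by label and sum-combine": the labels act as indices into the
# output tensor, one row per class.
class_sums = tf.unsorted_segment_sum(x, y, num_segments=num_classes)

# Per-class sample counts for normalization (guard against empty classes).
counts = tf.unsorted_segment_sum(tf.ones_like(y, tf.float32), y, num_classes)
class_means = class_sums / tf.maximum(counts, 1.0)[:, None]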

Can TensorFlow cache (sub-)graph computations?

Can TensorFlow automatically cache computations if they involve multiple calls to the same computation (sub-)graph?
For example, I have a matrix F in which each entry represents a computation based on trainable variables W. My objective function multiplies this matrix several times with different vectors (each time with unchanged W).
Will TensorFlow recompute, for example, F[1,2] whenever I access it, or will it cache that value?
In theory, one could precompute the matrix F given a fixed W, such that each entry in F is a tf.constant. But that would prevent the correct computation of the gradients of W.
TensorFlow performs a limited amount of caching, but it probably doesn't cover the case that you describe.
If you create a tf.Session with the following options, constant folding will be enabled:
config = tf.ConfigProto(graph_options=tf.GraphOptions(
    optimizer_options=tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L2)))
sess = tf.Session(config=config)
When you call sess.run() with this configuration, TensorFlow will determine the nodes it needs to run, identify the subgraph of those nodes whose outputs are constant, evaluate it once, and cache the results, so it avoids re-executing redundant computation.
However, in your question you mention that F is a function of some trainable variables. From TensorFlow's point of view, these variables are volatile—they may change at any time—so it does not cache values that are derived from these variables. If you want to reuse the same value for F multiple times, you could consider storing it in a tf.constant() so that the constant folding optimization is more useful.
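A sketch of that suggestion: evaluate F once for the current W and bake the snapshot into the graph as a constant (which, as noted in the question, severs the gradient path through W):

import tensorflow as tf

W = tf.Variable(tf.random_normal([4, 4]))
F = tf.matmul(W, W, transpose_b=True)     # stand-in for the expensive matrix

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    F_cached = tf.constant(sess.run(F))   # frozen snapshot of F

v = tf.placeholder(tf.float32, [4, 1])
y = tf.matmul(F_cached, v)                # reuses the cached value every run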