What's the difference between a ref edge and a non-ref edge in TensorFlow?

As the title says, I'm wondering about the conceptual difference between a "ref edge" and "non-ref edge" in TensorFlow.
I'm reading the graph partitioning algorithm in TensorFlow.
Here (line 826 of graph_partition.cc) is the comment which mentions
the "non-ref edge":
825 // For a node dst, 'ref_recvs' remembers the recvs introduced by a ref
826 // edge to dst. 'ref_control_inputs' remembers the inputs by a non-ref
827 // edge to dst. We will add a control edge for every pair in
828 // (ref_recvs x ref_control_inputs).
829 std::vector<NodeDef*> ref_recvs;
830 std::vector<string> ref_control_inputs;
Can someone explain the difference more clearly? Thanks very much.

In TensorFlow, most edges are "non-ref" edges, which means that the value flowing along that edge is a constant. If you think of a vertex (operation) in a TensorFlow graph as a function, you can think of a non-ref edge as representing a function argument that is passed by value in a conventional programming language like C or C++. For example, the inputs to and outputs from the operation z = tf.matmul(x, y) are all non-ref edges.
A "ref edge" in TensorFlow allows the value flowing along that edge to be mutated. Continuing the function analogy, a ref edge represents a function argument that is passed by reference (from which we take the name "ref" edge). The most common use of ref edges is in the current internal implementation of tf.Variable: the internal Variable kernel owns a mutable buffer, and outputs a reference to that buffer on a ref edge. Operations such as tf.assign(var, val) expect their var argument to be passed along a ref edge, because they need to mutate the value in var.
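The by-value vs. by-reference distinction can be sketched without TensorFlow at all; here is a minimal plain-Python analogy (the function names matmul_like and assign_like are made up for illustration only):

```python
# Illustration only: a rough analogy for non-ref vs. ref edges,
# not actual TensorFlow code.

def matmul_like(x, y):
    # Non-ref style: inputs are treated as immutable values.
    # The caller's data cannot be changed from inside this function.
    return x * y

def assign_like(var, val):
    # Ref style: 'var' is a mutable buffer shared with the caller,
    # analogous to the ref edge feeding tf.assign(var, val).
    var[0] = val
    return var

x = 3
z = matmul_like(x, 4)   # x is unchanged afterwards

buf = [0.0]             # stands in for a Variable's mutable buffer
assign_like(buf, 7.0)   # mutates the shared buffer in place
```

The important point is that after assign_like runs, every holder of a reference to buf observes the new value, which is exactly the property that makes ref edges hard to optimize.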
The graph partitioning algorithm treats ref edges specially because they correspond to values that could change as the graph executes. Since a non-ref edge is a constant value, TensorFlow can assume that all non-ref edges out of the same operation that cross between two devices can be combined into a single edge, which saves on network/memory bandwidth. Since the value on a ref edge can change (e.g. if a variable is updated in the middle of a step), TensorFlow must be careful not to combine the edges, so that the remote device can see the new value. By analogy with C/C++, the TensorFlow graph partitioner treats a ref edge as representing a volatile variable, for the purposes of optimization.
Finally, as you can tell from the amount of explanation above, ref edges are quite complicated, and there is an ongoing effort to remove them from the TensorFlow execution model. The replacement is "resource-typed edges", which allow non-tensor values to flow along an edge (unifying variables, queues, readers, and other complex objects in TensorFlow), and explicit operations that take a variable resource as input and read its value (as a non-ref output edge). The implementation of the new "resource variables" can be seen here in Python and here in C++.

Related

How to interpret a confusion matrix in yolov5 for single class?

I used the yolov5 network for object detection on my dataset, which has only one class. But I do not understand the confusion matrix. Why is FP one and TN zero?
You can take a look at this issue. In short, a confusion matrix isn't the best metric for object detection, because its entries all depend on the confidence threshold. Even for single-class detection, try to use mean average precision as the main metric.
We usually use a confusion matrix when working with classification tasks. FP (false positive) means the model predicted YES while the ground truth was NO. FN (false negative) means the model predicted NO while the ground truth was YES, and TN (true negative) means the model predicted NO while the ground truth was also NO. In object detection, mAP (mean average precision) is normally used instead. Also, you should upload a picture of the confusion matrix; only then can the community explain why your FP is one and TN is zero.
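For reference, those four counts can be tallied in a few lines of plain Python; confusion_counts below is a hypothetical helper for binary labels, not part of YOLOv5:

```python
def confusion_counts(y_true, y_pred):
    """Tally binary confusion-matrix entries.

    y_true, y_pred: sequences of 0/1 labels (1 = object present).
    Returns (tp, fp, fn, tn).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# One example of each outcome:
counts = confusion_counts([1, 1, 0, 0], [1, 0, 1, 0])  # (1, 1, 1, 1)
```

In single-class detection almost every image location is "background", so the TN cell has no well-defined meaning, which is one reason mAP is preferred.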
The matrix indicates that 100% of the background FPs come from your single class, i.e. detections that do not match any ground-truth label; that is why it shows 1. Background FN covers the ground-truth objects that the model failed to detect, which shows as empty or null.

What is the ImageNet VID policy for evaluating frames with zero objects inside?

I am trying to evaluate my video object detection module, and I am using the ImageNet VID dataset for this purpose. At some point I face the case of evaluating a frame containing zero objects, meaning there are no ground-truth bboxes in this frame (this is fine, since we are talking about video object detection).
Since the module I am using expects at least 1 bbox to be present, I was wondering what the official ImageNet treatment of this case is. The only description I found on the ImageNet site, which is obviously not exhaustive, states:
The evaluation metric is the same as for the objct detection task,
meaning objects which are not annotated will be penalized, as will
duplicate detections (two annotations for the same object instance).
(sic; typo is from the original text)
This does not mention the scenario above. Since it is a brief description, I am not sure it covers every edge case. Normally in single-image object detection this is not an issue, since evaluation samples always contain some object. But in this case, does it mean I should simply ignore those frames altogether?
Also, checking this repository about object detection metrics (which is very thorough, by the way), the no-GT case seems to fall under the general False Positive (FP) scenario. In that case the intersection would be 0 (since no GT bbox exists) and the union would just be the non-zero area of the FP bbox, so IoU = 0.
So, how does the official ImageNet evaluation deal with these cases? I am not interested in what a reasonable choice would be here, just the official version.
I have just looked through the ImageNet VID 2015 evaluation code, which I got from the evaluation kit from UNC.
The evaluation deals with precision and recall, and so needs to calculate TP, FP and FN for every GT box/detection pair or instance. The IoU calculation is used purely to determine whether a valid detection has taken place.
For frames with no GT boxes and no detections: as we are not recording true negatives, these make no difference to the calculation.
For frames with no GT boxes, but some detections: These false positives are captured for each frame on line 231 of eval_vid_detections.m:
if kmax > 0
    tp(j) = 1;              % detection j matched an unclaimed GT box
    gt_detected(kmax) = 1;
else
    fp(j) = 1;              % no matching GT box: false positive
end
For frames with GT boxes, but no detections: These GT boxes are counted when the GT data is first loaded on line 79: num_pos_per_class(c) = num_pos_per_class(c) + 1;. This is later used when calculating the recall on line 266: recall{c}=(tp/num_pos_per_class(c))';
So, if your frame contains no detections and no GT boxes, you can safely ignore it.
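The kit's per-frame bookkeeping can be sketched in Python; this is a simplified reconstruction (the real matching is done per class, with detections sorted by confidence), but it shows why empty frames contribute nothing:

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2); returns intersection-over-union.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def match_frame(gt_boxes, det_boxes, thr=0.5):
    # Greedy matching in the spirit of eval_vid_detections.m: each
    # detection either claims its best unclaimed GT box (TP) or counts
    # as an FP. Unclaimed GT boxes are FNs. A frame with no GT boxes
    # and no detections contributes (0, 0, 0), i.e. nothing.
    claimed = [False] * len(gt_boxes)
    tp = fp = 0
    for d in det_boxes:
        best, kmax = 0.0, -1
        for k, g in enumerate(gt_boxes):
            o = iou(d, g)
            if not claimed[k] and o >= thr and o > best:
                best, kmax = o, k
        if kmax >= 0:
            claimed[kmax] = True
            tp += 1
        else:
            fp += 1
    fn = claimed.count(False)
    return tp, fp, fn
```

For example, match_frame([], []) returns (0, 0, 0), while a detection on an empty frame, match_frame([], [(0, 0, 10, 10)]), returns (0, 1, 0): a pure false positive, as in the MATLAB snippet above.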
As an aside, note that the per instance detection threshold is set like this:
thr = (gt_w.*gt_h)./((gt_w+pixelTolerance).*(gt_h+pixelTolerance));
gt_obj_thr{i} = min(defaultIOUthr,thr);
where pixelTolerance = 10. This gives a bit of a boost to small objects.
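That per-instance threshold is easy to reproduce outside MATLAB. A direct Python transcription, assuming defaultIOUthr is 0.5 (the usual value in the kit):

```python
def vid_instance_thr(gt_w, gt_h, default_iou_thr=0.5, pixel_tolerance=10):
    # Mirrors the MATLAB lines above: shrink the required IoU for boxes
    # whose side lengths are comparable to the 10-pixel tolerance.
    thr = (gt_w * gt_h) / ((gt_w + pixel_tolerance) * (gt_h + pixel_tolerance))
    return min(default_iou_thr, thr)

# A large 500x500 box keeps the default threshold of 0.5,
# while a small 20x20 box only needs IoU of about 0.444 to count as a match.
big = vid_instance_thr(500, 500)
small = vid_instance_thr(20, 20)
```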
What about the case where there is a missing annotation for an object which should have had a GT annotation?
For example, ILSVRC2015_val_00000001/000266.JPEG clearly shows a turtle (in fact, up until 000265.JPEG every frame has a corresponding turtle annotation), but the corresponding annotation file ILSVRC2015_val_00000001/000266.xml contains no annotation.
In my analysis, 4046 of the 176126 frames in the validation dataset have missing GT annotations. Most of these frames have no GT annotation despite containing an object that belongs to one of the 30 categories of ImageNet VID.

What does "embedding" mean in context of low-level tensorflow?

I came across the following sentence on the page https://www.tensorflow.org/tutorials/eager/custom_training:
For Variables representing embeddings TensorFlow will do sparse
updates by default, which are more computation and memory efficient.
and I don't understand it at all. Please explain it to me!
An embedding is a mapping from a discrete domain to a vector of real numbers.
Just imagine a tf.Variable with shape [keys, values], where keys is the size of the discrete domain (e.g. the number of words in a vocabulary) and values is the dimension of the vector that represents each key (usually this vector is a latent representation of the key).
TensorFlow represents the embedding mapping with such a variable, and lets you look up a key to get the corresponding value using tf.nn.embedding_lookup.
Hence, instead of working on the complete embedding matrix (a potentially huge variable), TensorFlow can read and update only the rows for the looked-up keys (sparse access), which is more computation- and memory-efficient.
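The idea can be sketched in plain Python without TensorFlow; embedding_lookup and sparse_update below are made-up stand-ins for tf.nn.embedding_lookup and the optimizer's sparse gradient update:

```python
# Toy embedding table: 5 keys, each mapped to a 3-dimensional vector.
embedding = [[float(i + j) for j in range(3)] for i in range(5)]

def embedding_lookup(table, keys):
    # Analogue of tf.nn.embedding_lookup: gather rows by key.
    return [table[k] for k in keys]

def sparse_update(table, keys, grads, lr=0.1):
    # Sparse update: only the looked-up rows are modified, so the cost
    # is proportional to len(keys), not to the size of the whole table.
    for k, g in zip(keys, grads):
        table[k] = [v - lr * gv for v, gv in zip(table[k], g)]

vecs = embedding_lookup(embedding, [1, 3])
sparse_update(embedding, [1, 3], [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]])
# Rows 1 and 3 moved; all other rows were never touched.
```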

Does TensorFlow's gradient compute derivatives of functions with an unknown dependency on the decision variable?

I would appreciate it if you could answer my questions or provide me with useful resources.
Currently, I am working on a problem where I need to do alternating optimization. Consider two decision variables x and y. In the first step I take the derivative of the loss function w.r.t. x (for fixed y) and update x. In the second step, I need to take the derivative w.r.t. y. The issue is that x depends on y implicitly, and finding a closed form of the cost function that shows the dependency of x on y is not feasible, so the gradients of the cost function w.r.t. y are unknown.
1) My first question is whether the reverse-mode "autodiff" method used in TensorFlow works for these problems, where we do not have an explicit form of the cost function with respect to one variable but still need its derivatives. The value of the cost function is known, but its dependency on the decision variable cannot be written down mathematically.
2) From a general point of view, if I define a node as a "tf.Variable" and have an arbitrary intractable function (intractable to compute by hand) of that variable that evolves through code execution, is it possible to calculate its gradients via "tf.gradients"? If yes, how can I make sure the gradient is computed correctly? Can I check it using TensorBoard?
My model is too complicated, but a simplified form can be described this way: suppose the loss function for my model is L(x). I can code L(x) as a function of "x" during the construction phase in TensorFlow. However, I also have another variable "k" that is initialized to zero. The dependency of L(x) on "k" takes shape as the code runs, so my loss function is actually L(x,k). More importantly, "x" is an implicit function of "k". (All the optimization is done using GradientDescent.) The problem is that I do not have L(x,k) as a closed-form function, but I do have the value of L(x,k) at each step. I can use "numerical" methods like FDSA/SPSA, but they are not exact. I just need to make sure, as you said, that there is a path between "k" and L(x,k), but I do not know how!
TensorFlow gradients only work when the graph connecting x and y (when you're computing dy/dx) has at least one path containing only differentiable operations. In general, if TF gives you a gradient, it is correct (otherwise file a bug, but gradient bugs are rare, since the gradients of all differentiable ops are well tested and the chain rule is fairly easy to apply).
Can you be a little more specific about what your model looks like? You might also want to use eager execution if your forward computation is too weird to express as a fixed dataflow graph.
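Independently of how the graph is built, one way to sanity-check a gradient when the dependency is only implicit is a central finite difference (the FDSA idea mentioned above). A small plain-Python sketch, with a made-up loss in which x is a function of k:

```python
def numerical_grad(f, k, eps=1e-5):
    # Central finite difference: approximates dL/dk even when the
    # dependence of L on k has no closed form, at the cost of exactness.
    return (f(k + eps) - f(k - eps)) / (2 * eps)

# Hypothetical loss where x depends on k implicitly: x = 2*k, L = x**2 + k.
def loss(k):
    x = 2 * k           # stands in for "x is a function of k"
    return x ** 2 + k   # total derivative dL/dk = 8*k + 1

g = numerical_grad(loss, 3.0)   # analytic total derivative at k=3 is 25
```

If an autodiff gradient and this estimate disagree badly, the usual culprit is a non-differentiable or disconnected op on the path from k to the loss.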

Taking the gradient in TensorFlow, tf.gradients

I am using this function of TensorFlow to get the Jacobian of my function. I came across two problems:
The TensorFlow documentation contradicts itself in the following two paragraphs, if I am not mistaken:
gradients() adds ops to the graph to output the partial derivatives of ys with respect to xs. It returns a list of Tensor of length len(xs) where each tensor is the sum(dy/dx) for y in ys.
Returns:
A list of sum(dy/dx) for each x in xs.
According to my test, it does, in fact, return a vector of length len(ys), which is the sum(dy/dx) for each x in xs.
I do not understand why they designed it in a way that the return value is the sum of the columns (or rows, depending on how you define your Jacobian).
How can I really get the Jacobian?
4. In the loss, I need the partial derivative of my function with respect to the input (x). But when I am optimizing with respect to the network weights, I define x as a placeholder whose value is fed later, while the weights are variables. In this case, can I still define the symbolic derivative of the function with respect to the input (x) and put it in the loss? (Later, when we optimize with respect to the weights, this will bring in a second-order derivative of the function.)
I think you are right and there is a typo there, it was probably meant to be "of length len(ys)".
For efficiency. I can't explain exactly the reasoning, but this seems to be a pretty fundamental characteristic of how TensorFlow handles automatic differentiation. See issue #675.
There is no straightforward way to get the Jacobian matrix in TensorFlow. Take a look at this answer and again issue #675. Basically, you need one call to tf.gradients per column/row.
Yes, of course. You can compute whatever gradients you want; there is no real difference between a placeholder and any other operation, really. There are a few operations for which the gradient is not well defined or not implemented (in which case they will generally return 0), but that's all.
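To make the Jacobian point concrete without TensorFlow: the full Jacobian has len(ys) rows and len(xs) columns, and a tf.gradients-style result collapses the ys dimension by summing, which is why one gradient call per output row is needed to recover it. A plain-Python numerical sketch of that relationship:

```python
def numerical_jacobian(f, xs, eps=1e-6):
    # J[i][j] = d f(xs)[i] / d xs[j], by central finite differences.
    ys = f(xs)
    jac = []
    for i in range(len(ys)):
        row = []
        for j in range(len(xs)):
            hi = list(xs); hi[j] += eps
            lo = list(xs); lo[j] -= eps
            row.append((f(hi)[i] - f(lo)[i]) / (2 * eps))
        jac.append(row)
    return jac

def f(xs):
    # Toy function: two outputs of two inputs.
    x0, x1 = xs
    return [x0 * x1, x0 + x1]

jac = numerical_jacobian(f, [2.0, 3.0])
# A tf.gradients-style result is the column-wise sum over the outputs:
summed = [sum(jac[i][j] for i in range(len(jac))) for j in range(len(jac[0]))]
```

Here jac is approximately [[3, 2], [1, 1]], while summed is approximately [4, 3]: a single vector of length len(xs), with the per-output structure lost.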