I'm trying to find the source code for TensorFlow's low-level linear algebra and matrix arithmetic operators for execution on CPU. For example, where is the actual implementation of tf.add() for execution on a CPU? As far as I know, most linear algebra operators are actually implemented by Eigen, but I'd like to know which Eigen functions specifically are being called.
I've tried tracing back from the high-level API, but this is difficult as there are a lot of steps between placing an operator on the graph, and the actual execution of the operator by the TF runtime.
The implementation is hidden behind some template metaprogramming (not unusual for Eigen).
Each operation in TensorFlow is registered at some point. Add is registered here and here.
REGISTER3(BinaryOp, GPU, "Add", functor::add, float, Eigen::half, double);
The actual implementation of operations is based on OpKernel. The Add operation is implemented in BinaryOp::Compute. The class hierarchy is BinaryOp : BinaryOpShared : OpKernel.
In the case of adding two scalars, the entire implementation is just:
functor::BinaryFunctor<Device, Functor, 1>().Right(
eigen_device, out_flat, in0.template flat<Tin>(),
in1.template scalar<Tin>(), error_ptr);
where in0, in1 are the incoming Tensor-Scalars, Device is either GPU or CPU, and Functor is the operation itself. The other lines are just for performing the broadcasting.
Scrolling down in this file and expanding the REGISTER3 macro shows how the arguments are passed from REGISTER3 to functor::BinaryFunctor<Device, Functor, ...>.
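To make the registration-to-functor path easier to picture, here is a rough Python analogy. It is not TensorFlow code: KERNELS, register and binary_kernel are hypothetical names invented for this sketch, and NumPy broadcasting stands in for the scalar/broadcast branches of BinaryOp::Compute.

```python
import operator

import numpy as np

# Hypothetical registry: (op name, device, dtype name) -> kernel callable.
KERNELS = {}


def binary_kernel(functor):
    """Build a kernel that applies `functor` elementwise (stand-in for BinaryOp)."""
    def compute(in0, in1):
        # NumPy broadcasting plays the role of the scalar/broadcast branches.
        return functor(np.asarray(in0), np.asarray(in1))
    return compute


def register(op_name, device, functor, *dtypes):
    """Stand-in for a REGISTER3-style macro: one registry entry per dtype."""
    for dtype in dtypes:
        KERNELS[(op_name, device, np.dtype(dtype).name)] = binary_kernel(functor)


register("Add", "CPU", operator.add, np.float32, np.float16, np.float64)

# Look up the kernel by name, device and dtype, then run it on tensor + scalar.
kernel = KERNELS[("Add", "CPU", "float32")]
result = kernel(np.array([1, 2, 3], dtype=np.float32), np.float32(10))
```

The point of the analogy is only that the operation itself (here operator.add) is a parameter, so one generic kernel body serves every binary op.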
Don't expect to see explicit loops, because Eigen uses expression templates for lazy evaluation and to handle aliasing. The Eigen call is here:
https://github.com/tensorflow/tensorflow/blob/7a0def60d45c1841a4e79a0ddf6aa9d50bf551ac/tensorflow/core/kernels/cwise_ops.h#L693-L696
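As an illustration of why no loop shows up at the call site, here is a toy lazy-expression sketch in Python. It only mimics the idea behind Eigen's expression templates; the classes Expr, Leaf, Add and the evaluate function are made up for this example and do not correspond to Eigen's actual types.

```python
import numpy as np


class Expr:
    """Node of a lazy expression tree; nothing is computed until evaluate()."""

    def __add__(self, other):
        return Add(self, other)


class Leaf(Expr):
    def __init__(self, data):
        self.data = np.asarray(data)

    def coeff(self, i):
        return self.data[i]

    def size(self):
        return self.data.size


class Add(Expr):
    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs

    def coeff(self, i):
        # Each coefficient is computed on demand from the operands.
        return self.lhs.coeff(i) + self.rhs.coeff(i)

    def size(self):
        return self.lhs.size()


def evaluate(expr):
    """The single loop that runs when the expression is finally assigned."""
    return np.array([expr.coeff(i) for i in range(expr.size())])


a, b, c = Leaf([1.0, 2.0]), Leaf([10.0, 20.0]), Leaf([100.0, 200.0])
expr = a + b + c          # builds Add(Add(a, b), c); no arithmetic yet
values = evaluate(expr)   # one pass over the indices, no intermediate arrays
```

In Eigen the tree is built from C++ types at compile time, so the compiler can inline the whole thing into one loop; the Python version only shows the shape of the idea.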
When working with Metal, I find there's a bewildering number of types and it's not always clear to me which type I should be using in which context.
In Apple's Metal Shading Language Specification, there's a pretty clear table of which types are supported within a Metal shader file. However, there's plenty of sample code available that seems to use additional types that are part of SIMD. On the macOS (Objective-C) side of things, the Metal types are not available but the SIMD ones are, and I'm not sure which ones I'm supposed to use.
For example:
In the Metal Spec, there's float2 that is described as a "vector" data type representing two floating components.
On the app side, the following all seem to be used or represented in some capacity:
float2, which is typedef ::simd_float2 float2 in vector_types.h
Noted: "In C or Objective-C, this type is available as simd_float2."
vector_float2, which is typedef simd_float2 vector_float2
Noted: "This type is deprecated; you should use simd_float2 or simd::float2 instead"
simd_float2, which is typedef __attribute__((__ext_vector_type__(2))) float simd_float2
::simd_float2 and simd::float2 ?
A similar situation exists for matrix types:
matrix_float4x4, simd_float4x4, ::simd_float4x4 and float4x4,
Could someone please shed some light on why there are so many typedefs with seemingly overlapping functionality? If you were writing a new application today (2018) in Objective-C / Objective-C++, which type should you use to represent two floating values (x/y) and which type for matrix transforms that can be shared between app code and Metal?
The types with vector_ and matrix_ prefixes have been deprecated in favor of those with the simd_ prefix, so the general guidance (using float4 as an example) would be:
In C code, use the simd_float4 type. (You have to include the prefix unless you provide your own typedef, since C doesn't have namespaces.)
Same for Objective-C.
In C++ code, use the simd::float4 type, which you can shorten to float4 by using namespace simd;.
Same for Objective-C++.
In Metal code, use the float4 type, since float4 is a fundamental type in the Metal Shading Language [1].
In Swift code, use the float4 type, since the simd_ types are typealiased to shorter names.
Update: In Swift 5, float4 and related types have been deprecated in favor of SIMD4<Float> and related types.
These types are all fundamentally equivalent, and all have the same size and alignment characteristics so you can use them across languages. That is, in fact, one of the design goals of the simd framework.
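To make the "same size and alignment" point concrete, here is a small Python sketch that packs bytes for a hypothetical shared struct { float4x4 transform; float2 offset; }. The struct and the name UNIFORMS are invented for illustration, and the layout figures in the comments are my understanding of the simd headers, so treat them as assumptions to verify with sizeof/alignof on your target.

```python
import struct

# Assumed layouts:
#   simd_float2   ->  2 floats,  8 bytes,  8-byte aligned
#   simd_float4x4 -> 16 floats, 64 bytes, 16-byte aligned, column-major
# Hypothetical shared struct: { float4x4 transform; float2 offset; }
# 64 bytes + 8 bytes + 8 bytes of tail padding to reach the 16-byte alignment.
UNIFORMS = struct.Struct("<16f2f8x")

identity = [1.0, 0.0, 0.0, 0.0,
            0.0, 1.0, 0.0, 0.0,
            0.0, 0.0, 1.0, 0.0,
            0.0, 0.0, 0.0, 1.0]
payload = UNIFORMS.pack(*identity, 0.5, -0.25)

# The offset field starts right after the 64-byte matrix.
offset = struct.unpack_from("<2f", payload, 64)
```

Because every language-side name refers to the same layout, a buffer filled through one name can be read through another without conversion.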
I'll leave a discussion of packed types to another day, since you didn't ask.
[1] Metal is an unusual case since it defines float4 in the global namespace, then imports it into the metal namespace, which is also exported as the simd namespace. It additionally aliases float4 as vector_float4. So, you can use any of the above names for this vector type (except simd_float4). Prefer float4.
which type should you use to represent two floating values (x/y)
If you can avoid it, don't use a single SIMD vector to represent a single geometry x,y vector if you're using CPU SIMD.
CPU SIMD works best when you have many of the same thing in each SIMD vector, because they're actually stored in 16-byte or 32-byte vector registers where "vertical" operations between two vectors are cheap (packed add or multiply), but "horizontal" operations can mostly only be done with a shuffle + a vertical operation.
For example a vector of 4 x values and another vector of 4 y values lets you do 4 dot-products or 4 cross-products in parallel with no shuffling, so the overall throughput is significantly more dot-products per clock cycle than if you had a vector of [x1, y1, x2, y2].
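Here is a NumPy sketch of the two layouts. NumPy is only standing in for SIMD registers here, so it shows which operations are elementwise ("vertical") and which need a per-vector sum ("horizontal"); it says nothing about actual speed. The array names are made up.

```python
import numpy as np

# Four 2-D vector pairs stored as structure-of-arrays: one array per component.
ax = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
ay = np.array([0.5, 0.5, 0.5, 0.5], dtype=np.float32)
bx = np.array([2.0, 2.0, 2.0, 2.0], dtype=np.float32)
by = np.array([4.0, 3.0, 2.0, 1.0], dtype=np.float32)

# Four dot products using only elementwise multiplies and one add.
dots_soa = ax * bx + ay * by

# Same data interleaved as [x, y] pairs (array-of-structures).
a_aos = np.stack([ax, ay], axis=1)
b_aos = np.stack([bx, by], axis=1)
# The per-vector sum is the "horizontal" step that needs shuffles in real SIMD.
dots_aos = (a_aos * b_aos).sum(axis=1)
```

Both layouts give the same four results; the difference on real hardware is that the first form maps to two multiplies and an add on whole registers, with no shuffling.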
See https://stackoverflow.com/tags/sse/info, and especially these slides: SIMD at Insomniac Games (GDC 2015) for more about planning your data layout and program design for doing many similar operations in parallel instead of trying to accelerate single operations.
The one exception to this rule is if you're only adding / subtracting to translate coordinates, because that's still purely a vertical operation even with an array-of-structs. And thus fine for CPU short-vector SIMD based on 16-byte vectors. (e.g. the 2nd element in one vector only interacts with the 2nd element in another vector, so no shuffling is needed.)
GPU SIMD is different, and I think has no problem with interleaved data. I'm not a GPU expert.
(I don't use Objective C or Metal, so I can't help you with the details of their type names, just what the underlying CPU hardware is good at. That's basically the same for x86 SSE/AVX, ARM NEON / AArch64 SIMD, or PowerPC Altivec. Horizontal operations are slower.)
I am trying to implement a custom op in TensorFlow that represents a computationally heavy transfer function computed in C++ using Eigen on GPU. I would like to accelerate the computation of the gradient (also in C++ for speed) of the op by re-using some of the intermediate values obtained while computing its output.
In the source code of tensorflow/core/kernels/cwise_ops_gradients.h we see that many functions already do that to some extent by re-using the output of the op to compute its derivative. Here is the example of the sigmoid:
template <typename T>
struct scalar_sigmoid_gradient_op {
  EIGEN_EMPTY_STRUCT_CTOR(scalar_sigmoid_gradient_op)
  EIGEN_DEVICE_FUNC EIGEN_STRONG_INLINE const T
  operator()(const T& output, const T& output_gradient) const {
    // d(sigmoid)/dx expressed through the forward output only.
    return output_gradient * output * (T(1) - output);
  }
};
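For reference, the same trick written in NumPy, with a finite-difference check. This is just a restatement of the functor above for illustration, not TensorFlow code; the function names are mine.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(output, output_gradient):
    """Mirror of scalar_sigmoid_gradient_op: needs only the forward output."""
    return output_gradient * output * (1.0 - output)


x = np.array([-2.0, 0.0, 3.0])
y = sigmoid(x)                          # forward pass
dx = sigmoid_grad(y, np.ones_like(y))   # backward pass reuses y, not x

# Central finite differences as an independent check.
eps = 1e-6
dx_numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
```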
However, I don't see how I can access anything other than the output, for example some other values I stored during the forward pass, to accelerate the computation of my derivative.
I've thought about adding a second output to my op, with all the data required for the derivative, and use it for the computation of the gradient of the actual output, but I've not managed to make it work yet. I'm not sure if it could work in principle.
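To show what I mean by the second-output idea, here is the pattern in plain NumPy. It does not use any TensorFlow API and does not show how the extra output would be wired into the gradient registration; the function y = x * sigmoid(x) and the names forward/backward are placeholders for my real transfer function.

```python
import numpy as np


def forward(x):
    """Return the real output plus an extra 'output' holding an intermediate."""
    s = 1.0 / (1.0 + np.exp(-x))   # stand-in for an expensive intermediate
    y = x * s
    return y, s


def backward(x, s, output_gradient):
    """Gradient of y = x * s(x), reusing s instead of recomputing exp."""
    return output_gradient * (s + x * s * (1.0 - s))


x = np.array([-1.0, 0.0, 2.0])
y, s = forward(x)
dx = backward(x, s, np.ones_like(y))

# Central finite differences on the first output as a sanity check.
eps = 1e-6
dx_numeric = (forward(x + eps)[0] - forward(x - eps)[0]) / (2 * eps)
```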
Another approach I imagined is to manually modify the full graph (forward and backprop) to shortcut an output from the op directly towards its derivative block. I'm not sure how to do it.
Otherwise, there may be a data storage scheme I'm not aware of and that would allow me to store data in the forward pass of an op and retrieve it during gradient computation.
Thank you for your attention, I would greatly appreciate any ideas.
D
As the title says, I'm wondering about the conceptual difference between a "ref edge" and "non-ref edge" in TensorFlow.
I'm reading the graph partitioning algorithm in TensorFlow.
Here (line 826 of graph_partition.cc) is the comment that mentions the "non-ref edge":
825 // For a node dst, 'ref_recvs' remembers the recvs introduced by a ref
826 // edge to dst. 'ref_control_inputs' remembers the inputs by a non-ref
827 // edge to dst. We will add a control edge for every pair in
828 // (ref_recvs x ref_control_inputs).
829 std::vector<NodeDef*> ref_recvs;
830 std::vector<string> ref_control_inputs;
Can someone explain the difference more clearly? Thanks very much.
In TensorFlow, most edges are "non-ref" edges, which means that the value flowing along that edge is a constant. If you think of a vertex (operation) in a TensorFlow graph as a function, you can think of a non-ref edge as representing a function argument that is passed by value in a conventional programming language like C or C++. For example, the inputs to and outputs from the operation z = tf.matmul(x, y) are all non-ref edges.
A "ref edge" in TensorFlow allows the value flowing along that edge to be mutated. Continuing the function analogy, a ref edge represents a function argument that is passed by reference (from which we take the name "ref" edge). The most common use of ref edges is in the current internal implementation of tf.Variable: the internal Variable kernel owns a mutable buffer, and outputs a reference to that buffer on a ref edge. Operations such as tf.assign(var, val) expect their var argument to be passed along a ref edge, because they need to mutate the value in var.
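The by-value versus by-reference analogy can be sketched in a few lines of Python. This is a toy model, not how the TensorFlow runtime is written: the Variable class, read_value, ref and assign below are hypothetical stand-ins that only borrow the names.

```python
import numpy as np


class Variable:
    """Toy stand-in for a variable kernel that owns a mutable buffer."""

    def __init__(self, value):
        self.buffer = np.array(value, dtype=np.float64)

    def read_value(self):
        # "Non-ref" output: consumers receive an immutable snapshot.
        snapshot = self.buffer.copy()
        snapshot.flags.writeable = False
        return snapshot

    def ref(self):
        # "Ref" output: consumers receive the buffer itself.
        return self.buffer


def assign(ref, value):
    """Mutates the buffer in place, as an assign-style op needs to."""
    ref[...] = value
    return ref


var = Variable([1.0, 2.0])
by_value = var.read_value()   # fixed at the moment it was produced
by_ref = var.ref()            # tracks later updates
assign(var.ref(), [5.0, 6.0])
```

After the assignment, by_value still holds the old contents while by_ref shows the new ones, which is the distinction the partitioner has to respect.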
The graph partitioning algorithm treats ref edges specially because they correspond to values that could change as the graph executes. Since a non-ref edge is a constant value, TensorFlow can assume that all non-ref edges out of the same operation that cross between two devices can be combined into a single edge, which saves on network/memory bandwidth. Since the value on a ref edge can change (e.g. if a variable is updated in the middle of a step), TensorFlow must be careful not to combine the edges, so that the remote device can see the new value. By analogy with C/C++, the TensorFlow graph partitioner treats a ref-edge as representing a volatile variable, for the purposes of optimization.
Finally, as you can tell from the amount of explanation above, ref edges are quite complicated, and there is an ongoing effort to remove them from the TensorFlow execution model. The replacement is "resource-typed edges", which allow non-tensor values to flow along an edge (unifying variables, queues, readers, and other complex objects in TensorFlow), and explicit operations that take a variable resource as input and read its value (as a non-ref output edge). The implementation of the new "resource variables" can be seen here in Python and here in C++.
It seems that tf.cond(cond, fn1, fn2) executes possible dependencies for both branches, so any computation we would like to perform if and only if the condition holds has to be put into the function fn1 or fn2.
However I am confused as to what fn actually is. Every variable/op in tensorflow should be a node of the computation graph, but fn is actually a python function. This leads to many questions. For example, is this function re-evaluated every time sess.run is executed? Can this function return different computation graphs each time? Can placeholders be defined in them, and if not how to avoid supplying values to placeholders we know will not be used when, for example, there is a switch variable that chooses between different inputs?
The functions passed to tf.cond are only run when the op is defined, not during graph execution. And both of them are run, exactly once as far as I can see. The functions themselves are just a way to indicate exactly which ops should have the conditional execution behavior: note the context_t.Enter()/context_t.Exit() calls surrounding each function call.
Hopefully that clarifies things. The functions are a useful way of grouping ops during graph definition. There's no function execution magic going on in the TensorFlow graph.
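A toy model may help show the difference between build time and run time. Nothing here is TensorFlow: Graph, toy_cond and the op names are hypothetical, and the sketch only captures the "both functions are called once while the graph is built" behavior described above.

```python
class Graph:
    """Toy graph: just a list of recorded op descriptions."""

    def __init__(self):
        self.ops = []

    def add_op(self, name):
        self.ops.append(name)
        return name


calls = {"fn1": 0, "fn2": 0}


def toy_cond(graph, fn1, fn2):
    """Hypothetical stand-in for tf.cond: calls both functions at build time."""
    true_out = fn1()    # ops created here belong to the "true" branch
    false_out = fn2()   # ops created here belong to the "false" branch
    graph.add_op("merge")

    def run(pred):
        # Execution only selects a branch; fn1/fn2 are never called again.
        return true_out if pred else false_out

    return run


graph = Graph()


def fn1():
    calls["fn1"] += 1
    return graph.add_op("square")


def fn2():
    calls["fn2"] += 1
    return graph.add_op("negate")


run = toy_cond(graph, fn1, fn2)
results = [run(True), run(False), run(True)]
```

However many times run is called, each Python function has executed exactly once, and the graph contains the ops from both branches.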
Is there a GPU-accelerated version of numpy.max(X, axis=None) in Theano?
I looked into the documentation and found theano.tensor.max(X, axis=None), but it is 4-5 times slower than the numpy implementation.
I can assure you, it is not slow because of some bad choice of matrix size. Same matrix under theano.tensor.exp is 40 times faster than its numpy counterpart.
Any suggestions?
The previous answer is partial. Its suggestion should not help, as the workaround is already what the final compiled code uses. There is an optimization that does this transformation automatically.
The title of the question isn't the same as the content. They differ by the axis argument. I'll answer both questions.
If the axis is 0 or None, we support this operation on the GPU for matrices. If the axis is None, we have a basic implementation that isn't well optimized as it is harder to parallelize. If the axis is 0, we have a basic implementation, but it is faster as it is easier to parallelize.
Also, how did you do your timing? If you just make one function with only that operation and test it via the device=gpu flag to do your comparison, this will include the transfer time between CPU and GPU. This is a memory-bound operation, so if you include the transfer in your timing, personally I don't expect any speedup for that case. To see only the GPU operation, use the Theano profiler: run with the Theano flag profile=True.
The max and exp operations are fundamentally different; exp (and other operations like addition, sin, etc.) is an elementwise operation that is embarrassingly parallelizable, while max requires a parallel reduction algorithm that basically builds up a tree of pairwise comparisons over an array. It's not impossible to speed up max, but it's not as easy as exp.
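To show what the tree of pairwise comparisons looks like, here is a small NumPy sketch. It is not Theano's implementation; tree_max is a name I made up, and each np.maximum call stands for one step that a GPU could run in parallel.

```python
import numpy as np


def tree_max(values):
    """Max via rounds of pairwise comparisons; each round is parallelizable."""
    data = np.asarray(values).ravel().copy()
    rounds = 0
    while data.size > 1:
        if data.size % 2:
            # Odd length: duplicate the last item so every element has a partner.
            data = np.append(data, data[-1])
        data = np.maximum(data[0::2], data[1::2])   # one parallel step
        rounds += 1
    return data[0], rounds


x = np.array([3.0, -1.0, 7.5, 2.0, 0.0, 6.0, 4.0, 1.0])
result, rounds = tree_max(x)
```

For n elements this takes about log2(n) dependent rounds, whereas exp needs a single round with no dependencies between elements, which is why it parallelizes so much more easily.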
Anyway, the theano implementation of max basically consists of the following lines (in theano/tensor/basic.py):
try:
    out = max_and_argmax(x, axis)[0]
except Exception:
    out = CAReduce(scal.maximum, axis)(x)
where max_and_argmax is a bunch of custom code that, to my eye, implements a max+argmax operation using numpy, and CAReduce is a generic GPU-accelerated reduction operation used as a fallback (which, according to the comments, doesn't support grad etc.). You could try using the fallback directly and see whether that is faster, maybe something like this:
from theano.tensor.elemwise import CAReduce
from theano.scalar import maximum

def mymax(X, axis=None):
    # Call the generic reduction directly, bypassing max_and_argmax.
    return CAReduce(maximum, axis)(X)