I need to get the fastest possible matmul operation in TF for the case where one of the matrices is lower triangular. cuBLAS and BLAS have trmm functions, but it looks like TensorFlow doesn't benefit from them.
I checked the LinearOperators implementation for the LowerTriangular case, but it is not clear whether it utilizes the BLAS implementation or not.
Can anyone confirm whether the most optimized version is what LinearOperators implements?
Thanks!
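One way to check whether LinearOperators buys anything over a plain dense matmul is to time both paths directly. Below is a minimal benchmarking sketch, assuming TF 2.x eager execution; the matrix size and iteration count are arbitrary, and .numpy() is called to force the result back to the host so the timing covers the actual compute:

    import time
    import tensorflow as tf

    n = 2048  # arbitrary size for illustration
    lower = tf.linalg.band_part(tf.random.normal([n, n]), -1, 0)  # keep lower triangle
    rhs = tf.random.normal([n, n])
    op = tf.linalg.LinearOperatorLowerTriangular(lower)

    def bench(fn, iters=10):
        fn()  # warm-up
        start = time.time()
        for _ in range(iters):
            fn()
        return (time.time() - start) / iters

    print("tf.matmul:      ", bench(lambda: tf.matmul(lower, rhs).numpy()))
    print("LinearOperator: ", bench(lambda: op.matmul(rhs).numpy()))

If both timings come out essentially equal, that is a strong hint the LowerTriangular operator falls back to a general dense product rather than trmm.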
I have a computation that has for loops, calls to TensorFlow matrix algorithms such as tf.lstsq, and TensorFlow iteration with tf.map_fn. I would like to profile it to see how much parallelism I am getting in the tf.map_fn and in the matrix algorithms that get called.
This doesn't seem to be the use case at all for the TensorFlow Profiler, which is organized around the neural network model training loop.
Is there a way to use the TensorFlow Profiler for arbitrary TensorFlow computations, or is the go-to move in this case to use NVIDIA tools like nvprof?
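For what it's worth, recent TF 2.x releases let you start and stop the Profiler programmatically around arbitrary code, independent of any Keras training loop. A small sketch under that assumption; the log directory and the toy map_fn/lstsq computation are arbitrary choices for illustration:

    import tensorflow as tf

    logdir = "/tmp/tf_profile"  # arbitrary path

    tf.profiler.experimental.start(logdir)
    # Any TensorFlow computation can go here, e.g. a map_fn over lstsq solves.
    a = tf.random.normal([64, 32, 8])
    b = tf.random.normal([64, 32, 1])
    x = tf.map_fn(lambda ab: tf.linalg.lstsq(ab[0], ab[1]), (a, b),
                  fn_output_signature=tf.float32)
    tf.profiler.experimental.stop()

The resulting trace can then be inspected in TensorBoard's Profile tab.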
I figured out that the nvprof, nvvp, and nsight tools I was looking for are available as a Conda install of cudatoolkit-dev. Uses are described in this gist.
I have been looking to learn TensorFlow and I have noticed that different functions are used for the same goal. To square a variable, for instance, I have seen tf.square(), tf.math.square() and tf.keras.backend.square(). The same is true for most math operations. Are all these the same, or is there any difference?
Mathematically, they should produce the same result. However, the TensorFlow functions under tensorflow.math are meant for operating on TensorFlow tensors.
For example, when you write a custom loss or metric, the inputs and outputs should be TensorFlow tensors, so that TensorFlow knows how to take gradients of the functions. You can also use the tf.keras.backend.* functions for custom losses and the like.
Try to use the tensorflow.math functions whenever you can; native operations are preferred because they are officially documented and guaranteed to have backward compatibility between TF versions such as TF 1.x and TF 2.x.
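As a quick sanity check (the exact aliasing may vary between TF versions, so treat this as an assumption to verify on yours): tf.square and tf.math.square are typically the same function exported under two names, and tf.keras.backend.square is a thin wrapper that calls the same op:

    import tensorflow as tf
    import numpy as np

    x = tf.constant([1.0, 2.0, 3.0])

    # Often True in TF 2.x: both names point at the same function object.
    print(tf.square is tf.math.square)

    a = tf.square(x)
    b = tf.math.square(x)
    c = tf.keras.backend.square(x)
    print(np.allclose(a, b), np.allclose(b, c))  # True True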
I am trying to implement a custom convolution operation in TensorFlow with C++ and CUDA, and I found that the back-propagation for Conv2D in TensorFlow is implemented via two separate operations. Indeed, there are two operation implementations, namely conv_grad_filter_ops.cc and conv_grad_input_ops.cc, in the TensorFlow source code, which means the gradients for the filter and the input are calculated separately. May I ask what the idea behind this implementation is? Why were they not simply merged into one single operation?
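For context, the Python gradient that TensorFlow registers for Conv2D already reflects this split: it dispatches to one raw op for the input gradient and another for the filter gradient. A rough sketch that mirrors the same split via tf.raw_ops (illustrative only, not the internal C++ code):

    import tensorflow as tf

    @tf.custom_gradient
    def my_conv2d(x, filters):
        strides = [1, 1, 1, 1]
        padding = "SAME"
        y = tf.nn.conv2d(x, filters, strides=strides, padding=padding)

        def grad(dy):
            # Gradient w.r.t. the input and w.r.t. the filter are two separate ops.
            dx = tf.raw_ops.Conv2DBackpropInput(
                input_sizes=tf.shape(x), filter=filters, out_backprop=dy,
                strides=strides, padding=padding)
            dw = tf.raw_ops.Conv2DBackpropFilter(
                input=x, filter_sizes=tf.shape(filters), out_backprop=dy,
                strides=strides, padding=padding)
            return dx, dw

        return y, grad

One practical consequence of the split is that each gradient can be scheduled or skipped independently, e.g. when only one of the two gradients is actually needed.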
Alright, I did a test and found that there's about a 30% speed boost if the back-propagation for the different inputs is split into separate TF ops, compared with wrapping everything into one single TF op. This is counter-intuitive; perhaps it is related to TF's architecture. Note: my test was based on CUDA im2col/col2im with cuBLAS instead of cuDNN.
I'm trying to find where the implementation of an actual Conv2D operation is so I can assess memory access patterns. Tracing things around, it looks like execution of a Conv2D operation enters Eigen with a contract() function call. The problem is, I can't seem to find the definition or declaration of the function in either TensorFlow or Eigen source.
What functions are largely responsible for the execution of a Conv2D operation in TensorFlow? I'd like to see how it is parallelized, what the general memory access pattern is, and how the raw computations are done.
This query is for CPU specifically, as I've already looked into GPU execution to a degree.
After some searching, I found that the actual implementation of CPU Conv2D is in deep_conv2d.cc.
I think the CPU Conv2D is implemented in this file using Eigen conv ops, from line 61 onwards.
contract() returns an abstract expression whose evaluation is implemented in TensorContraction.h. It is essentially a wrapper on top of Eigen's matrix-matrix or matrix-vector products.
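As a rough Python-level illustration (not TF's internal C++/Eigen code path) of why the computation ends up in a matrix contraction: a Conv2D can be rewritten as an im2col-style patch extraction followed by a single matmul, and the contraction Eigen evaluates plays the role of that matmul:

    import tensorflow as tf

    x = tf.random.normal([1, 8, 8, 3])    # NHWC input
    w = tf.random.normal([3, 3, 3, 16])   # HWIO filter

    # im2col-style transform: every 3x3x3 receptive field becomes one row.
    patches = tf.image.extract_patches(
        x, sizes=[1, 3, 3, 1], strides=[1, 1, 1, 1],
        rates=[1, 1, 1, 1], padding="SAME")          # [1, 8, 8, 27]

    y_matmul = tf.reshape(
        tf.matmul(tf.reshape(patches, [-1, 27]), tf.reshape(w, [27, 16])),
        [1, 8, 8, 16])

    y_conv = tf.nn.conv2d(x, w, strides=1, padding="SAME")
    print(tf.reduce_max(tf.abs(y_conv - y_matmul)).numpy())  # ~0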
I am implementing element-wise operations in TensorFlow. Many TensorFlow operations, e.g. add, support broadcasting (from numpy). Broadcasting is possible if the following rule is respected:
When operating on two tensors, their shapes should be compared element-wise. The procedure starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when they are equal, or one of them is 1. If these conditions are not met, an exception is thrown, indicating that the tensors have incompatible shapes. The size of the resulting tensor is the maximum size along each dimension of the input arrays.
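As a language-agnostic illustration of that rule, here is a minimal Python sketch (not TensorFlow's C++ implementation):

    from itertools import zip_longest

    def broadcast_shape(shape_a, shape_b):
        """Return the broadcast shape of two shapes, or raise if incompatible."""
        out = []
        # Compare dimensions element-wise, starting from the trailing ones.
        for a, b in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
            if a != b and a != 1 and b != 1:
                raise ValueError(f"incompatible shapes: {shape_a} vs {shape_b}")
            out.append(max(a, b))
        return tuple(reversed(out))

    print(broadcast_shape((8, 1, 3), (4, 3)))  # (8, 4, 3)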
Does the TensorFlow C++ API provide any method for checking the compatibility of two tensors? Or, what is the fastest way to do that?
All element-wise binary operations' kernel implementations in TensorFlow derive from the BinaryOpShared class, which does the compatibility checking via the helper class BinaryOpState. Perhaps you can simply derive your kernel class from BinaryOpShared and get the compatibility checking for free.