Is there any good reason to transpose a tensor from NHWC to NCHW? - tensorflow

I often see the transpose implementation in tensorflow code. I wonder why one would want to transpose the NHWC tensor to NCHW. Please give me the good example and the reason behind it.

Rather than citing the documentation. You should read into how CUDA works and think about how to implement most operations.
The reason for NCHW generally being faster than NHWC is how the CUDA kernels are written. In CUDA you need to specify what each thread is doing like
const int threads = 32;
dim3 block(threads, threads);
dim3 grid(up2(W / 2, threads), up2(H, threads), B);
kernel<Dtype> <<< grid, block>>> (args ...)
Here you get 3 indices threadId.z, threadId.y, threadId.x. And these threads are organized in warps (hardware design).
And you want to have coalesced memory transaction, which means the threads are ordered in such a way, that the GPU can nicely operate in a fast way.
To sum it up:
You want to have "threadId.x" being the most inner-loop and you should organize the data layout such that it reading them in coalesced way. The ideal data structure should accessible by
b * C * H * W + c * H * W + h * W + w
where lower letters denote the index and capitalized letters denotes the shape (e.g., 0 <= w < W).
In convolution operations (a part of the most used layer) what you are essentially doing is cropping a region in each channel computing a dot production with a region in another channel (from another tensor). So the indices which need to run crazy fast are the height-idx and width-idx. In the end, you are adding along the channel axis (like the convolution formulae suggest). This also explains, why it makes no difference to consider NWHC, NCWH.
This has an impact on how you order the data. And it is the reason you want to have the memory layout I described above.
The worst layout would be:
H, C, B, in threadId.z, threadId.y, threadId.x
The best layout would be:
B, C, H in threadId.z, threadId.y, threadId.x
The same is (mostly) true for GEMM as well (here one matrix should be transpose). There is no source for CuDNN available. But you might be interested in looking into cutlass.

From the performance guide of Tensorflow:
NHWC is the TensorFlow default and NCHW is the optimal format to use
when training on NVIDIA GPUs using cuDNN. [...] The brief history of these two formats is that TensorFlow started by using NHWC because it was a little faster on CPUs. In the long term, we are working on tools to auto rewrite graphs to make switching between the formats transparent and take advantages of micro optimizations where a GPU Op may be faster using NHWC than the normally most efficient NCHW.
Essentially, cuDNN is optimized for NCHW, while CPU-only tensorflow is optimized for NHWC. Switching from one to the other is just a matter of performance maximization and/or unavailability of certain operations in a specific data format.

Related

Explanation of parallel arguments of tf.while_loop in TensorFlow

I want to implement an algorithm which allows a parallel implementation in TensorFlow. My question is what the arguments parallel_iterations, swap_memory and maximum_iterations actually do and which are their appropriate values according the situation. Specifically, in the documentation on TensorFlow's site https://www.tensorflow.org/api_docs/python/tf/while_loop says that parallel_iterations are the number of iterations allowed to run in parallel. Is this number the number of threads? When someone should allow CPU-GPU swap memory and for what reason? What are the advantages and disadvantages from this choice? What is the purpose of maximum_iterations? Can it be combined with parallel_iterations?
swap_memory is used when you want to have extra memory on the GPU device. Usually when you are training a model some activations are saved in the GPU mem. for later use. With swap_memory, you can store those activations in the CPU memory and use the GPU mem. to fit e.g. larger batch sizes. And this is an advantage. You would choose this if you need big batch_sizes or have long sequences and want to avoid OOM exceptions. Disadvantage is computation time since you need to transfer the data from CPU mem. to GPU mem.
The maximum iterations is smth. like this:
while num_iter < 100 and <some condition>:
do something
num_iter += 1
So it is useful when you check a condition, but also want to have an upper bound (one example is to check if your model converges. If it doesn't you still want to end after k iterations.)
As for parallel_iterations I am not sure, but it sounds like multiple threads, yes. You can try and see the effect in a sample script.

More efficient 1d convolution in TensorFlow when filter size is 1

I want a "convolutional" layer where the filter size is 1. I can achieve this using (option 1)
tflearn.layers.conv.conv_1d( input, n_output_channels, 1 )
or roll my own using (option 2)
tf.matmult( input, tf.tile( weights, [batch_size, 1, 1] ) )
where input has dimensions [batch,sequence,n_input_channels] and weights is [1,n_input_channels,n_output_channels].
The performance of these two options seems roughly equivalent, but I would guess both have inefficiencies: option 1 presumably has overhead from expecting a "real" convolutional, and the tile operation seems like it should be unnecessary in option 2. Is there a smarter way I could be doing this?
If you plan on using the GPU, the best is probably to stick to native cuDNN operations, in that case convolutions.
NVIDIA does not provide much details on their implementation, but I would be surprised that they do not have dedicated implementations for common, small sizes used in NN, including kernels with 1x1 spatial range. That most probably applies to other NN-specialized library as well such as Intel MKL-DNN on CPU.
Only when using generic, non-NN convolution libraries, or badly optimized ones, should your question apply. I don't think that is the case of tensorflow or any other major DL libaries and their dependencies out there. (Could be interesting to check in Eigen.)

Is there a GPU accelerated numpy.max(X, axis=0) implementation in Theano?

Do we have a GPU accelerated of version of numpy.max(X, axis=None) in Theano.
I looked into the documentation and found theano.tensor.max(X, axis=None), but it is 4-5 times slower than the numpy implementation.
I can assure you, it is not slow because of some bad choice of matrix size. Same matrix under theano.tensor.exp is 40 times faster than its numpy counterpart.
Any suggestions?
The previous answer is partial. The suggestion should not work, as the work around is the one used in the final compiled code. There is optimization that will do this transformation automatically.
The title of the question isn't the same as the content. They differ by the axis argument. I'll answer both questions.
If the axis is 0 or None we support this on the GPU for that operation for matrix. If the axis is None, we have a basic implementation that isn't well optimized as it is harder to parallelize. If the axis is 0, we have a basic implementation, but it is faster as it is easier to parallelize.
Also, how did you do your timing? If you just make one function with only that operation and test it via the device=gpu flags to do your comparison, this will include the transfer time between CPU and GPU. This is a memory bound operation, so if you include the transfer in your timming, personnaly I don't expect any speed op for that case. To see only the GPU operation, use Theano profiler: run with the Theano flag profile=True.
The max and exp operations are fundamentally different; exp (and other operations like addition, sin, etc.) is an elementwise operation that is embarrassingly parallelizable, while max requires a parallel-processing scan algorithm that basically builds up a tree of pairwise comparisons over an array. It's not impossible to speed up max, but it's not as easy as exp.
Anyway, the theano implementation of max basically consists of the following lines (in theano/tensor/basic.py):
try:
out = max_and_argmax(x, axis)[0]
except Exception:
out = CAReduce(scal.maximum, axis)(x)
where max_and_argmax is a bunch of custom code that, to my eye, implements a max+argmax operation using numpy, and CAReduce is a generic GPU-accelerated scan operation used as a fallback (which, according to the comments, doesn't support grad etc.). You could try using the fallback directly and see whether that is faster, maybe something like this:
from theano.tensor.elemwise import CAReduce
from theano.scalar import maximum
def mymax(X, axis=None):
CAReduce(maximum, axis)(X)

numpy correlation coefficient: np.dot(A, A.T) on large arrays causing seg fault

NOTE:
Speed is not as important as getting a final result.
However, some speed up over worst case is required as well.
I have a large array A:
A.shape=(20000,265) # or possibly larger like 50,000 x 265
I need to compute the correlation coefficients.
np.corrcoeff # internally casts the results as doubles
I just borrowed their code and wrote my own cov/corr not casting into doubles, since I really only need 32 bit floats.And I ditch the conj() since my data are always real.
cov = A.dot(A.T)/n #where A is an array of 32 bit floats
diag = np.diag(cov)
corr = cov / np.sqrt(np.mutliply.outer(d,d))
I still run out of memory and I'm using a large memory machine, 264GB
I've been told, that the fast C libraries, are probably using a routine which breaks the
dot product up into pieces, and to optimize this, the number of elements is padded to a power of 2.
I don't really need to compute the symmetric half of the correlation coefficient matrix.
However, I don't see a way to do this in reasonable amount of time doing it "manually", with python loops.
Does anybody know of a way to ask numpy for a decent dot product routine, that balances memory usage with speed...?
Cheers
UPDATE:
Funny how writing these questions tends to help me find the language for a better google query.
Found this:
http://wiki.scipy.org/PerformanceTips
Not sure that I follow it....so, please comment or provide answers about this solution, your own ideas, or just general commentary on this type of problem.
TIA
EDIT: I apologize because my array is really much bigger than I thought.
array size is actually 151,000 x 265
I''m running out of memory on a machine with 264 GB with at least 230 GB free.
I'm surprised that the numpy call to blas dgemm and being careful with C order arrays
didn't do squat.
Python compiled with intel's mkl will run this with 12GB of memory in about 30 seconds:
>>> A = np.random.rand(50000,265).astype(np.float32)
>>> A.dot(A.T)
array([[ 86.54410553, 64.25226593, 67.24698639, ..., 68.5118103 ,
64.57299805, 66.69223785],
...,
[ 66.69223785, 62.01016235, 67.35866547, ..., 66.66306305,
65.75863647, 86.3017807 ]], dtype=float32)
If you do not have access to in intel's MKL download python anaconda and install the accelerate package which has a trial version for 30 days or free for academics that contains a mkl compile. Various other C++ BLAS libraries should work also- even if it copies the array from C to F it should not take more then ~30GB of memory.
The only thing that I can think of that your installation is trying to do is try to hold the entire 50,000 x 50,000 x 265 array in memory which is quite frankly terrible. For reference a float32 50,000 x 50,000 array is only 10GB, while the aforementioned array is 2.6TB...
If its a gemm issue you can try a chunk gemm formula:
def chunk_gemm(A, B, csize):
out = np.empty((A.shape[0],B.shape[1]), dtype=A.dtype)
for i in xrange(0, A.shape[0], csize):
iend = i+csize
for j in xrange(0, B.shape[1], csize):
jend = j+csize
out[i:iend, j:jend] = np.dot(A[i:iend], B[:,j:jend])
return out
This will be slower, but will hopefully get over your memory issues.
You can try and see if np.einsum works better than dot for your case:
cov = np.einsum('ij,kj->ik', A, A) / n
The internal workings of dot are a little obscure, as it tries to use BLAS optimized routines, which sometimes require copies of arrays to be in Fortran order, not sure if that's the case here. einsum will buffer its inputs, and use vectorized SIMD operations where possible, but outside that it is basically going to run the naive three nested loops to compute the matrix product.
UPDATE: Turns out the dot product completed with out error, but upon careful inspection
the output array consists of zeros at 95,000 to the end, of the 151,000 cols.
That is, out[:,94999] = non-zero but out[:,95000] = 0 for all rows...
This is super annoying...
Another Blas description
The exchange, mentions something that I thought about too...Since blas is fortran, shouldn't
the order of the input be F order...? Where as the scipy doc page below, says C order.
Trying F order caused a segmentation fault. So I'm back to square one.
ORIGINAL POST
I finally tracked down my problem, which was in the details as usual.
I'm using an array of np.float32 which were stored as F order. I can't control the F order to my knowledge, since the data is loaded from images using an imaging library.
import scipy
roi = np.ascontiguousarray( roi )# see roi.flags below
out = scipy.linalg.blas.sgemm(alpha=1.0, a=roi, b=roi, trans_b=True)
This level 3 blas routine does the trick. My problem was two fold:
roi.flags
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
And... i was using blas dgemm NOT sgemm. The 'd' is for 'double' and 's' for 'single'.
See this pdf: BLAS summary pdf
I looked at it once and was overwhelmed...I went back and read the wikipedia article on blas routines to understand level 3 vs other levels: wikipedia article on blas
Now it works on A = 150,000 x 265, performing:
A \dot A.T
Thanks everyone for your thoughts...knowing that it could be done was most important.

What are the downsides of convolution by FFT compared to realspace convolution?

So I am aware that a convolution by FFT has a lower computational complexity than a convolution in real space. But what are the downsides of an FFT convolution?
Does the kernel size always have to match the image size, or are there functions that take care of this, for example in pythons numpy and scipy packages? And what about anti-aliasing effects?
FFT convolutions are based on the convolution theorem, which states that given two functions f and g, if Fd() and Fi() denote the direct and inverse Fourier transform, and * and . convolution and multiplication, then:
f*g = Fi(Fd(d).Fd(g))
To apply this to a signal f and a kernel g, there are some things you need to take care of:
f and g have to be of the same size for the multiplication step to be possible, so you need to zero-pad the kernel (or input, if the kernel is longer than it).
When doing a DFT, which is what FFT does, the resulting frequency domain representation of the function is periodic. This means that, by default, your kernel wraps around the edge when doing the convolution. If you want this, then all is great. But if not, you have to add an extra zero-padding the size of the kernel to avoid it.
Most (all?) FFT packages only work well (performance-wise) with sizes that do not have any large prime factors. Rounding the signal and kernel size up to the next power of two is a common practice that may result in a (very) significant speed-up.
If your signal and kernel sizes are f_l and g_l, doing a straightforward convolution in time domain requires g_l * (f_l - g_l + 1) multiplications and (g_l - 1) * (f_l - g_l + 1) additions.
For the FFT approach, you have to do 3 FFTs of size at least f_l + g_l, as well as f_l + g_l multiplications.
For large sizes of both f and g, the FFT is clearly superior with its n*log(n) complexity. For small kernels, the direct approach may be faster.
scipy.signal has both convolve and fftconvolve methods for you to play around. And fftconvolve handles all the padding described above transparently for you.
While fast convolution has better "big O" complexity than direct form convolution; there are a few drawbacks or caveats. I did some thinking about this topic for an article I wrote a while back.
Better "big O" complexity is not always better. Direct form convolution can be faster than using FFTs for filters smaller than a certain size. The exact size depends on the platform and implementations used. The crossover point is usually in the 10-40 coefficient range.
Latency. Fast convolution is inherently a blockwise algorithm. Queueing up hundreds or thousands of samples at a time before transforming them may be unacceptable for some real-time applications.
Implementation complexity. Direct form is simpler in terms of the memory, code space and in the theoretical background of the writer/maintainer.
On a fixed point DSP platform (not a general purpose CPU): the limited word size considerations of fixed-point FFT make large fixed point FFTs nearly useless. At the other end of the size spectrum, these chips have specialized MAC intstructions that are well designed for performing direct form FIR computation, increasing the range over which te O(N^2) direct form is faster than O(NlogN). These factors tend to create a limited "sweet spot" where fixed point FFTs are useful for Fast Convolution.