What does it mean to say a convolution implementation is based on GEMM (matrix multiply) or on 1x1 kernels? - tensorflow

I have been trying to understand (but miserably failing) how convolutions on images (with height, width, channels) are implemented in software.
I've heard people say their convolution implementation is done using GEMM, or done using "Direct convolution" or done using 1x1 kernels.
I find it very confusing and can't wrap my head around the many different ways it is described. I thought I understood a typical convolution, like PyTorch's conv2d, as a mathematical operation on an image, but what do people mean when they say they do conv2d in one of the following ways?
1x1 kernels or 1x1 convolution (what does kernel even mean here)
GEMM
"direct convolution"
For doing convolution using GEMM, what I understand based on this paper is that the input image and the filters are each converted to 2D matrices using im2col and im2row ops, and the two are then simply matrix-multiplied.
The 3D input image (height, width, input-channels) is converted to a 2D matrix, and the 4D kernel (output-channels, input-channels, kernel-height, kernel-width) is converted to a 2D matrix. Or does "GEMM-based implementation of convolution" mean something else? If that is what it means, how is it different from doing "convolution using 1x1 kernels"?

1x1 kernels or 1x1 convolution (what does kernel even mean here)
With a 3x3 convolution you have a square kernel of 9 elements sliding over the image (with some specified stride, dilation etc.). With a 1x1 convolution the kernel is a single element (typically with stride=1 and no dilation).
So instead of a sliding window with summation, you simply apply a linear projection to each pixel across its channels with this single-element kernel.
It is a cheap operation and is used as the pointwise part of depthwise separable convolutions in many modern architectures to increase or decrease the number of channels.
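A minimal sketch of this, assuming PyTorch (nn.Conv2d with kernel_size=1): the spatial size is untouched and only the channel count changes.

```python
import torch
import torch.nn as nn

# 1x1 convolution: each output pixel is a linear combination of the input
# channels at that same spatial location, so there is no spatial mixing.
x = torch.randn(1, 64, 32, 32)   # (batch, channels, height, width)
pointwise = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=1)
y = pointwise(x)
print(y.shape)                   # torch.Size([1, 128, 32, 32])
```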
GEMM
In the article you provided, it says at the top:
[...] function called GEMM. It’s part of the BLAS (Basic Linear Algebra
Subprograms)
So BLAS is a specification which describes a set of low-level algebraic operations and how they should be performed on a computer.
Now, there are a lot of implementations of BLAS tailored to specific architectures or having traits that are useful in some context. For example there is cuBLAS, which is written and optimized for GPUs (and used heavily by "higher level" deep learning libraries like PyTorch), or Intel's MKL for Intel CPUs (you can read more about BLAS anywhere on the web).
Usually those are written in low-level languages (Fortran, C, Assembly, C++) for maximum performance.
GEMM is the GEneral Matrix Multiply routine, which is used to implement fully connected layers and convolutions and is provided by various BLAS implementations.
It has nothing to do with the deep learning convolution per se; it is a fast matrix multiplication routine (taking into account things like cache hits).
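To connect this back to the question: a GEMM-based convolution first lays every kernel-sized patch of the input out as a column (im2col) and then reduces the whole convolution to one big matrix multiply. A rough numpy sketch, assuming stride 1 and no padding (the helper names here are illustrative, not a real library API):

```python
import numpy as np

def im2col(x, kh, kw):
    """Lay out every kh x kw patch of x (C, H, W) as one column."""
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w))
    idx = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, idx] = x[:, i:i + kh, j:j + kw].ravel()
            idx += 1
    return cols, out_h, out_w

def conv2d_gemm(x, weight):
    """x: (C_in, H, W), weight: (C_out, C_in, kh, kw) -> (C_out, out_h, out_w)."""
    c_out, c_in, kh, kw = weight.shape
    cols, out_h, out_w = im2col(x, kh, kw)   # (C_in*kh*kw, out_h*out_w)
    w_mat = weight.reshape(c_out, -1)        # (C_out, C_in*kh*kw)
    out = w_mat @ cols                       # the GEMM: one big matrix multiply
    return out.reshape(c_out, out_h, out_w)

x = np.random.randn(3, 8, 8)
w = np.random.randn(16, 3, 3, 3)
print(conv2d_gemm(x, w).shape)   # (16, 6, 6)
```

This is different from a 1x1 convolution: the kernel here can be any size, and im2col is only a memory-layout trick so that the actual arithmetic can be handed off to a highly tuned GEMM routine.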
Direct convolutions
It is an approach with O(n^2) complexity: you simply slide the kernel over the input and multiply and sum the items directly. There is a more efficient approach using the Fast Fourier Transform, which is O(n log n). Some info is presented in this answer; questions about this part would be better suited for the math-related Stack Exchanges.
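For reference, a naive direct convolution is just nested loops over output positions, something like this numpy sketch (single channel, stride 1, no padding):

```python
import numpy as np

def direct_conv2d(image, kernel):
    """Naive direct convolution (cross-correlation): slide the kernel and sum."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.random.randn(16, 16)
k = np.random.randn(3, 3)
print(direct_conv2d(img, k).shape)   # (14, 14)
```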

Related

Assign Keras/TF/PyTorch layer to hardware type

Suppose we have the following architecture:
Multiple CNN layers
RNN layer
(Time-distributed) Dense classification layer
We want to train this architecture now. Our fancy GPU is very fast at the CNN layers: although it runs at a lower clock rate, it can perform many convolutions in parallel, hence the speed. Our fancy CPU, however, is faster for the (very long) resulting time series, because the time steps cannot be parallelized and the processing profits from the higher CPU clock rate. So the (supposedly) smart idea for execution would look like this:
Multiple CNN layers (run on GPU)
RNN layer (run on CPU)
(Time-distributed) Dense classification layer (run on GPU/CPU)
This led me to two important questions:
Is it possible, with any of the frameworks mentioned in the title, to distribute certain layers to certain hardware, and how?
If it is possible, would the overhead of the additional memory operations, e.g. transferring between GPU and CPU RAM, render the whole idea useless?
Basically, in Pytorch you can control the device on which variables/parameters reside. AFAIK, it is your responsibility to make sure that for each operation all the arguments reside on the same device: i.e., you cannot conv(x, y) where x is on GPU and y is on CPU.
This is done via PyTorch's .to() method, which moves a module/variable with .to('cpu') or .to('cuda:0').
As Shai mentioned you can control this yourself in pytorch so in theory you can have parts of your model on different devices. You have then to move data between devices in your forward pass.
I think the overhead, as you mentioned, would make the performance worse. The CUDA RNN implementation benefits greatly from running on a GPU anyway :)
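A hedged sketch of what that could look like in PyTorch (the model and shapes here are made up for illustration; the point is the per-module device placement and the explicit .to() moves in forward, assuming a CUDA device is available):

```python
import torch
import torch.nn as nn

class SplitDeviceModel(nn.Module):
    """Toy example: conv stack on the GPU, RNN on the CPU."""
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU()).to('cuda:0')
        self.rnn = nn.GRU(input_size=8 * 28, hidden_size=32, batch_first=True).to('cpu')

    def forward(self, x):                      # x: (batch, 1, 28, 28) on the CPU
        x = self.cnn(x.to('cuda:0'))           # move input to the GPU for the conv layers
        x = x.permute(0, 3, 1, 2).flatten(2)   # treat width as the time dimension
        x = x.to('cpu')                        # move back to the CPU for the RNN
        out, _ = self.rnn(x)
        return out

model = SplitDeviceModel()
y = model(torch.randn(4, 1, 28, 28))
print(y.shape)   # torch.Size([4, 28, 32])
```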

GPU convolution performance in TensorFlow

How are the 2D / 3D convolution operations in TensorFlow implemented, as sparse matrix - dense vector multiplications? How is performance optimised for applying many filters to the same image? How is the performance optimised for applying the same filters for many images? Is there a paper that describes the performance for different ways of implementing the convolutions?
The reason I'm asking is that I have implemented and optimised many 2D and 3D (also 4D) convolvers with CUDA, and I'm wondering whether I should compare their performance to TensorFlow or not.

How to visualize (and understand) transposed convolutions?

I have seen two ways of visualizing transposed convolutions from credible sources, and as far as I can see they conflict.
My question boils down to, for each application of the kernel, do we go from many (e.g. 3x3) elements with input padding to one, or do we go from one element to many (e.g. 3x3)?
Related question: Which version does tf.nn.conv2d_transpose implement?
The sources of my confusion are:
A guide to convolution arithmetic for deep learning has probably the most famous visualization out there, but it isn't peer reviewed (Arxiv).
The second is from Deconvolution and Checkerboard Artifacts, which technically isn't peer reviewed either (Distil), but it is from a much more reputable source.
(The term deconvolution is used in the article, but it is stated that this is the same as transposed conv.)
Due to the nature of this question it is hard to search for results online; e.g. this SO post takes the first position, but I am not sure to what extent I can trust it.
I want to stress a little more what Littleone also mentioned in his last paragraph:
A transposed convolution will reverse the spatial transformation of a regular convolution with the same parameters.
If you perform a regular convolution followed by a transposed convolution and both have the same settings (kernel size, padding, stride), then the input and output will have the same shape. This makes it super easy to build encoder-decoder networks with them. I wrote an article about different types of convolutions in Deep Learning here, where this is also covered.
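A small sketch of that property, assuming PyTorch (note that with stride 2 and an even input size, ConvTranspose2d also needs output_padding=1 to resolve the shape ambiguity):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)
conv = nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1)
deconv = nn.ConvTranspose2d(16, 8, kernel_size=3, stride=2, padding=1, output_padding=1)

down = conv(x)       # torch.Size([1, 16, 16, 16])
up = deconv(down)    # torch.Size([1, 8, 32, 32]) - spatial shape restored
print(down.shape, up.shape)
```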
PS: Please don't call it a deconvolution
Fractionally strided convolutions, deconvolutions and transposed convolutions all mean the same thing. Both papers are correct and you don't need to be doubtful as both of them are cited a lot. But the distil image is from a different perspective, as it's trying to show the artifacts problem.
The first visualisation shows transposed convolutions with stride 2 and padding 1. If it were stride 1, there wouldn't be any padding in between inputs. The padding on the borders depends on the dimension of the output.
By deconvolution, we generally go from a smaller dimension to a higher dimension, and the input data is generally padded to achieve the desired output dimensions. I believe the confusion arises from the padding patterns. Take a look at this formula:
output = (input - 1) * stride + kernel_size - 2 * padding
It's a rearrangement of the general convolution output formula, where output here refers to the output of the deconvolution operation. To best understand deconvolution, I suggest thinking in terms of the equation, i.e. flipping what a convolution does: it's asking, how do I reverse what a convolution operation does?
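For example, with input = 4, stride = 2, kernel_size = 3 and padding = 1, the transposed convolution produces output = (4 - 1) * 2 + 3 - 2 * 1 = 7, and a regular convolution with the same settings maps a 7-wide input back to (7 + 2 * 1 - 3) / 2 + 1 = 4.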
Hope that helps.
Good explanation from Justin Johnson (part of the Stanford cs231n mooc):
https://youtu.be/ByjaPdWXKJ4?t=1221 (starts at 20:21)
He reviews strided conv and then he explains transposed convolutions.

Can tf.contrib.layers.batch_norm calculate 2-D convolutional batch normalization as in the paper?

As described in the original paper of batch normalization, batch normalization on 1-D feature (for example, from a fully connected layer) and that on 2-D feature (for example, from a convolutional layer) are different in a nontrivial way.
The TensorFlow library provides an easy way to batch normalize a 1-D feature, but I'm not sure if it is the same case for a 2-D one. The tool is tf.contrib.layers.batch_norm.
I don't fully understand this method but can we apply this method for 2-D batch normalization?
I saw some people use it on 2-D feature map (with multiple channels): example 1 (link 1, link 2).
You can check the usage of batch_normalization here, or search for the usage of fused_bn after TensorFlow 1.0.
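To illustrate the 2-D (convolutional) case from the paper: the statistics are computed per channel, over the batch and both spatial dimensions, rather than per individual activation. A minimal numpy sketch of that idea (illustrative only, not the tf.contrib implementation, and ignoring the moving averages used at inference time):

```python
import numpy as np

def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """x: (N, H, W, C). One mean/variance (and gamma/beta) per channel."""
    mean = x.mean(axis=(0, 1, 2), keepdims=True)   # reduce over batch and spatial dims
    var = x.var(axis=(0, 1, 2), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta                    # gamma, beta have shape (C,)

x = np.random.randn(8, 14, 14, 32)
y = batch_norm_2d(x, gamma=np.ones(32), beta=np.zeros(32))
print(y.shape)   # (8, 14, 14, 32)
```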

Fast 2D convolution implementation?

I've made a CUDA program for 2D convolution and now want to compare it to some non-CUDA implementation to measure the speedup.
I could compare to my own implementation in plain C using the classical multiple loop approach or matlab's conv2 but it doesn't feel like a legit/fair comparison, since they're not the fastest implementations out there.
Also I was thinking of trying OpenCV and I've been looking for a SIMD optimized version with no luck. Any advice, should I go with OpenCV?
NOTE: I've read other questions, including this one, but the answer is basically the same as my plain C code or a discussion of the various methods available.
The fastest general 2D convolution algorithm is going to FFT the source first, do the correlation in the frequency domain, and then FFT back to get the result (which is what conv2 does in matlab), so your multiple-loop approach probably isn't the best.
The GSL will give you a standard and fast implementation of the FFT if you want to use that.
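A minimal sketch of that FFT route in Python (scipy is used here purely to illustrate the idea; the same approach applies with GSL's FFT in C):

```python
import numpy as np
from scipy import signal

image = np.random.randn(256, 256)
kernel = np.random.randn(15, 15)

# FFT-based convolution: transform, multiply in the frequency domain, transform back.
fast = signal.fftconvolve(image, kernel, mode='same')

# Direct (nested-loop style) convolution for comparison.
slow = signal.convolve2d(image, kernel, mode='same')

print(np.allclose(fast, slow))   # True up to floating point error
```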
Also, if the kernel is separable you may be able to do the convolution as two 1D convolutions.
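And the separable-kernel trick sketched in the same style, using a kernel that factors into the outer product of two 1-D kernels (the kernel values here are just an example):

```python
import numpy as np
from scipy import signal

row = np.array([1.0, 2.0, 1.0])
col = np.array([1.0, 2.0, 1.0])
kernel_2d = np.outer(col, row)          # separable: 2D kernel = outer product of two 1D kernels

image = np.random.randn(128, 128)

# Full 2D convolution vs two cheaper 1D passes (rows then columns).
full = signal.convolve2d(image, kernel_2d, mode='same')
separable = signal.convolve2d(
    signal.convolve2d(image, row[None, :], mode='same'),
    col[:, None], mode='same')

print(np.allclose(full, separable))     # True
```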
OpenCV is great if that works for you too; it is widely accepted as a fast implementation.