I know TensorFlow provides some approaches for dealing with sparse tensors. For example, tf.sparse_tensor_dense_matmul is faster than tf.matmul when one of the operands is a sparse matrix.
In a deep convolutional network, I end up with sparse convolution kernels after training. I want to know how to save the convolution kernels so that TensorFlow knows they are sparse.
I have read some papers proposing that sparse convolution allows more efficient computation than traditional convolution. But tf.nn.conv2d gives no indication that it computes faster with a sparse convolution kernel than with a dense one. How do I obtain the advantages of sparse kernels?
Yes, tf.nn.conv2d does not work with sparse kernels. If you think that sparse convolution will give you a speed benefit and you feel comfortable writing efficient CPU/GPU code, you can write your own op, the way it is described in the TensorFlow docs.
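If you just want to experiment before writing a custom op, one option is to lower the convolution to a matrix multiply yourself and store the kernel as a tf.SparseTensor. Whether this beats the dense tf.nn.conv2d depends heavily on how sparse the kernel actually is. A minimal sketch (the helper name and the combination of tf.image.extract_patches with tf.sparse.sparse_dense_matmul are my own illustration, not a built-in sparse convolution):

```python
import tensorflow as tf

def sparse_kernel_conv2d(images, kernel, stride=1, padding="SAME"):
    """Sketch: conv2d with a mostly-zero kernel via im2col + sparse matmul.

    images: float32 [batch, H, W, in_ch]
    kernel: float32 [kh, kw, in_ch, out_ch], mostly zeros
    """
    kh, kw, in_ch, out_ch = kernel.shape
    # im2col: every output position becomes one row of kh*kw*in_ch values.
    patches = tf.image.extract_patches(
        images,
        sizes=[1, kh, kw, 1],
        strides=[1, stride, stride, 1],
        rates=[1, 1, 1, 1],
        padding=padding,
    )                                             # [batch, oh, ow, kh*kw*in_ch]
    oh, ow = patches.shape[1], patches.shape[2]
    cols = tf.reshape(patches, [-1, kh * kw * in_ch])           # [batch*oh*ow, K]
    # Keep only the non-zero kernel weights as a SparseTensor.
    kernel_2d = tf.reshape(kernel, [kh * kw * in_ch, out_ch])
    sparse_kernel_t = tf.sparse.from_dense(tf.transpose(kernel_2d))   # [out_ch, K]
    # (sparse kernel^T) @ (cols^T) -> [out_ch, batch*oh*ow], then transpose back.
    out = tf.transpose(tf.sparse.sparse_dense_matmul(sparse_kernel_t,
                                                     tf.transpose(cols)))
    return tf.reshape(out, [-1, oh, ow, out_ch])
```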
I have been trying to understand (but miserably failing) how convolutions on images (with height, width, channels) are implemented in software.
I've heard people say their convolution implementation is done using GEMM, or done using "Direct convolution" or done using 1x1 kernels.
I find it very confusing and can't wrap my head around the many different ways it is described. I thought I understood a typical convolution, like PyTorch's conv2d, as a mathematical operation on an image, but what do people mean when they say they do conv2d in one of the following ways?
1x1 kernels or 1x1 convolution (what does kernel even mean here)
GEMM
"direct convolution"
For convolution using GEMM, my understanding based on this paper is that the input image and the filters are each converted to 2D matrices using im2col and im2row ops, and then these two are simply matrix-multiplied.
The 3D input image (height, width, input-channels) is converted to a 2D matrix, and the 4D kernel (output-channels, input-channels, kernel-height, kernel-width) is converted to a 2D matrix. Or does "GEMM-based implementation of convolution" mean something else? If that is what it means, how is it different from doing "convolution using 1x1 kernels"?
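To make my mental model concrete, this is how I picture the im2col + GEMM route (a sketch with arbitrary shapes, using PyTorch's F.unfold as the im2col step):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)    # image: (batch, in_channels, H, W)
w = torch.randn(16, 3, 3, 3)   # kernel: (out_channels, in_channels, kh, kw)

# im2col: every 3x3x3 patch becomes one column -> (batch, in_ch*kh*kw, n_positions)
cols = F.unfold(x, kernel_size=3, padding=1)

# GEMM: (out_ch, in_ch*kh*kw) @ (in_ch*kh*kw, n_positions), batched over images
out = w.view(16, -1) @ cols    # (batch, out_ch, n_positions)
out = out.view(1, 16, 8, 8)    # fold the output positions back into H x W

# Matches the library convolution up to floating-point tolerance
print(torch.allclose(out, F.conv2d(x, w, padding=1), atol=1e-5))
```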
1x1 kernels or 1x1 convolution (what does kernel even mean here)
You can have a 3x3 convolution, where a square containing 9 elements slides over the image (with some specified stride, dilation etc.). With a 1x1 convolution the kernel is a single element (with stride=1 as well and no dilation).
So instead of a sliding window with summation, you simply project each pixel linearly with this single-valued kernel.
It is a cheap operation and is used as part of depthwise separable convolutions in many modern architectures to increase/decrease the number of channels.
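To make this concrete, here is a small sketch (arbitrary shapes) showing that a 1x1 convolution is just a per-pixel linear projection of the channel vector:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)    # (batch, channels, H, W)
w = torch.randn(128, 64, 1, 1)    # 1x1 kernel: it only mixes channels

# A 1x1 convolution...
out_conv = F.conv2d(x, w)

# ...equals projecting every pixel's 64-dim channel vector with a 128x64 matrix.
out_proj = torch.einsum('oc,bchw->bohw', w.view(128, 64), x)

print(torch.allclose(out_conv, out_proj, atol=1e-4))
```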
GEMM
In the article you provided, there is the following near the top:
[...] function called GEMM. It’s part of the BLAS (Basic Linear Algebra
Subprograms)
So BLAS is a specification which describes a set of low-level algebraic operations and how they should be performed on a computer.
Now, there are many implementations of BLAS, tailored to specific architectures or having traits useful in a specific context. For example, there is cuBLAS, which is written and optimized for GPUs (and used heavily by "higher level" deep learning libraries like PyTorch), or Intel's MKL for Intel CPUs (you can read more about BLAS anywhere on the web).
Usually those are written in low-level languages (Fortran, C, Assembly, C++) for maximum performance.
GEMM is the GEneral Matrix Multiplication routine, which is used to implement fully connected layers and convolutions and is provided by various BLAS implementations.
It has nothing to do with deep learning convolution per se; it is a fast matrix multiplication routine (tuned for things like cache hits).
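As a small illustration (using SciPy's BLAS bindings purely as an example of calling a GEMM routine directly; the shapes are arbitrary), a fully connected layer is exactly one GEMM call:

```python
import numpy as np
from scipy.linalg.blas import sgemm   # direct binding to the BLAS GEMM routine

x = np.random.rand(32, 512).astype(np.float32)    # a batch of activations
w = np.random.rand(512, 256).astype(np.float32)   # fully connected weights

# GEMM computes alpha * A @ B (optionally + beta * C); a dense layer is one such call.
y = sgemm(alpha=1.0, a=x, b=w)

print(np.allclose(y, x @ w, atol=1e-3))
```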
Direct convolutions
It is an approach with O(n^2) complexity: you simply multiply the elements with each other directly. There is a more efficient approach using the Fast Fourier Transform, which is O(n*log(n)). Some info is presented in this answer, and further questions about this part would be better suited for the math-related Stack Exchanges.
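A quick way to see the difference between the two approaches (using SciPy only as an example; the sizes are arbitrary):

```python
import numpy as np
from scipy.signal import convolve

img = np.random.rand(64, 64)
kernel = np.random.rand(5, 5)

# Direct: an explicit sliding-window multiply-accumulate.
direct = convolve(img, kernel, mode='same', method='direct')

# FFT-based: pointwise multiplication in the frequency domain, cheaper for large kernels.
via_fft = convolve(img, kernel, mode='same', method='fft')

print(np.allclose(direct, via_fft))
```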
A Neural Algorithm of Artistic Style uses the Gramian matrix of the intermediate feature vectors of the VGG16 classification network trained on ImageNet. Back then, that was probably a good choice because VGG16 was one of the best-performing classification networks. Nowadays, there are much more efficient classification networks that surpass VGG in classification performance while requiring fewer parameters and FLOPS, for example EfficientNet and MobileNetV2.
But when I tried this out in practice, the Gramian matrix of VGG16 features does appear representative of the image style, in that the L2 distance between the Gramian matrices of stylistically similar images is smaller than the L2 distance between those of stylistically unrelated images. For the Gramian matrix calculated from EfficientNet and MobileNetV2 features, that does not appear to be the case: the L2 distance between very similar images and between very dissimilar images only varies by about 5%.
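For reference, this is roughly how I compute the Gramian matrices and the distance (a sketch; gram_matrix and style_distance are my own helper names):

```python
import torch

def gram_matrix(features):
    # features: (batch, channels, H, W) activations from some network layer
    b, c, h, w = features.shape
    f = features.view(b, c, h * w)
    return (f @ f.transpose(1, 2)) / (c * h * w)   # (batch, c, c), normalized

def style_distance(features_a, features_b):
    # L2 distance between the Gramian matrices of two images' features
    return torch.norm(gram_matrix(features_a) - gram_matrix(features_b))
```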
Looking at the network structure, VGG, EfficientNet, and MobileNet all have convolutions with batch normalization and ReLU in between, so the building blocks are the same. So which design decision is unique to VGG that lets its Gramian matrix capture the style, while EfficientNet's and MobileNet's do not?
By now, I have figured it out: the Gramian matrix needs partially correlated features to work correctly. Newer networks are trained with a Dropout regularizer, which reduces the inter-feature correlation.
I would like to perform a convolution with a complex input and complex kernel.
I do not need back propagation, so I don't care whether gradients can be calculated or not.
Is this possible in Tensorflow?
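One common workaround, assuming no native complex convolution is available, is to expand the complex product into four real convolutions; a minimal sketch:

```python
import tensorflow as tf

def complex_conv2d(x, k, strides=1, padding="SAME"):
    # x: complex64 [batch, H, W, in_ch], k: complex64 [kh, kw, in_ch, out_ch]
    xr, xi = tf.math.real(x), tf.math.imag(x)
    kr, ki = tf.math.real(k), tf.math.imag(k)
    conv = lambda a, b: tf.nn.conv2d(a, b, strides=strides, padding=padding)
    # (xr + i*xi) * (kr + i*ki) = (xr*kr - xi*ki) + i*(xr*ki + xi*kr)
    return tf.complex(conv(xr, kr) - conv(xi, ki),
                      conv(xr, ki) + conv(xi, kr))
```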
How are the 2D / 3D convolution operations in TensorFlow implemented? As sparse matrix times dense vector multiplications? How is performance optimized for applying many filters to the same image? How is performance optimized for applying the same filters to many images? Is there a paper that describes the performance of different ways of implementing the convolutions?
The reason I'm asking is that I have implemented and optimized many 2D and 3D (also 4D) convolvers with CUDA, and I'm wondering whether I should compare their performance to TensorFlow's or not.
As described in the original batch normalization paper, batch normalization of 1-D features (for example, from a fully connected layer) and of 2-D features (for example, from a convolutional layer) differ in a nontrivial way.
The TensorFlow library provides an easy way to batch-normalize 1-D features, but I'm not sure whether the same holds for 2-D. The tool is tf.contrib.layers.batch_norm.
I don't fully understand this method, but can it be applied for 2-D batch normalization?
I saw some people use it on 2-D feature maps (with multiple channels): example 1 (link 1, link 2).
You can check the usage of batch_normalization here, or search for the usage of fused_bn in TensorFlow 1.0 and later.
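For a current sketch, assuming TF 2.x and that the Keras layer is acceptable, conv-style batch normalization over a 2-D feature map looks like this (shapes are arbitrary):

```python
import tensorflow as tf

# A 2-D feature map with multiple channels: [batch, height, width, channels]
x = tf.random.normal([8, 16, 16, 32])

# With axis pointing at the channel dimension, the layer normalizes over
# (batch, height, width) per channel, i.e. the conv-style batch norm from the paper.
bn = tf.keras.layers.BatchNormalization(axis=-1)
y = bn(x, training=True)

print(y.shape)   # (8, 16, 16, 32)
```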