Convolving a feature map with itself - tensorflow

I use TensorFlow 1.12
I would like to take a batch of feature maps [B,H,W,C] and convolve each channel with itself.
This is probably possible with tf.map_fn, but I would like to keep the operations as vectorized as possible.
What is the best vectorized way of achieving this?

Each channel is an image. Convolving an image with another image, itself in this case, is most efficiently implemented in the Fourier domain using the convolution theorem. It states that the convolution of two images is the inverse Fourier transform of the element-wise product of their Fourier transforms. Breaking that into steps:
Pad the images with zeros if you need linear rather than circular convolution.
Fourier transform both images.
Multiply the Fourier transforms element-wise.
Inverse Fourier transform the product.
Both images being the same is just a special case of this.
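A minimal sketch of this recipe in TensorFlow 1.x (function and variable names are mine; it computes a circular self-convolution, so zero-pad first if you need the linear one):

```python
import tensorflow as tf  # written against the TF 1.x API

def self_convolve(feature_maps):
    """Circularly convolve each channel of a [B, H, W, C] batch with itself via the FFT."""
    # The FFT ops act on the innermost two dimensions, so move channels ahead of H, W.
    x = tf.transpose(feature_maps, [0, 3, 1, 2])   # [B, C, H, W]
    x = tf.cast(x, tf.complex64)
    spectrum = tf.fft2d(x)                         # Fourier transform of every channel
    product = spectrum * spectrum                  # element-wise product = self-convolution
    result = tf.real(tf.ifft2d(product))           # back to the spatial domain
    return tf.transpose(result, [0, 2, 3, 1])      # [B, H, W, C]
```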

What does it mean to say convolution implementation is based on GEMM (matrix multiply) or it is based on 1x1 kernels?

I have been trying to understand (but miserably failing) how convolutions on images (with height, width, channels) are implemented in software.
I've heard people say their convolution implementation is done using GEMM, or done using "Direct convolution" or done using 1x1 kernels.
I find it very confusing and can't wrap my head around the many different ways it's described. I thought I understood a typical convolution like PyTorch's conv2d as a mathematical operation on an image, but what do people mean when they say they do conv2d in one of the following ways?
1x1 kernels or 1x1 convolution (what does kernel even mean here)
GEMM
"direct convolution"
For doing convolution using GEMM, what I understand based on this paper is that the input image and the filters are each converted to 2-D matrices using im2col and im2row ops, and then these two matrices are simply multiplied.
The 3-D input image (height, width, input-channels) is converted to a 2-D matrix, and the 4-D kernel (output-channels, input-channels, kernel-height, kernel-width) is converted to a 2-D matrix. Or does "GEMM-based implementation of convolution" mean something else? If that is what it means, how is it different from doing "convolution using 1x1 kernels"?
1x1 kernels or 1x1 convolution (what does kernel even mean here)
In a 3x3 convolution you have a square containing 9 elements sliding over the image (with some specified stride, dilation etc.). In a 1x1 convolution the kernel is a single element (with stride 1 and no dilation).
So instead of a sliding window with summation, you simply project each pixel linearly with this single-valued kernel (applied across all input channels).
It is a cheap operation and is used as part of the depthwise separable convolutions found in many modern architectures to increase or decrease the number of channels.
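As a small sketch (plain TF 1.x, with made-up shapes), a 1x1 convolution is exactly this per-pixel linear projection applied at every spatial location, which you can check against an explicit matrix multiplication:

```python
import numpy as np
import tensorflow as tf  # TF 1.x session API

B, H, W, C_in, C_out = 2, 5, 5, 3, 4
x = np.random.rand(B, H, W, C_in).astype(np.float32)
k = np.random.rand(1, 1, C_in, C_out).astype(np.float32)   # a 1x1 kernel

# 1x1 convolution: each pixel's channel vector is projected from C_in to C_out.
conv = tf.nn.conv2d(x, k, strides=[1, 1, 1, 1], padding="SAME")

# The same projection written as one matrix multiplication over all pixels.
proj = tf.reshape(tf.matmul(tf.reshape(x, [-1, C_in]), k[0, 0]), [B, H, W, C_out])

with tf.Session() as sess:
    out_conv, out_proj = sess.run([conv, proj])
print(np.allclose(out_conv, out_proj, atol=1e-5))   # True
```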
GEMM
In the article you provided there is the following at the top:
[...] function called GEMM. It's part of the BLAS (Basic Linear Algebra Subprograms)
So BLAS is a specification which describes a set of low-level algebraic operations and how they should be performed on a computer.
Now, there are many implementations of BLAS tailored to specific architectures or having traits useful in certain contexts. For example, there is cuBLAS, which is written and optimized for GPUs (and used heavily by "higher level" deep learning libraries like PyTorch), or Intel's MKL for Intel CPUs (you can read more about BLAS anywhere on the web).
Usually those are written in low-level languages (Fortran, C, Assembly, C++) for maximum performance.
GEMM is the GEneral Matrix Multiplication routine, which is provided by the various BLAS implementations and is used to implement fully connected layers and convolutions.
It has nothing to do with the deep learning convolution per se; it is a fast matrix multiplication routine (tuned for things like cache hits).
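To make the GEMM mapping concrete, here is a minimal NumPy sketch (shapes and names are mine, not from the paper) that lowers a convolution to im2col followed by a single matrix multiplication:

```python
import numpy as np

def conv2d_via_gemm(image, kernels):
    """Valid cross-correlation (the deep-learning "convolution") of an [H, W, C_in]
    image with [C_out, C_in, kh, kw] kernels, lowered to im2col + one GEMM."""
    H, W, C_in = image.shape
    C_out, _, kh, kw = kernels.shape
    out_h, out_w = H - kh + 1, W - kw + 1

    # im2col: unroll every receptive-field patch into one row of a big matrix.
    cols = np.empty((out_h * out_w, C_in * kh * kw), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw, :].transpose(2, 0, 1)   # [C_in, kh, kw]
            cols[i * out_w + j] = patch.ravel()

    # Each kernel flattens to one row; the whole convolution is then one GEMM.
    kernel_matrix = kernels.reshape(C_out, -1)                        # [C_out, C_in*kh*kw]
    out = cols @ kernel_matrix.T                                      # [out_h*out_w, C_out]
    return out.reshape(out_h, out_w, C_out)
```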
Direct convolutions
It is the naive approach with O(n^2) complexity: you slide the kernel and multiply-accumulate the items directly. There is a more efficient approach using the Fast Fourier Transform, which is O(n log n). Some info is presented in this answer, and questions about this part would be better suited for the math-related Stack Exchange sites.
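For completeness, a direct convolution is just the explicit sliding-window loop; a small single-channel NumPy sketch (names are mine) looks like this:

```python
import numpy as np

def direct_conv2d(image, kernel):
    """Naive "direct" valid cross-correlation of a single-channel image:
    explicit loops with a multiply-accumulate at every output position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
```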

Tensorflow Object Detection: Resizing Images and Padding

I am trying to build a TensorFlow object detector with a Single Shot Multibox Detector (SSD) with MobileNet. My dataset consists of images larger than 300x300 pixels (e.g. 1280x1080). I know that TensorFlow object detection reduces the images to 300x300 in the training process; what I am interested in is the following:
Does it have a positive or negative influence on the later object detection if I reduce the pictures to 300x300 pixels with padding before training? Without padding I don't think it has any negative effects, but with padding I'm not sure whether there are effects I'm overlooking.
Thanks a lot in advance!
I don't know SSD, but CNNs generally use convolutional layers as feature extractors, stacked upon one another with different kernel sizes representing different feature sizes, i.e. using spatial correlation to their advantage. If you use padding, the padding will thus be incorporated into the extracted features, possibly corrupting your results.
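If you do want to pre-resize to 300x300 while keeping the aspect ratio via padding, a hedged TF 1.x sketch (assuming tf.image.resize_image_with_pad is available in your version) would be:

```python
import tensorflow as tf  # TF 1.x; check that resize_image_with_pad exists in your version

def to_300x300_with_pad(image):
    """Resize an [H, W, 3] image to 300x300, preserving aspect ratio and zero-padding the rest."""
    return tf.image.resize_image_with_pad(image, target_height=300, target_width=300)

def to_300x300_stretch(image):
    """Plain bilinear resize to 300x300 (no padding, but the aspect ratio is distorted)."""
    return tf.image.resize_images(image, [300, 300])
```

Whether the padded or the stretched variant works better in practice is best settled empirically on your own validation set.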

Is Capsule Network really rotationally invariant in practice?

Capsule networks are said to perform well under rotation. Is that true in practice?
I trained a capsule network on (train-dataset) and got ~100% train accuracy.
I tested the network on (test-dataset-original) and got ~99% test accuracy.
I rotated (test-dataset-original) by 0.5 degrees (test-dataset-rotate0p5) and
by 1 degree (test-dataset-rotate1), and got a test accuracy of just ~10%.
I used the network from this repo as a seed: https://github.com/naturomics/CapsNet-Tensorflow
10% accuracy on rotated test data is not acceptable at all; perhaps something is not implemented correctly.
We implemented CapsNet on some non-English digit datasets (similar to MNIST) and the results were unbelievably good.
The model was invariant not only to rotation but also to other transforms such as pan, zoom, perspective, etc.
The first layer of a capsule network is a normal convolution. The filters here are not rotation invariant; only the output feature maps have a pose matrix applied to them by the primary capsule layer.
I think this is why you also need to show the CapsNet rotated images, though many fewer than for normal convnets.
Capsule networks encapsulate vectors or 4x4 matrices in a neural network. However, matrices can be used for many things, rotations being just one of them. There is no way the network can know that you want to use the encapsulated representation for rotations unless you specifically show it rotated examples, so that it can learn to use the representation that way.
Capsule networks came into existence to solve the viewpoint-variance problem of convolutional neural networks (CNNs). CapsNet is said to be viewpoint invariant, which includes rotational and translational invariance.
CNNs achieve translational invariance by using max-pooling, but that results in information loss in the receptive field. As the network goes deeper, the receptive field also grows, so max-pooling in deeper layers causes more information loss. Spatial information is lost and only local information is learned by the network; CNNs fail to learn the bigger picture of the input.
The weights W_ij (between the primary and secondary capsule layers) are learned by backpropagation to encode the affine transformation of the entity represented by the i-th capsule in the primary layer, producing a prediction vector u_j|i. So basically this W_ij is responsible for learning rotational transformations for a given entity.
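For reference, the step this answer describes is the prediction-and-routing computation from the original CapsNet paper (Sabour et al., "Dynamic Routing Between Capsules"), where W_ij produces the prediction vector and the coupling coefficients c_ij come from dynamic routing:

```latex
\hat{\mathbf{u}}_{j|i} = \mathbf{W}_{ij}\,\mathbf{u}_i, \qquad
\mathbf{s}_j = \sum_i c_{ij}\,\hat{\mathbf{u}}_{j|i}, \qquad
\mathbf{v}_j = \frac{\lVert \mathbf{s}_j \rVert^2}{1 + \lVert \mathbf{s}_j \rVert^2}
               \frac{\mathbf{s}_j}{\lVert \mathbf{s}_j \rVert}
```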

Can tf.contrib.layers.batch_norm calculate 2-D convolutional batch normalization as in the paper?

As described in the original batch normalization paper, batch normalization of a 1-D feature (for example, from a fully connected layer) and of a 2-D feature map (for example, from a convolutional layer) differ in a nontrivial way.
The TensorFlow library provides an easy way to batch normalize a 1-D feature, but I'm not sure whether the same holds for 2-D. The tool is tf.contrib.layers.batch_norm.
I don't fully understand this function, but can we apply it for 2-D batch normalization?
I saw some people use it on 2-D feature maps (with multiple channels): example 1 (link 1, link 2).
Yes, it handles the convolutional case: for a 4-D input it normalizes over the batch, height and width dimensions and keeps per-channel statistics, as in the paper. You can check the usage of batch_normalization here, or search for the usage of fused_bn after TensorFlow 1.0.
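A hedged sketch of calling it on a convolutional feature map (TF 1.x; the placeholder shapes are made up) looks like this; each of the C channels gets its own mean, variance, gamma and beta:

```python
import tensorflow as tf  # TF 1.x, tf.contrib API

x = tf.placeholder(tf.float32, [None, 32, 32, 64])   # [N, H, W, C] conv feature map
is_training = tf.placeholder(tf.bool, [])

# For NHWC input the moments are taken over batch, height and width,
# so this is the 2-D (convolutional) batch norm from the paper.
y = tf.contrib.layers.batch_norm(
    x,
    center=True,               # learn beta
    scale=True,                # learn gamma
    is_training=is_training,
    updates_collections=None,  # update moving averages in place
)
```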

2D image decomposition

I have a matrix and I want to decompose it into different matrices ranging from low- to high-frequency content. As I have noticed, this can be done using the wavelet transform. I found something like the figure below for a 1-D signal, and I want to do a similar procedure for my 2-D matrix using MATLAB: decompose it into matrices with low- to high-frequency components at different levels.
I used the toolbox; however, I have problems with extracting the data.
How can I do this using MATLAB?
You are looking for the wavedec2 function.
There's a basic example with the function documentation here.