Understanding 2D convolution output size - tensorflow

I am a beginner in convolutional deep learning. I saw the following architecture in the paper Simultaneous Feature Learning and Hash Coding with Deep Neural Networks, for images of size 256*256.
I do not understand the output size of the first 2D convolution: 96*54*54. The 96 seems fine, as the number of filters is 96. But if we apply the usual formula for the output size, we get size = floor((W−K+2P)/S) + 1 = floor((256 − 11 + 2*0)/4) + 1 = floor(61.25) + 1 = 62. I have assumed the padding P to be 0, as it is not mentioned anywhere in the paper. The Keras Conv2D API produces the same 96*62*62 output size. Then why does the paper report 96*54*54? What am I missing?

This reminds me of the AlexNet paper, which contains a similar mistake. Your calculation is correct. I think the authors mistakenly wrote 256x256 instead of 224x224, in which case the calculation for the first layer is
(224-11+2*0)/4 + 1 = 54.25 ~ 54
The other, less likely option is that 256x256 was the real input size but the authors did the calculations for 224x224. I think that would be a very silly mistake, so I don't consider it a realistic option.
Thus, I believe the true input size was 224x224 instead of 256x256.
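To check the two candidate input sizes quickly, here is a minimal sketch of the output-size formula from the question. The function name conv_output_size is made up for illustration, and zero padding is assumed, as in the question.

```python
def conv_output_size(w, k, p=0, s=1):
    """Spatial output size of a convolution: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1


# First layer from the question: 11x11 kernel, stride 4, no padding (assumed).
size_256 = conv_output_size(256, k=11, p=0, s=4)
size_224 = conv_output_size(224, k=11, p=0, s=4)

print(size_256)  # 62, which matches what Keras reports
print(size_224)  # 54, which matches the paper
```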

Related

Implement CVAE for a single image

I have a multi-dimensional, hyper-spectral image (channels, width, height = 15, 2500, 2500). I want to compress its 15 channels into 5 channels, so the output would be (channels, width, height = 5, 2500, 2500). One simple way to do this is to apply PCA. However, the performance is not so good. Thus, I want to use a Variational AutoEncoder (VAE).
When I looked at the available solutions in the TensorFlow and Keras libraries, the examples cluster whole images using a Convolutional Variational AutoEncoder (CVAE).
https://www.tensorflow.org/tutorials/generative/cvae
https://keras.io/examples/generative/vae/
However, I have a single image. What is the best practice for implementing a CVAE in this case? Is it to generate sample images with a moving-window approach?
One way of doing it would be to have a CVAE that takes as input (and output) values of all the spectral features for each of the spatial coordinates (the stacks circled in red in the picture). So, in the case of your image, you would have 2500*2500 = 6250000 input data samples, which are all vectors of length 15. And then the dimension of the middle layer would be a vector of length 5. And, instead of 2D convolutions that are normally used along the spatial domain of images, in this case it would make sense to use 1D convolution over the spectral domain (since the values of neighbouring wavelengths are also correlated). But I think using only fully-connected layers would also make sense.
As a disclaimer, I haven't seen CVAEs used in this way before, but this way you would also get many data samples, which is needed for the learning to generalise well.
Another option would indeed be what you suggested: generate the samples (patches) using a moving window (maybe with a stride of half the patch size). Even though you wouldn't necessarily get enough data samples for the CVAE to generalise really well to all HSI images, I guess it doesn't matter if it overfits, since you want to use it on that same image.
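Both ways of building a training set can be sketched with NumPy. This is only an illustration: the helper names are hypothetical, and a small random 15x8x8 array stands in for the real 15x2500x2500 image. It covers data preparation only, not the CVAE itself.

```python
import numpy as np


def pixels_as_samples(image):
    """Reshape (channels, height, width) into (height * width, channels)."""
    channels = image.shape[0]
    return image.reshape(channels, -1).T


def extract_patches(image, patch, stride):
    """Collect (channels, patch, patch) windows with a moving window."""
    _, height, width = image.shape
    patches = [
        image[:, top:top + patch, left:left + patch]
        for top in range(0, height - patch + 1, stride)
        for left in range(0, width - patch + 1, stride)
    ]
    return np.stack(patches)


# Small stand-in for the (15, 2500, 2500) image, to keep the example light.
image = np.random.rand(15, 8, 8).astype(np.float32)

# Option 1: one sample per pixel, each a spectrum of length 15.
spectra = pixels_as_samples(image)

# Option 2: overlapping patches, stride equal to half the patch size.
patches = extract_patches(image, patch=4, stride=2)

print(spectra.shape)  # (64, 15)
print(patches.shape)  # (9, 15, 4, 4)
```

With the real image, option 1 would give 6250000 samples of length 15, as described above.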

Smaller output stride and bigger atrous rates produce larger heatmaps

I am using DeepLabv3+ and running some tests. For my first run I used output_stride=16 and atrous_rates=[6, 12, 18], and in the second run I used output_stride=8 and atrous_rates=[12, 24, 36]. Then I used TensorBoard to look at the results, and I noticed that the heatmaps look larger: one "unit" is 4x bigger than in the run with output_stride=16.
[Image: heatmaps with output_stride=16]
[Image: heatmaps with output_stride=8]
I would like to know the reason behind this behaviour and its consequences for my mIOU metric.
regards
According to the paper Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation (3.1 DeeplabV3+ as an encoder), output_stride simply means the ratio between the input image size and the output feature map size (before global pooling). So changing output_stride will change the output.
(Just copied from the link.)
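To make the ratio concrete, here is a small sketch. It assumes a 512x512 input (not stated in the question) that divides evenly by the stride, and the helper name feature_map_size is made up for illustration.

```python
def feature_map_size(input_size, output_stride):
    """Nominal feature-map side length: input size divided by output_stride."""
    return input_size // output_stride


input_size = 512  # assumed for illustration; not stated in the question

side_16 = feature_map_size(input_size, 16)
side_8 = feature_map_size(input_size, 8)

# Halving output_stride doubles each side, so the map has 4x as many cells.
cell_ratio = (side_8 * side_8) / (side_16 * side_16)

print(side_16, side_8, cell_ratio)  # 32 64 4.0
```

So the two runs produce feature maps whose cell counts differ by a factor of 4, which is the same factor mentioned in the question.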

Input feature to Feature maps

Can anybody please explain this basic thing to me: how does a 192x28x28 input get reduced to 16x28x28 feature maps using a 1x1 conv mapping? My question is about understanding what exactly happens when 192 goes to 16.
I know about ((I+2P-F)/S)+1, but what happens in the process of reducing the depth?
A single 1x1 convolution kernel compresses the whole 192*28*28 input (which can be read as 192 feature maps of 28*28 pixels) into a single 1*28*28 map: at each pixel it takes a weighted sum of the 192 channel values. So one kernel reduces the depth along the feature-map axis to 1 while preserving the height and width of the input.
But then, why do you get 16? A convolutional layer can have several kernels. Each kernel is an independent filter of the same size. In your case the 1x1 conv layer evidently has 16 kernels, hence you get 16 maps of 28*28 (one per kernel).
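The same thing can be written out in NumPy. This is a sketch with random weights and no bias term, not the code of any particular framework: each of the 16 kernels is a vector of 192 weights applied at every pixel.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal((192, 28, 28))    # input: 192 feature maps of 28x28
kernels = rng.standard_normal((16, 192))  # 16 kernels, each 1x1 with depth 192

# At every pixel, each kernel takes a weighted sum over the 192 input channels.
y = np.einsum("oc,chw->ohw", kernels, x)

print(y.shape)  # (16, 28, 28)

# One output value is just a dot product over the channel axis.
pixel_check = np.allclose(y[3, 5, 7], kernels[3] @ x[:, 5, 7])
```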

Avoiding exhausting GPU resources in convNN Tensorflow

I'm trying to run a hyperparameter optimization script, for a convNN using Tensorflow.
As you may know, TF's handling of GPU memory isn't that fancy (I don't think it ever will be, thanks to the TPU). So my question is: how do I choose the filter dimensions and the batch size so that the GPU memory doesn't get exhausted?
Here's the equation that I'm thinking of:
image_shape = 128x128x3 (3 color channels)
batchSize = 20 (the smallest possible batch size, since I have 20 classes)
filter_shape = fw_fh_fd [filter_width=4, filter_height=4, filter_depth=32]
As far as I understand, using the tf.nn.conv2d function will need the following amount of memory:
image_width * image_height * numberOfChannels * batchSize * filter_height * filter_width * filter_depth * 32bit
since we use the tf.float32 type for each pixel.
In the given example, the needed memory would be:
128x128x3x20x4x4x32x32 = 16106127360 (bits), which is about 16 gigabits, i.e. roughly 2GB of memory.
I'm not sure the formula is correct, so I hope to get a validation or a correction of what I'm missing.
Actually, this will take only about 44MB of memory, mostly taken by the output.
Your input is 20x128x128x3
The convolution kernel is 4x4x3x32
The output is 20x128x128x32
When you sum up the total, you get
(20*128*128*3 + 4*4*3*32 + 20*128*128*32) * 4 / 1024**2 ≈ 44MB
(In the above, 4 is for the size in bytes of float32 and 1024**2 is to get the result in MB).
Your batch size can be smaller than your number of classes. Think about ImageNet and its 1000 classes: people are training with batch sizes 10 times smaller.
EDIT
Here is a tensorboard screenshot of the net. It reports 40MB rather than 44MB, probably because it excludes the input, and it also shows all the tensor sizes I mentioned earlier.
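The arithmetic above can be reproduced with a few lines of Python. This sketch counts only the three float32 tensors listed in the answer (input, kernel, output). The helper name tensor_mb is made up, and real usage during training will be higher, since gradients and other buffers also need memory.

```python
import math

BYTES_PER_FLOAT32 = 4


def tensor_mb(*shape):
    """Size in MB of a float32 tensor with the given shape."""
    return math.prod(shape) * BYTES_PER_FLOAT32 / 1024**2


input_mb = tensor_mb(20, 128, 128, 3)    # batch of input images
kernel_mb = tensor_mb(4, 4, 3, 32)       # convolution kernel
output_mb = tensor_mb(20, 128, 128, 32)  # output feature maps
total_mb = input_mb + kernel_mb + output_mb

print(round(output_mb, 2))  # 40.0
print(round(total_mb, 2))   # 43.76
```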

Tensorflow limiting batch size when learning embeddings

I'm trying to learn the state embeddings for a sequence of states produced by a HMM, similar to how the tensorflow Vector Representation of Words does this for text sequences.
My issue is that the "vocabulary" of this HMM is only 12 different states. Tensorflow doesn't seem to like it when I run my code using batches larger than the size of this vocabulary. For example, attempting to train it with a batch size of 14 gives the error:
F tensorflow/core/kernels/range_sampler.cc:86] Check failed: batch_size + avoided_values.size() <= range_ (14 vs. 12)
Abort trap: 6
What is the motivation behind this check?
If you are following the example from the tutorial, this error actually occurs when you set num_sampled > len(vocabulary):
num_sampled = 64 # Number of negative examples to sample.
You indeed cannot sample indices (for the negative examples in word2vec) beyond the vocabulary size.
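As an analogy only (this is plain Python, not the TensorFlow sampler), drawing distinct indices without replacement hits the same limit. The helper name sample_unique_negatives is hypothetical.

```python
import random

vocabulary_size = 12  # the 12 HMM states from the question


def sample_unique_negatives(num_sampled, vocabulary_size):
    """Draw distinct indices; fails if more are requested than exist."""
    return random.sample(range(vocabulary_size), num_sampled)


ok = sample_unique_negatives(8, vocabulary_size)  # 8 distinct ids in [0, 12)

try:
    sample_unique_negatives(14, vocabulary_size)
    failed = False
except ValueError:
    failed = True  # cannot draw 14 distinct values from only 12

print(len(set(ok)), failed)  # 8 True
```

With a 12-state vocabulary, keeping num_sampled below 12 avoids the problem in this analogy.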