Properly concatenate feature maps in TensorFlow

I am attempting to reproduce a convolutional neural network from a research paper using TensorFlow.
There are many places in the diagram where the results of convolutions are concatenated. Currently I am using tf.concat along the last axis (the channel axis) to concatenate these feature maps. I originally believed that I would want to concatenate along all axes, but this does not seem to be an option in TensorFlow. Now I am facing a problem: the paper indicates that tensors (feature maps) of different sizes should be concatenated. tf.concat does not support concatenating tensors of different sizes, so I am wondering whether it was the correct function to use in the first place. In summary, what is the correct way to concatenate feature maps (sometimes of different sizes) in TensorFlow?
Thank you.

It is impossible, and meaningless, to concatenate feature maps with different spatial sizes.
If you want to concatenate two tensors, every dimension except the one you concatenate along must be equal.
In fact, the image you posted shows that every feature map that gets concatenated has the same spatial extent as the others (only the depth differs).
If you cannot concatenate that way, there is probably something wrong in your code, most likely convolutions that use padding="VALID" where padding="SAME" is needed to preserve the spatial size.
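The shape rule above can be checked with a minimal sketch. It uses NumPy as a stand-in for tf.concat, which applies the same constraint; the NHWC layout and the array sizes are assumptions chosen for illustration.

```python
import numpy as np

# Two hypothetical feature maps in NHWC layout: same batch and spatial size,
# different channel depth.
a = np.zeros((1, 28, 28, 64))
b = np.zeros((1, 28, 28, 32))

# Concatenating along the channel axis works because all other axes match.
merged = np.concatenate([a, b], axis=-1)
print(merged.shape)  # (1, 28, 28, 96)

# A map with a different spatial size cannot be joined along the channels.
c = np.zeros((1, 26, 26, 32))
try:
    np.concatenate([a, c], axis=-1)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)  # True
```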

The problem you encounter with the Inception network can be resolved by using padding in the convolutional layers to keep the spatial size the same. In the Inception blocks, change the padding from "VALID" to "SAME". You can then concatenate the outputs without any resizing.
Alternatively, you can pad the feature maps that are going to be concatenated, using tf.pad().
If you prefer not to do that, you can use tf.image.resize_images (tf.image.resize in TensorFlow 2) to resize them to the same spatial size. However, this is a dirty and computationally expensive approach.
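A rough sketch of the first two options, assuming a 28x28 input and NHWC layout. The helper conv_output_size is hypothetical and only reproduces the output-size arithmetic of "SAME" and "VALID" padding. np.pad and np.concatenate stand in for tf.pad and tf.concat, which take paddings and axes in the same form.

```python
import math
import numpy as np

def conv_output_size(size, kernel, stride=1, padding="SAME"):
    """Spatial output size of a convolution under TensorFlow's padding rules."""
    if padding == "SAME":
        return math.ceil(size / stride)
    if padding == "VALID":
        return math.ceil((size - kernel + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")

# Parallel 1x1, 3x3 and 5x5 branches on a hypothetical 28x28 input.
valid_sizes = [conv_output_size(28, k, padding="VALID") for k in (1, 3, 5)]
same_sizes = [conv_output_size(28, k, padding="SAME") for k in (1, 3, 5)]
print(valid_sizes)  # [28, 26, 24]
print(same_sizes)   # [28, 28, 28]

# Zero-pad a 26x26 map by one pixel per side so it matches the 28x28 map.
big = np.ones((1, 28, 28, 64))
small = np.ones((1, 26, 26, 32))
padded = np.pad(small, [(0, 0), (1, 1), (1, 1), (0, 0)], mode="constant")
merged = np.concatenate([big, padded], axis=-1)
print(merged.shape)  # (1, 28, 28, 96)
```

With "VALID" padding the three branches end up with different spatial sizes, which is what makes the concatenation fail. With "SAME" they agree and no manual padding is needed.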

Tensors can only be concatenated along one axis. If you need to concatenate feature maps of different sizes, you must somehow manipulate the sizes of the original tensors.


Use of embeddings to preserve order invariance

I want to recommend an item complementary to a cart of items. So, naturally, I thought of using embeddings to represent items, and I came up with a layer of this kind in Keras:
from keras.layers import Input, Embedding
item_input = Input(shape=(MAX_CART_SIZE,), name="item_id")
item_embedding = Embedding(input_dim=NB_ITEMS+1, input_length=MAX_CART_SIZE, output_dim=EMBEDDING_SIZE, mask_zero=True)
I used masking to handle the variable size of the carts. So, the dimensions of the output tensor of this layer are MAX_CART_SIZE x EMBEDDING_SIZE. It means that there are as many different embeddings as there are potential items. In other words, an item can be encoded in a different way depending on its position within the cart, and that is undesirable behavior... Yet it seems that most neural networks dealing with NLP data work this way, with embeddings associated not with words but with word/index pairs within a phrase.
So, what would be the correct way to preserve order invariance? In other words, I'd like the cart A,B,C to be strictly equivalent to the carts C,B,A or B,A,C in terms of input representation and generated output.
One way to get this invariance is to use a Transformer architecture WITHOUT positional embeddings. Each item is then encoded to an embedding and, because there is no positional embedding, that embedding is the same whether the item is in the first position or the last one.
Moreover, without positional embeddings the Transformer layers themselves do not depend on item order: permuting the input only permutes the per-item outputs, so a symmetric pooling over items (for example a sum or mean) gives an order-invariant result.
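As a small illustration of the property, and not of the Transformer itself, the sketch below replaces it with a much simpler stand-in: a shared lookup table followed by a masked sum. The table values, the sizes and the helper encode_cart are all hypothetical. It shows that a lookup gives an item the same vector at any position and that a symmetric pooling makes the cart representation independent of order.

```python
import numpy as np

rng = np.random.default_rng(0)
NB_ITEMS, EMBEDDING_SIZE = 10, 4

# One row per item id; row 0 is reserved for padding, as with mask_zero=True.
table = rng.normal(size=(NB_ITEMS + 1, EMBEDDING_SIZE))

def encode_cart(item_ids):
    """Look up each item and sum the vectors, ignoring padding id 0."""
    ids = np.asarray(item_ids)
    vectors = table[ids]           # (cart_size, EMBEDDING_SIZE)
    mask = (ids != 0)[:, None]     # False for padded slots
    return (vectors * mask).sum(axis=0)

# The same item gets the same vector wherever it sits in the cart.
same_vector = np.array_equal(table[[3, 7]][0], table[[7, 3]][1])

# Carts A,B,C and C,B,A (padded to length 5) get the same representation.
cart_abc = encode_cart([3, 7, 5, 0, 0])
cart_cba = encode_cart([5, 7, 3, 0, 0])
print(same_vector, np.allclose(cart_abc, cart_cba))  # True True
```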

Is there any difference between keras.utils.to_categorical and pd.get_dummies?

I think sklearn.preprocessing.OneHotEncoder, pandas.get_dummies and keras.utils.to_categorical serve the same purpose, but I don't know how they differ.
Apart from the input and output types there is no real difference; they all produce a one-hot encoding.
There are some practical differences:
Keras is the simplest: you give it the target vector and it one-hot encodes it. Use Keras if you need to encode the labels vector.
Pandas is the most complex: it creates a new column for every class in the data. The good part is that it works on DataFrames where you want to one-hot encode only one of the columns (so you could call it more of a multi-purpose method, but it is not the preferable option if you need to train a NN).
Sklearn lets you one-hot encode multiple features in the same variable, which is a bit more flexible than what Keras offers. If the Keras method is too simple, try sklearn; if Keras is enough, stick with it.
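For reference, the encoding that all three tools produce can be written directly in NumPy. This is a sketch with made-up labels and it does not call any of the three libraries.

```python
import numpy as np

labels = np.array([0, 2, 1, 2])  # hypothetical integer class labels
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes)[labels]
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```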

Implement CVAE for a single image

I have a multi-dimensional, hyperspectral image (channels, width, height = 15, 2500, 2500). I want to compress its 15 channels into 5 channels, so the output would be (channels, width, height = 5, 2500, 2500). One simple way to do this is to apply PCA. However, its performance is not so good, so I want to use a variational autoencoder (VAE).
The examples available for TensorFlow or Keras show a convolutional variational autoencoder (CVAE) clustering whole images.
However, I have a single image. What is the best practice for implementing a CVAE in this case? Is it to generate sample images with a moving-window approach?
One way of doing it would be to have a CVAE that takes as input (and output) the values of all the spectral features at each spatial coordinate (the stacks circled in red in the picture). So, in the case of your image, you would have 2500*2500 = 6250000 input samples, each a vector of length 15, and the middle layer would be a vector of length 5. Instead of the 2D convolutions that are normally applied along the spatial domain of images, it would make sense here to use 1D convolutions over the spectral domain (since the values of neighbouring wavelengths are also correlated). But I think using only fully-connected layers would also make sense.
As a disclaimer, I haven't seen CVAEs used in this way before, but this way you would also get many data samples, which is needed for the model to generalise well.
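The per-pixel data layout described above could be prepared roughly as follows. This is a sketch on a small stand-in array rather than the real 2500x2500 image. The slice used for latent is only a placeholder for the encoder output and is not a model.

```python
import numpy as np

# Small stand-in for the (15, 2500, 2500) image so the sketch runs quickly.
channels, height, width = 15, 6, 8
image = np.arange(channels * height * width, dtype=np.float32)
image = image.reshape(channels, height, width)

# Move channels last, then flatten the spatial axes: one row per pixel.
pixels = image.transpose(1, 2, 0).reshape(-1, channels)
print(pixels.shape)  # (48, 15)

# After encoding each row to 5 values, the inverse reshape restores the layout.
latent = pixels[:, :5]  # placeholder for the encoder output, not a real model
compressed = latent.reshape(height, width, 5).transpose(2, 0, 1)
print(compressed.shape)  # (5, 6, 8)
```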
Another option would indeed be what you suggested: to generate the samples (patches) using a moving window (maybe with a stride of half the patch size). Even though you wouldn't necessarily get enough data samples for the CVAE to generalise really well to all HSI images, I guess it doesn't matter (if it overfits), since you want to use it on that same image.
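The moving-window option can be sketched like this. The helper extract_patches, the patch size of 16 and the 64x64 stand-in image are assumptions for illustration; the stride is half the patch size, as suggested.

```python
import numpy as np

def extract_patches(image, patch, stride):
    """Cut a (channels, height, width) array into square patches."""
    _, height, width = image.shape
    patches = [
        image[:, top:top + patch, left:left + patch]
        for top in range(0, height - patch + 1, stride)
        for left in range(0, width - patch + 1, stride)
    ]
    return np.stack(patches)

image = np.zeros((15, 64, 64), dtype=np.float32)  # small stand-in image
patches = extract_patches(image, patch=16, stride=8)
print(patches.shape)  # (49, 15, 16, 16)
```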

Difference between feature_column.embedding_column and keras.layers.Embedding in TensorFlow

I have been using keras.layers.Embedding for almost all of my projects. But recently I wanted to fiddle around and found feature_column.embedding_column.
From the documentation:
feature_column.embedding_column -
DenseColumn that converts from sparse, categorical input.
Use this when your inputs are sparse, but you want to convert them to a dense
representation (e.g., to feed to a DNN).
keras.layers.Embedding - Turns positive integers (indexes) into dense vectors of fixed size.
e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
This layer can only be used as the first layer in a model.
My question is: do both APIs do a similar thing on different types of input data (e.g. input [0,1,2] for keras.layers.Embedding and its one-hot-encoded representation [[1,0,0],[0,1,0],[0,0,1]] for feature_column.embedding_column)?
After reviewing the source code for both operations, here is what I found:
both operations rely on tensorflow.python.ops.embedding_ops functionality;
keras.layers.Embedding uses dense representations and contains generic Keras code for handling shapes, initializing variables, etc.;
feature_column.embedding_column relies on sparse representations and contains functionality to cache results.
So your guess seems to be right: the two do similar things, rely on distinct input representations, and contain some extra logic that doesn't change the essence of what they do.
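Conceptually, the equivalence between the two input representations can be sketched in NumPy. This does not use either TensorFlow API, and the embedding table here is random and hypothetical. Indexing the table with integer ids selects the same rows as multiplying it by their one-hot encoding.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 3, 2
weights = rng.normal(size=(vocab_size, embedding_dim))  # hypothetical embedding table

ids = np.array([0, 1, 2])

# Dense path: integer ids index rows of the table directly.
by_lookup = weights[ids]

# One-hot path: multiplying the one-hot rows by the table selects the same rows.
one_hot = np.eye(vocab_size)[ids]
by_matmul = one_hot @ weights

print(np.allclose(by_lookup, by_matmul))  # True
```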

Tensorflow batching without extra None dimension?

Is it possible to do batching in tensorflow without expanding the placeholder size by an extra dimension of None? Specifically I'd just like to feed multiple samples via the placeholders through feed_dict. The code base I'm working on would require a large amount of change to the code to account for adding an extra dimension for the batch size.
eg:{var1:val1values, var2: val2values, ...})
Where val1values would represent a batch of size X instead of just one training sample.
The shape information including the number of dimensions is available to Python code to do arbitrary things with, and does affect the ops added to the graph (like which matmul kernel is used), so there's no general safe way to automatically add a batch dimension. Something like labeled_tensor may make code slightly less confusing to refactor.