patch-wise training and fully convolutional training in FCN - tensorflow

In the FCN paper, the authors discuss the patch wise training and fully convolutional training. What is the difference between these two?
Please refer to section 4.4 attached in the following.
It seems to me that the training mechanism is as follows,
Assume the original image is M*M, then iterate over the M*M pixels to extract N*N patches (where N<M). The iteration stride can be some number like N/3 to generate overlapping patches. Moreover, assume each single image corresponds to 20 patches, then we can put these 20 patches, or 60 patches (if we want to have 3 images), into a single mini-batch for training. Is this understanding right? It seems to me that this so-called fully convolutional training is the same as patch-wise training.

The term "Fully Convolutional Training" just means replacing fully-connected layer with convolutional layers so that the whole network contains just convolutional layers (and pooling layers).
The term "Patchwise training" is intended to avoid the redundancies of full image training.
In semantic segmentation, given that you are classifying each pixel in the image, by using the whole image you are adding a lot of redundancy in the input. A standard approach to avoid this when training segmentation networks is to feed the network with batches of random patches (small image regions surrounding the objects of interest) from the training set instead of full images. This "patchwise sampling" ensures that the input has enough variance and is a valid representation of the training dataset (the mini-batch should have the same distribution as the training set). This technique also helps to converge faster and to balance the classes. In this paper, they claim that it is not necessary to use patch-wise training, and that if you want to balance the classes you can weight or sample the loss.
In a different perspective, the problem with full image training in per-pixel segmentation is that the input image has a lot of spatial correlation. To fix this, you can either sample patches from the training set (patchwise training) or sample the loss from the whole image. That is why the subsection is called "Patchwise training is loss sampling".
In the paper's words, "restricting the loss to a randomly sampled subset of its spatial terms excludes patches from the gradient computation." They tried this "loss sampling" by randomly ignoring cells from the last layer, so the loss is not calculated over the whole image.
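To make "loss sampling" concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the function name `sampled_pixel_loss`, the `keep_prob` parameter, and the toy 8x8 image are assumptions made for illustration. It computes a per-pixel cross-entropy map and averages it over a randomly kept subset of spatial positions only.

```python
import numpy as np

def sampled_pixel_loss(probs, labels, keep_prob, rng):
    """Mean cross-entropy over a random subset of pixels.

    probs:  (H, W, C) per-pixel class probabilities (each pixel sums to 1).
    labels: (H, W) integer class ids.
    """
    h, w, _ = probs.shape
    rows, cols = np.indices((h, w))
    # Per-pixel cross-entropy: -log p(correct class).
    pixel_loss = -np.log(probs[rows, cols, labels])
    # Keep each spatial position with probability keep_prob; dropped
    # positions contribute nothing to the loss (hence nothing to the gradient).
    mask = rng.random((h, w)) < keep_prob
    kept = max(int(mask.sum()), 1)  # guard against an empty sample
    return float((pixel_loss * mask).sum() / kept), mask

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 8, 3))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
labels = rng.integers(0, 3, size=(8, 8))

full_loss, full_mask = sampled_pixel_loss(probs, labels, 1.0, rng)
sampled_loss, sampled_mask = sampled_pixel_loss(probs, labels, 0.25, rng)
```

With `keep_prob=1.0` this reduces to the ordinary whole-image loss; lower values drop spatial terms, which is the sense in which sampling the loss plays the role of sampling patches.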


Can SigmoidFocalCrossEntropy in Tensorflow (tf-addons) be used in Multiclass Classification? ( What is the right way)?

The Focal Loss provided in Tensorflow is used for class imbalance. For binary classification there is a lot of code available, but for multiclass classification there is very little help. I ran the code with one-hot encoded target variables of 250 classes and it gave me results without any error.
y = pd.get_dummies(df['target']) # One hot encoded target classes
model.compile(optimizer="adam", loss=tfa.losses.SigmoidFocalCrossEntropy(), metrics=metric)
I just want to know from whoever wrote this code, or someone with enough knowledge of it, whether it can be used for multiclass classification. If not, then how come it did not give me errors, and instead gave better results than CrossEntropy? Also, in other implementations like this one, the value of alpha has to be given for every class, but Tensorflow's implementation takes just one value.
What is the correct way to use this?
Some basics first.
Categorical Crossentropy is designed to incentivize a model to predict 100% for the correct label. It was designed for models that do single-label multi-class classification, like CIFAR10 or Imagenet. Usually these models finish in a Dense layer with more than one output.
Binary Crossentropy is designed to incentivize a model to predict 100% if the label is one, or 0% if the label is zero. Usually these models finish in a Dense layer with exactly one output.
When you apply Binary Crossentropy to a single-label multi-class classification problem, you are doing something that is mathematically valid but defines a slightly different task: you are incentivizing a single-label classification model to not only get the true label correct, but also minimize the false labels.
For example, if your target is dog, and your model predicts 60% dog, CCE doesn't care if your model predicts 20% cat and 20% French horn, or 40% cat and 0% French horn. So this is aligned with a top-1 accuracy concept.
But if you take that same model and apply BCE, and your model predicts 60% dog, BCE DOES care whether your model predicts 20%/20% cat/French horn or 40%/0% cat/French horn. To put it in precise terminology, the former is more "calibrated" and so it has some additional measure of goodness. However, this has little correlation to top-1 accuracy.
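The dog / cat / French horn example can be checked numerically. The sketch below is plain NumPy rather than the Keras losses; the helper names `cce` and `bce_sum` are made up here, and BCE is summed over classes (Keras averages over the last axis, which rescales the numbers but does not change the comparison).

```python
import numpy as np

def cce(y_true, y_pred, eps=1e-7):
    """Categorical cross-entropy for one example: only the true class counts."""
    p = np.clip(y_pred, eps, 1.0)
    return float(-(y_true * np.log(p)).sum())

def bce_sum(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy applied to every class output, summed."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return float(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)).sum())

y_true = np.array([1.0, 0.0, 0.0])   # dog, cat, French horn
pred_a = np.array([0.6, 0.2, 0.2])   # 60% dog, 20% cat, 20% French horn
pred_b = np.array([0.6, 0.4, 0.0])   # 60% dog, 40% cat,  0% French horn

cce_a, cce_b = cce(y_true, pred_a), cce(y_true, pred_b)
bce_a, bce_b = bce_sum(y_true, pred_a), bce_sum(y_true, pred_b)
print(cce_a, cce_b)   # identical: CCE ignores how the remaining 40% is split
print(bce_a, bce_b)   # different: BCE prefers the 20%/20% split
```

CCE gives the same value for both predictions, while BCE is lower for the 20%/20% split than for 40%/0%.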
When you use BCE, presumably you are wasting the model's energy on calibration at the expense of top-1 accuracy. But as you might have seen, it doesn't always work out that way. Sometimes BCE gives you superior results. I don't know that there's a clear explanation of that, but I'd assume that the additional signals (in the case of Imagenet, you'll literally get 1000 times more signals) somehow create a smoother loss value that perhaps helps smooth the gradients you receive.
The focusing term of focal loss (the gamma exponent in the original paper) lessens the penalty if your model predicts something close to the right answer, like predicting 90% cat if the ground truth is cat, so that very wrong predictions dominate the loss. The alpha value is a separate weight used to balance the classes. This would be a shift from the original definition of CCE, based on the theory of Maximum Likelihood Estimation, which focuses on calibration, versus the normal metric most ML practitioners care about: top-1 accuracy.
Focal loss was originally designed for binary classification so the original formulation only has a single alpha value. The repo you pointed to extends the concept of Focal Loss to single-label classification and therefore there are multiple alpha values: one per class. However, by my read, it loses the additional possible smoothing effect of BCE.
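For reference, here is a minimal NumPy sketch of the sigmoid (binary) focal loss formula applied to a one-hot target. It follows the published formula, with alpha as a class-balancing weight and gamma as the focusing exponent, but it is not the TensorFlow Addons source. The function name is an assumption, and so is letting `alpha` be either a scalar or a per-class vector, which is one way a single implementation could cover both variants discussed above.

```python
import numpy as np

def sigmoid_focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss on independent sigmoid outputs, summed over classes.

    alpha: scalar, per-class array, or None to disable class weighting.
    gamma: focusing exponent; gamma=0 with alpha=None is plain BCE.
    """
    p = np.clip(y_pred, eps, 1.0 - eps)
    bce = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    # p_t is the probability the model assigned to the correct binary outcome.
    p_t = y_true * p + (1 - y_true) * (1 - p)
    modulating = (1.0 - p_t) ** gamma
    if alpha is None:
        alpha_t = 1.0
    else:
        alpha = np.asarray(alpha, dtype=float)
        alpha_t = y_true * alpha + (1 - y_true) * (1 - alpha)
    return float((alpha_t * modulating * bce).sum())

y_true = np.array([1.0, 0.0, 0.0])
easy = np.array([0.90, 0.05, 0.05])   # confident and correct
hard = np.array([0.30, 0.40, 0.30])   # mostly wrong

plain_easy = sigmoid_focal_loss(y_true, easy, alpha=None, gamma=0.0)
plain_hard = sigmoid_focal_loss(y_true, hard, alpha=None, gamma=0.0)
focal_easy = sigmoid_focal_loss(y_true, easy, alpha=None, gamma=2.0)
focal_hard = sigmoid_focal_loss(y_true, hard, alpha=None, gamma=2.0)
per_class = sigmoid_focal_loss(y_true, hard, alpha=[0.5, 0.25, 0.25])
```

With `gamma=2.0` the easy example is down-weighted far more strongly than the hard one, which is the intended effect of the focusing term.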
Net net, for the best results, you'll want to benchmark CCE, BCE, Binary Focal Loss (out of TFA and per the original paper), and the single-label multi-class Focal Loss that you found in that repo. In general, the discovery of those alpha values is done via guess & check, or grid search.
There's a lot of manual guessing and checking in ML unfortunately.

Large trainable embedding layer slows down training

I am training a network to classify text with a LSTM. I use a randomly initialized and trainable embedding layer for the word inputs. The network is trained with the Adam Optimizer and the words are fed into the network with a one-hot-encoding.
I noticed that the number of words which are represented in the embedding layer influences heavily the training time, but I don't understand why. Increasing the number of words in the network from 200'000 to 2'000'000 almost doubled the time for a training epoch.
Shouldn't the training only update the weights which were used during the prediction of the current data point? Thus, if my input sequence always has the same length, the same number of updates should always happen, regardless of the size of the embedding layer.
The number of updates needed would be reflected in the number of epochs it takes to reach a certain precision.
If your observation is that convergence takes the same number of epochs, but each epoch takes twice as much wall clock time, then it's an indication that simply performing the embedding lookup (and writing the update of embedding table) now takes a significant part of your training time.
Which could easily be the case. 2'000'000 words times 4 bytes per float32 times the length of your embedding vector (what is it? let's assume 200) is something like 1.6 gigabytes of data that needs to be touched every minibatch. You're also not saying how you're training this (CPU, GPU, which GPU), which has a meaningful impact on how this should turn out because of e.g. cache effects: on a CPU, doing the exact same number of reads/writes in a slightly less cache-friendly manner (more sparsity) can easily double the execution time.
Also, your premise is a bit unusual. How much labeled data do you have that would have enough examples of the #2000000th rarest word to calculate a meaningful embedding directly? It's probably possible, but would be unusual. In pretty much all datasets, including very large ones, the #2000000th word would be a nonce word, and thus it'd be harmful to include it in trainable embeddings. The usual scenario would be to calculate large embeddings separately from large unlabeled data and use them as a fixed untrainable layer, and possibly concatenate them with small trainable embeddings from labeled data to capture things like domain-specific terminology.
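The back-of-the-envelope figure above can be reproduced directly. The embedding length of 200 is the assumption made in the paragraph, not a value given in the question.

```python
# Size of a float32 embedding table, using the numbers from the answer.
bytes_per_float32 = 4
embedding_dim = 200          # assumed; the question does not state it

large_vocab = 2_000_000
small_vocab = 200_000

large_table_gb = large_vocab * embedding_dim * bytes_per_float32 / 1e9
small_table_gb = small_vocab * embedding_dim * bytes_per_float32 / 1e9
print(large_table_gb, small_table_gb)  # 1.6 GB versus 0.16 GB
```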
If I understand correctly, your network takes one-hot vectors representing words to embeddings of some size embedding_size. Then the embeddings are fed as input to an LSTM. The trainable variables of the network are both those of the embedding layer and the LSTM itself.
You are correct regarding the update of the weights in the embedding layer. However, the number of weights in one LSTM cell depends on the size of the embedding. If you look for example at the equation for the forget gate of the t-th cell, in its standard form
f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f)
you can see that the matrix of weights W_f is multiplied by the input x_t, meaning that one of the dimensions of W_f must be exactly embedding_size. So as embedding_size grows, so does the network size, so it takes longer to train.
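To put numbers on the dependence of W_f on embedding_size, the following sketch counts the parameters of a standard LSTM cell. It assumes four gates, each with one weight matrix over the concatenation [h_{t-1}, x_t] and one bias vector; the function name and the sizes 200, 400 and 128 are illustrative only.

```python
def lstm_param_count(embedding_size, hidden_size):
    """Trainable parameters in one standard LSTM layer (no peepholes)."""
    # Each gate has a (hidden_size x (hidden_size + embedding_size)) weight
    # matrix applied to [h_{t-1}, x_t], plus a bias of length hidden_size.
    per_gate = (hidden_size + embedding_size) * hidden_size + hidden_size
    return 4 * per_gate  # forget, input, candidate, output

small = lstm_param_count(embedding_size=200, hidden_size=128)
large = lstm_param_count(embedding_size=400, hidden_size=128)
print(small, large)
```

Under this formula the count depends on embedding_size and hidden_size; the number of words in the vocabulary does not appear in it.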

Is batchnorm used in neural networks that are not CNN?

1.) Batchnorm is always used in deep convolutional neural networks. But is it also used in non-CNN networks, i.e. in networks with just fully-connected layers?
2.) Is batchnorm used in shallow CNNs?
3.) Suppose I have a CNN with an input image and an input array IN_array. The output after the last fully-connected layer is an array that I call FC_array. I want to concat that FC_array with the IN_array:
CONCAT_array = tf.concat(values=[FC_array, IN_array], axis=-1)
Is it useful to have a batchnorm after the concat layer? Or should that batchnorm be placed just after the FC_array, before the concat layer?
For information, the IN_array is a tf.one_hot() vector.
Thank you
TL;DR: 1. Yes 2. Yes 3. No
Batch normalization was a great invention by Sergey Ioffe and Christian Szegedy in early 2015. Back in those days, battling vanishing or exploding gradients was an everyday problem. Read that article if you want to gain a deep understanding, but basically this quote from the abstract should give you some idea:
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs.
They did in fact first use batch normalization for DCNNs, which allowed them to beat human performance in the top-5 ImageNet classification, but any network where there are nonlinearities can benefit from batch normalization, including a network consisting of fully-connected layers.
Yes, it is used for shallow CNNs too. Any network with more than one layer can benefit from it, although deeper networks benefit more.
First of all, one-hot vectors should never be normalized. Normalization means you subtract the mean and divide by the standard deviation, thus creating a dataset with zero mean and unit variance. If you do this to a one-hot vector, then the cross-entropy loss calculation will be completely off. Second, there is no point in normalizing a concat layer separately, since it does not change the values, it just concatenates them. Batch normalization is done on the input of a layer, so the layer after the concat, which receives the concatenated values, can do it if necessary.
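A small NumPy sketch of what standardizing does to a one-hot batch, under the assumption that the statistics are computed per column over the batch (as batch normalization does in training mode). The batch contents are made up for illustration.

```python
import numpy as np

# Batch of 6 one-hot rows over 4 classes (class ids chosen arbitrarily).
one_hot = np.eye(4)[[0, 1, 1, 2, 3, 0]]

mean = one_hot.mean(axis=0)
std = one_hot.std(axis=0)
normalized = (one_hot - mean) / (std + 1e-5)

print(one_hot[0])      # a clean indicator vector
print(normalized[0])   # no longer 0/1, and the entries depend on batch statistics
```

After normalization the entries are neither 0 nor 1 and the rows no longer sum to 1, so the result can no longer be read as a class indicator.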

tensorflow: batches of variable-sized images

When one passes tensors to tf.train.batch, it looks like the shape of each element has to be strictly defined, or else it complains that "All shapes must be fully defined" if there exist Tensors with shape Dimension(None). How, then, does one train on images of different sizes?
You could set dynamic_pad=True in the argument of tf.train.batch.
dynamic_pad: Boolean. Allow variable dimensions in input shapes. The given dimensions are padded upon dequeue so that tensors within a batch have the same shapes.
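The effect of padding a batch to a common shape can be sketched outside TensorFlow. This is a NumPy illustration only: the helper name `pad_batch` is made up, and zero padding at the bottom and right is an assumption here rather than a statement of how tf.train.batch is implemented.

```python
import numpy as np

def pad_batch(images):
    """Stack (H, W, C) images of different sizes into one zero-padded batch."""
    max_h = max(img.shape[0] for img in images)
    max_w = max(img.shape[1] for img in images)
    channels = images[0].shape[2]
    batch = np.zeros((len(images), max_h, max_w, channels), dtype=images[0].dtype)
    for i, img in enumerate(images):
        h, w = img.shape[:2]
        batch[i, :h, :w] = img   # original pixels top-left, zeros elsewhere
    return batch

images = [np.ones((2, 3, 1), dtype=np.float32),
          np.ones((4, 2, 1), dtype=np.float32)]
batch = pad_batch(images)
print(batch.shape)  # every image now has the largest height and width
```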
Usually, images are resized to a certain number of pixels.
Depending on your task, you might be able to use other techniques to process images of varying sizes. For example, for face recognition and OCR, a fixed-size window is used, which is then moved over the image. For other tasks, convolutional neural networks with pooling layers or recurrent neural networks can be helpful.
I see that this is quite an old question, but in case someone is searching for how variable-size images can still be used in batches, I can tell what I did for an image-to-image convolutional network (inference) which was trained with variable image size and batch size 1. Why: when I tried to process images in batches using padding, the results became much worse, because the signal was "spreading" inside the network and started to influence its convolution pyramids.
What I did is possible when you have the source code and can load weights manually into the convolutional layers. I modified the network in the following way: along with a batch of zero-padded images, I added an additional placeholder which received a batch of binary masks with 1 where actual data was on the patch, and 0 where padding was applied. Then I multiplied the signal by these masks after each convolutional layer inside the network, fighting the "spreading". Multiplication isn't an expensive operation, so it did not affect performance much.
The result was no longer deformed, but still had some border artifacts, so I modified this approach further by adding a small (2px) symmetric padding around the input images (the kernel size of all the layers of my CNN was 3), and kept it during propagation by using a slightly bigger (+[2px,2px]) mask.
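The "spreading" effect and the masking fix can be reproduced on a toy example. This is a NumPy sketch, not the network described above: the naive `conv3x3_same` helper, the 6x6 canvas and the averaging kernel are all assumptions for illustration, and the extra 2px symmetric padding is not modeled.

```python
import numpy as np

def conv3x3_same(x, kernel):
    """Naive 3x3 'same' cross-correlation of a 2-D array with zero padding."""
    padded = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out

# A 4x4 image placed on a zero-padded 6x6 canvas, plus its binary mask.
canvas = np.zeros((6, 6))
canvas[:4, :4] = 1.0
mask = np.zeros((6, 6))
mask[:4, :4] = 1.0
kernel = np.full((3, 3), 1.0 / 9.0)

unmasked = conv3x3_same(canvas, kernel)
masked = unmasked * mask                       # applied after the conv layer
leak = np.abs(unmasked * (1 - mask)).sum()     # signal that spread into padding
print(leak)
```

Without the mask, the convolution writes non-zero values into the padded area, and the next layer would read them back; multiplying by the mask after each layer resets that area to zero.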
One can apply the same approach to training as well. Then some sort of "masked" loss is needed, where only the ROI on each patch is used to calculate the loss. For example, for an L1/L2 loss you can calculate the difference image between the generated and label images and apply the masks before summing up. More complicated losses might involve unstacking or iterating over the batch, and extracting the ROI using tf.where or tf.boolean_mask.
Such training can indeed be beneficial in some cases, because you can combine small and big inputs for the network without the small inputs being affected by the loss of big padded surroundings.
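A masked L1 loss of the kind described can be sketched in a few lines of NumPy. The function name `masked_l1` and the normalization by the number of valid pixels are choices made here for illustration; the answer itself only specifies applying the mask before summing.

```python
import numpy as np

def masked_l1(pred, target, mask):
    """L1 loss averaged over valid (mask == 1) pixels only."""
    diff = np.abs(pred - target) * mask      # padded pixels contribute zero
    return float(diff.sum() / max(float(mask.sum()), 1.0))

target = np.zeros((6, 6))
mask = np.zeros((6, 6))
mask[:4, :4] = 1.0                           # 4x4 region of real data

pred_bad_padding = target.copy()
pred_bad_padding[4:, :] = 5.0                # errors only inside the padding
pred_bad_roi = target.copy()
pred_bad_roi[:4, :4] = 0.5                   # errors inside the real image

loss_padding = masked_l1(pred_bad_padding, target, mask)
loss_roi = masked_l1(pred_bad_roi, target, mask)
print(loss_padding, loss_roi)
```

Errors in the padded surroundings are ignored, while errors inside the ROI are averaged over the ROI size, so small and large inputs are weighted consistently.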

How to handle the BatchNorm layer when training fully convolutional networks by finetuning?

Training fully convolutional networks (FCNs) for pixelwise semantic segmentation is very memory intensive, so we often use batchsize=1 for training FCNs. However, when we finetune pretrained networks with BatchNorm (BN) layers, batchsize=1 doesn't make sense for the BN layers. So, how should the BN layers be handled?
Some options:
Delete the BN layers (merge the BN layers with the preceding layers for the pretrained model)
Freeze the parameters and statistics of the BN layers
Which is better, and is there any demo of an implementation in pytorch/tf/caffe?
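The first option, merging BN into the preceding layer, amounts to rescaling that layer's weights and bias with the stored BN statistics. Below is a framework-free NumPy sketch for a fully-connected layer; the function name and shapes are assumptions, and for a convolution the same per-output-channel scale would be broadcast over the kernel dimensions.

```python
import numpy as np

def fold_bn_into_linear(weight, bias, gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold frozen BN statistics into the preceding layer's weight and bias.

    weight: (out_features, in_features); all BN arrays: (out_features,).
    """
    scale = gamma / np.sqrt(running_var + eps)
    folded_weight = weight * scale[:, None]
    folded_bias = (bias - running_mean) * scale + beta
    return folded_weight, folded_bias

rng = np.random.default_rng(0)
weight, bias = rng.normal(size=(5, 3)), rng.normal(size=5)
gamma, beta = rng.normal(size=5), rng.normal(size=5)
running_mean, running_var = rng.normal(size=5), rng.uniform(0.5, 2.0, size=5)
x = rng.normal(size=(7, 3))

# Original path: layer followed by BN in inference mode.
y = x @ weight.T + bias
y_bn = gamma * (y - running_mean) / np.sqrt(running_var + 1e-5) + beta

# Folded path: a single layer, no BN.
folded_weight, folded_bias = fold_bn_into_linear(
    weight, bias, gamma, beta, running_mean, running_var)
y_folded = x @ folded_weight.T + folded_bias
```

The two paths produce the same outputs, so the folded network starts finetuning from exactly the pretrained function, just without any batch statistics left to estimate.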
Having only one element will make the batch normalization output zero if epsilon is non-zero (the variance is zero and the mean is the same as the input).
It's better to delete the BN layers from the network and try the activation function SELU (scaled exponential linear units). This is from the paper 'Self normalizing neural networks' (SNNs).
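The claim is easy to verify for a fully-connected activation with a batch of one. This is a NumPy sketch of the training-mode normalization step only, without the learned scale and shift; the helper name and the sample values are made up.

```python
import numpy as np

def batch_norm_train(x, axes, eps=1e-5):
    """Normalize x with statistics computed over the given axes."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

single = np.array([[0.5, -1.2, 3.0]])    # batch of 1, 3 features
out = batch_norm_train(single, axes=0)
print(out)  # every feature equals its own batch mean, so the output is all zeros
```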
Quote from the paper:
"While batch normalization requires explicit normalization, neuron activations of SNNs automatically converge towards zero mean and unit variance. The activation function of SNNs are “scaled exponential linear units” (SELUs), which induce self-normalizing properties."
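The SELU is defined as:
import tensorflow as tf

def selu(x, name="selu"):
    # Constants from the self-normalizing networks paper.
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    # Identity for x >= 0, scaled ELU (alpha * (exp(x) - 1)) otherwise.
    return scale * tf.where(x >= 0.0, x, alpha * tf.nn.elu(x))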
Batch Normalization was introduced to reduce the internal covariate shift of the input feature maps. Because the parameters of each layer change after every optimization step, the input distribution of a layer also changes, which slows down model convergence. By using Batch Normalization we can normalize the input distribution irrespective of the batch_size (whether batch_size = 1 or larger).
BN normalizes the input distribution
For a convolutional network, the input to an intermediate layer is a 4D tensor, [batch_size, width, height, num_filters]. Normalization affects all the feature maps.
delete the BN layers (merge the BN layers with the preceding layers for the pretrained model)
This may further slow down training, and convergence may not be achieved.
Freeze the parameters and statistics of the BN layers
Sometimes the input data distribution for retraining/finetuning may vary significantly from the original data used to train the pretrained model used for initialization, due to which your model may end up in a non-optimal solution.
According to my experiments in PyTorch, if the convolutional layer before the BN outputs more than one value (i.e. 1 x feat_nb x height x width, where height > 1 or width > 1), then the BN still works fine even when the batch size is equal to one. However, I suspect that in this case the variance estimate might be very biased, since all samples that are used for the variance calculation come from the same image. Therefore in my case I still decided to use a small batch.
The effective batch size over convolutional layer
I think the CNN-related section (Section 3.2) in the original BN paper could help. From the point of view of the authors, it should be OK to use batch size = 1 for convolutional layers. The "effective batch size" for a convolutional layer is actually batch_size * image_height * image_width.
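The "effective batch size" argument can be illustrated with NumPy. This sketch assumes an N x C x H x W layout and per-channel statistics taken over the batch and both spatial axes, without the learned scale and shift; the tensor sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# One image (batch size 1), 4 channels, 16x16 feature map.
x = rng.normal(loc=3.0, scale=2.0, size=(1, 4, 16, 16))

# Convolutional BN: one mean/variance per channel, over N, H and W.
mean = x.mean(axis=(0, 2, 3), keepdims=True)
var = x.var(axis=(0, 2, 3), keepdims=True)
out = (x - mean) / np.sqrt(var + 1e-5)

effective_batch = x.shape[0] * x.shape[2] * x.shape[3]
print(effective_batch)                 # 256 values per channel despite batch size 1
print(out.std(axis=(0, 2, 3)))         # close to 1 for every channel
```

Unlike the fully-connected case with a single sample, the output is not degenerate, because each channel's statistics are estimated from all spatial positions. They do all come from the same image, which is the bias concern raised in the other answer.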
I do not have an exact answer, but here are my thoughts:
"networks with BatchNorm (BN) layers, batchsize=1 doesn't make sense for the BN layers"
The main motivation of BN is to fix the distribution (mean/variance) of the input in the batch. In my opinion, with only one element this does not make sense. Judging from the paper, you will need to calculate the mean and the variance for 1 element, which does not make sense.
You can always just remove BN but are you sure you can't afford at least 16 elements in the batch?
My observation is contrary to Stephan's: using PyTorch on a similar input of batch x feat_nb x height x width, where height > 1 or width > 1, I found that adding BatchNorm after the last conv and before the last non-linearity (sigmoid) actually hurts the accuracy by a big margin (batch size = 8). Still trying to make sense of it.