Setting learningRateMultiplier in CNTK from within Python

I am loading a pre-trained network and would like to change/set the "learningRateMultiplier" for various layers. I have done this before using BrainScript (see the link below), but now need to do it from within Python. Is this supported? Or is there any other way in Python to set layer-specific learning rates?
BrainScript:
https://github.com/Microsoft/CNTK/wiki/Parameters-And-Constants
To give some context: I would like to fine-tune all layers in Fast R-CNN training including the conv layers. However past experiments indicate that this requires smaller learning rates for the conv layers compared to the fc layers (perhaps due to the gradients from all ROIs being added up or otherwise combined).
Thanks,
Patrick

Unless a better alternative surfaces, I'd recommend creating two learners with different learning rates and disjoint parameters arguments. You can pass a list of learners to your model's Trainer, which will coordinate their use during training.
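A minimal sketch of that idea, assuming model is the loaded pre-trained network and loss/metric are its criterion nodes; the name-based conv/fc split is my own illustrative heuristic, so adapt the predicate to your network's actual parameter names:

import cntk as C

# Hypothetical split: assumes conv parameters can be identified by name.
conv_params = [p for p in model.parameters if 'conv' in p.name.lower()]
conv_uids = set(p.uid for p in conv_params)
fc_params = [p for p in model.parameters if p.uid not in conv_uids]

# One learner per parameter group, each with its own learning rate.
lr_conv = C.learning_rate_schedule(0.0001, C.UnitType.minibatch)
lr_fc = C.learning_rate_schedule(0.001, C.UnitType.minibatch)
conv_learner = C.sgd(conv_params, lr=lr_conv)
fc_learner = C.sgd(fc_params, lr=lr_fc)

# The Trainer accepts a list of learners; each updates only its own parameters.
trainer = C.Trainer(model, (loss, metric), [conv_learner, fc_learner])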

Related

In a CNN, should we have more layers with fewer filters or fewer layers with more filters?

I am trying to understand the effect of adding more layers versus increasing the number of filters in the existing layers of a CNN. Consider a CNN with 2 hidden layers and 64 filters each, and a second CNN with 4 hidden layers and 32 filters each. The kernel sizes are the same in both CNNs, and the number of parameters is also the same in both cases.
Which one should I expect to perform better? I am thinking in terms of hyperparameter tuning, batch sizes, training time, etc.
In CNNs, the deeper layers correspond to higher-level features in images. In the first layer you get low-level features such as edges and lines; the network then uses these layers to build more complex features: using edges to find circles, using circles to find wheels, using wheels to find a car (I skipped a couple of steps, but you get the gist).
So to answer your question, you need to consider how complex your problem is. If you are working on something like ImageNet, you can expect the model with more layers to have the edge. On the other hand, for problems like MNIST, you don't need high-level features.
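If you want to sanity-check the "same number of parameters" premise, one quick approach is to build both variants and compare their summaries. A Keras sketch, where the 32x32x3 input and 3x3 kernels are placeholder assumptions (the counts depend on both, so they will only match approximately):

import tensorflow as tf

# Build a small CNN of a given depth and width.
def make_cnn(num_layers, num_filters):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.InputLayer(input_shape=(32, 32, 3)))
    for _ in range(num_layers):
        model.add(tf.keras.layers.Conv2D(num_filters, 3, padding="same",
                                         activation="relu"))
    model.add(tf.keras.layers.GlobalAveragePooling2D())
    model.add(tf.keras.layers.Dense(10, activation="softmax"))
    return model

make_cnn(2, 64).summary()  # shallow and wide
make_cnn(4, 32).summary()  # deep and narrow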

Change the spatial input dimension during training

I am training a YOLOv4 (fully convolutional) model in TensorFlow 2.3.0.
I would like to change the spatial input shape of the network during training, to further adjust the weights to different scales.
Is this possible?
EDIT:
I know of the existence of darknet, but it lacks some very specific augmentations I use and have implemented in my repo; that is why I am asking explicitly about TensorFlow.
To be more precise about what I want to do:
I want to train for several batches at Y1xX1xC, then change the input size to Y2xX2xC, train again for several batches, and so on.
It is not possible. In the past people trained several networks for different scales but the current state-of-the-art approach is feature pyramids.
https://arxiv.org/pdf/1612.03144.pdf
Another strong candidate is dilated convolutions, which can learn long-distance dependencies among pixels at varying distances. You can concatenate their outputs, and the model will then learn which distance matters for which case.
https://towardsdatascience.com/review-dilated-convolution-semantic-segmentation-9d5a5bd768f5
It's important to mention which TensorFlow repository you're using, but you can definitely achieve this. The idea is to keep the spatial input dimension fixed within a single batch.
But an even better approach is to use the darknet repository from AlexeyAB: https://github.com/AlexeyAB/darknet
Just set random = 1 in https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov4.cfg (line 1149). It will train your network with randomly varying spatial dimensions.
One thing you can do is start your training with the AlexeyAB repo with random=1 set, then take the trained weights file to TensorFlow for fine-tuning.
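If you want to stay entirely in TensorFlow, here is a minimal sketch of the fixed-size-per-batch idea, using a toy fully convolutional model and random tensors as stand-ins for real data (this is an illustration, not the asker's YOLOv4 code):

import tensorflow as tf

# Fully convolutional model with unspecified spatial dimensions, so any
# HxW can be fed at training time.
inputs = tf.keras.Input(shape=(None, None, 3))
x = tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
outputs = tf.keras.layers.Conv2D(1, 1)(x)
model = tf.keras.Model(inputs, outputs)

optimizer = tf.keras.optimizers.Adam(1e-3)
loss_fn = tf.keras.losses.MeanSquaredError()

# Train several batches at one scale, then switch: the spatial size is
# fixed within each batch but can differ between batches.
for height, width in [(256, 256), (320, 320), (416, 416)]:
    for _ in range(10):  # several batches per scale
        images = tf.random.uniform((8, height, width, 3))   # stand-in for data
        targets = tf.random.uniform((8, height, width, 1))  # stand-in for labels
        with tf.GradientTape() as tape:
            loss = loss_fn(targets, model(images, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))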

Strategies for pre-training models for use in tfjs

This is a more general version of a question I've already asked: Significant difference between outputs of deep tensorflow keras model in Python and tensorflowjs conversion
As far as I can tell, the layers of a tfjs model when run in the browser (so far only tested in Chrome and Firefox) will have small numerical differences in the output values when compared to the same model run in Python or Node. The cumulative effect of these small differences across all the layers of the model can cause fairly significant differences in the output. See here for an example of this.
This means a model trained in Python or Node will not perform as well in terms of accuracy when run in the browser. And the deeper your model, the worse it will get.
Therefore my question is, what is the best way to train a model to use with tfjs in the browser? Is there a way to ensure the output will be identical? Or do you just have to accept that there will be small numerical differences and, if so, are there any methods that can be used to train a model to be more resilient to this?
This answer is based on my personal observations, so it is debatable and not backed by much evidence. Some things I do to get the accuracy of 16-bit models close to that of 32-bit models:
Avoid activations with small upper and lower bounds, such as sigmoid or tanh, for hidden layers. These activations make the weights of the next layer very sensitive to small values, and hence to small changes. I prefer using ReLU for such models; since it is now the standard activation for hidden layers in most models, you should be using it in any case.
Avoid weight decay and L1/L2 regularization on weights while training (the kernel_regularizer parameter in Keras), since these increase the sensitivity of the weights. Use Dropout instead; I didn't observe a major drop in performance on TFLite when using it in place of numerical regularizers.
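Put together, a hypothetical model following both recommendations might look like this (layer sizes and input shape are arbitrary placeholders):

import tensorflow as tf

# ReLU in the hidden layers (no sigmoid/tanh) and Dropout instead of
# kernel_regularizer.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dropout(0.3),  # regularize without penalizing weights
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(10, activation="softmax"),  # bounded output layer is fine
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")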

What does DNN mean in a TensorFlow Estimator.DNNClassifier?

I'm guessing that DNN in the sense used in TensorFlow means "deep neural network". But I find this deeply confusing since the notion of a "deep" neural network seems to be in wide use elsewhere to mean a network with typically several convolutional and/or associated layers (ReLU, pooling, dropout, etc).
In contrast, in the first place many people will encounter this term (the tf.estimator Quickstart example code) we find:
# Build a 3-layer DNN with 10, 20, 10 units respectively.
classifier = tf.estimator.DNNClassifier(feature_columns=feature_columns,
                                        hidden_units=[10, 20, 10],
                                        n_classes=3,
                                        model_dir="/tmp/iris_model")
This sounds suspiciously shallow, and even more suspiciously like an old-style multilayer perceptron (MLP) network. However, there is no mention of DNN as an alternative term on that close-to-definitive source. So is a DNN in the TensorFlow tf.estimator context actually an MLP? Documentation on the hidden_units parameter suggests this is the case:
hidden_units: Iterable of number hidden units per layer. All layers are fully connected. Ex. [64, 32] means first layer has 64 nodes and second one has 32.
That has MLP written all over it. Is this understanding correct? Is DNN therefore a misnomer, and if so should DNNClassifier ideally be deprecated in favour of MLPClassifier? Or does DNN stand for something other than deep neural network?
Give me your definition of a "deep" neural network and you get your answer.
But yes, it is simply an MLP, and a proper name would indeed be MLPClassifier. But that does not sound as cool as the current name.
First of all, your definition of DNN is a bit misleading.
There are several architectures of deep neural networks. A deep feedforward network, for instance, is nothing more than a multilayered MLP, plus some techniques that make it attractive.
Some works have used "DNN" to span all deep learning architectures; however, by convention, "DNN" refers to architectures that use deep forward propagation, also called deep feedforward networks.
The most important example of a deep learning model is the deep feedforward network, or multilayer perceptron (MLP). An MLP is just a mathematical function that maps some set of input values to output values; the function is formed by the composition of many simpler functions. You can think of each application of a different function as providing a new representation of the input.
Therefore, it makes sense that this estimator is called DNNClassifier.
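To make the equivalence concrete, here is a sketch of the same [10, 20, 10] network from the Quickstart written as an explicit stack of fully connected layers (an illustration in Keras, not the Estimator's actual implementation; the ReLU activations and four-feature iris input are assumptions):

import tensorflow as tf

# The same topology as the DNNClassifier example -- i.e., a plain MLP.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(20, activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),  # n_classes=3
])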
My advice is to read this book here.

Prevention of overfitting in convolutional layers of a CNN

I'm using TensorFlow to train a Convolutional Neural Network (CNN) for a sign language application. The CNN has to classify 27 different labels, so unsurprisingly, a major problem has been addressing overfitting. I've taken several steps to accomplish this:
I've collected a large amount of high-quality training data (over 5000 samples per label).
I've built a reasonably sophisticated pre-processing stage to help maximize invariance to things like lighting conditions.
I'm using dropout on the fully-connected layers.
I'm applying L2 regularization to the fully-connected parameters.
I've done extensive hyper-parameter optimization (to the extent possible given HW and time limitations) to identify the simplest model that can achieve close to 0% loss on training data.
Unfortunately, even after all these steps, I'm finding that I can't achieve much better than about 3% test error. (That's not terrible, but for the application to be viable, I'll need to improve it substantially.)
I suspect that the source of the overfitting lies in the convolutional layers since I'm not taking any explicit steps there to regularize (besides keeping the layers as small as possible). But based on examples provided with TensorFlow, it doesn't appear that regularization or dropout is typically applied to convolutional layers.
The only approach I've found online that explicitly deals with prevention of overfitting in convolutional layers is a fairly new approach called Stochastic Pooling. Unfortunately, it appears that there is no implementation for this in TensorFlow, at least not yet.
So in short, is there a recommended approach to prevent overfitting in convolutional layers that can be achieved in TensorFlow? Or will it be necessary to create a custom pooling operator to support the Stochastic Pooling approach?
Thanks for any guidance!
How can I fight overfitting?
Get more data (or data augmentation)
Dropout (see paper, explanation, dropout for CNNs)
DropConnect
Regularization (see my master's thesis, page 85, for examples; a TensorFlow sketch follows this list)
Feature scale clipping
Global average pooling
Make network smaller
Early stopping
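Several of these apply directly to convolutional layers in TensorFlow. A minimal Keras sketch combining dropout, L2 on conv kernels, and global average pooling (my own illustration with placeholder layer sizes and input shape, not the asker's architecture):

import tensorflow as tf

l2 = tf.keras.regularizers.l2(1e-4)
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", kernel_regularizer=l2,
                           input_shape=(64, 64, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Dropout(0.25),  # dropout after a conv block
    tf.keras.layers.Conv2D(64, 3, activation="relu", kernel_regularizer=l2),
    tf.keras.layers.GlobalAveragePooling2D(),  # global average pooling from the list
    tf.keras.layers.Dense(27, activation="softmax"),  # 27 sign-language labels
])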
How can I improve my CNN?
Thoma, Martin. "Analysis and Optimization of Convolutional Neural Network Architectures." arXiv preprint arXiv:1707.09725 (2017).
See chapter 2.5 for analysis techniques. As written in the beginning of that chapter, you can usually do the following:
(I1) Change the problem definition (e.g., the classes which are to be distinguished)
(I2) Get more training data
(I3) Clean the training data
(I4) Change the preprocessing (see Appendix B.1)
(I5) Augment the training data set (see Appendix B.2)
(I6) Change the training setup (see Appendices B.3 to B.5)
(I7) Change the model (see Appendices B.6 and B.7)
Misc
The CNN has to classify 27 different labels, so unsurprisingly, a major problem has been addressing overfitting.
I don't understand how these are connected. You can have hundreds of labels without overfitting being a problem.