Difference between spaCy models sm, md, lg

I can see that in the English spaCy models the medium model performs better than the small one, and the large model outperforms the medium one, but only marginally. However, the model descriptions say that they have all been trained on OntoNotes. The exception is the vectors of md and lg, which were trained on CommonCrawl. So if all models were trained on the same dataset (OntoNotes), and the only difference is the vectors, why then is there a performance difference for the tasks that don't require vectors? I would love to find out more about each model and the settings it was trained with, but this information does not appear to be readily available.

So if all models were trained on the same dataset (OntoNotes), and the only difference is the vectors, why then is there a performance difference for the tasks that don't require vectors?
I think the missing piece you're looking for is this one: If models are initialised with vectors, those vectors will be used as features during training. Depending on the vectors, this can give the statistical model components you train a significant boost in accuracy.
However, vectors can be quite large, so you typically want to find the best trade-off between model size and accuracy. If vectors were used during training, the same vectors also need to be available at runtime, and you can't easily swap them out – otherwise, the model will perform much worse. The sm model, which wasn't trained with vectors, allows you to load in your own vectors for, say, similarity comparisons, without affecting the predictions of the pre-trained statistical components.
TL;DR: spaCy's sm, md and lg core models were all trained on the same data under the same conditions. The only difference is the vectors that are included, which are used as features and thus have an impact on the model's accuracy.
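To make the "vectors are used as features" point concrete, here is a toy sketch in plain Python. It is not spaCy's actual architecture: the 3-dimensional vectors, the weights and the `predict` helper are all made up for illustration. It only shows why weights learned against one vector table give poor predictions when a different table is swapped in at runtime.

```python
# Toy illustration only: a linear scorer whose weights were fit against
# one specific vector table (all values here are hypothetical).

def score(weights, vector):
    """Dot product between the learned weights and a word's vector features."""
    return sum(w * x for w, x in zip(weights, vector))

# Vectors the component "saw" during training.
training_vectors = {"apple": [0.9, 0.1, 0.0], "run": [0.0, 0.2, 0.9]}
# A replacement table with a different feature layout.
swapped_vectors = {"apple": [0.0, 0.1, 0.9], "run": [0.9, 0.2, 0.0]}

# Weights tuned so that a positive score means NOUN under training_vectors.
weights = [1.0, 0.0, -1.0]

def predict(word, table):
    """Label a word using the fixed weights and whichever table is supplied."""
    return "NOUN" if score(weights, table[word]) > 0 else "VERB"

print(predict("apple", training_vectors))  # features match the weights
print(predict("apple", swapped_vectors))   # same weights, mismatched features
```

A component trained without vectors as features (like those in the sm model) has no such dependency, which is why custom vectors can be loaded there without changing its predictions.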

Related

What is the purpose of a pre-trained network in Faster R-CNN?

I am not able to understand the purpose of a pre-trained network. From what I read, it is used for the RPN and the classification network. But I don't understand how.
CNNs take a notoriously long time to train, especially for more complex models with higher resolutions. To avoid days of training on a high-end GPU, pre-trained models have been made available. You then only have to train on your specific data (assuming your data is similar to the data used for pre-training). For instance, if you want to train a CNN to recognize cats in high-resolution images, you might want to start with a pre-trained model that recognizes dogs. Training should take far less time because many of the same underlying patterns have already been learned, and all your training needs to do is differentiate cats from dogs.

Tensorflow: how to restore only specific hidden layers from checkpoint and use them to build a different computational graph for inference?

Let's say I trained a model with a very complex computational graph tailored for training. After a lot of training, the best model was saved to a checkpoint file. Now, I want to use the learned parameters of this best model for inference. However, the computational graph used for training is not exactly the same as the one I intend to use for inference. Concretely, there is a module in the graph with several layers in charge of outputting embedding vectors for items (recommender system context). However, for the sake of computational performance, during inference time I would like to have all the item embedding vectors precomputed in advance, so that the only computation required per request would just involve a couple of hidden layers.
Therefore, what I would like to know is:
1. How to restore just the part of the network that outputs item embedding vectors, in order to precompute these vectors for all items (this would happen offline in some pre-processing script).
2. Once all item embedding vectors are precomputed, how to restore just the hidden layers in the later parts of the network during online inference and make them receive the precomputed item embedding vectors instead.
How can the points above be accomplished? I think point 1. is easier to get done. But my biggest concern is with point 2. In the computational graph used for training, in order to evaluate any layer I would have to provide values for the input placeholders. However, during on-line inference these placeholders would be obsolete because a lot of stuff would be precomputed and I don't know how to tell hidden layers in the later parts of the network that they should no longer depend on these obsolete placeholders but depend on the precomputed stuff instead.
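One way to think about both points is sketched below in plain Python rather than TensorFlow. The checkpoint is modelled as a dict from variable name to value, and every name in it (`item_tower`, `head`, `restore_scope`, and so on) is hypothetical. In TensorFlow 1.x the analogous step would be giving `tf.train.Saver` a `var_list` limited to the variables you want, and building a separate inference graph whose head reads the embedding from a new input instead of from the item tower; treat that mapping as an assumption to verify against your own graph.

```python
# Toy stand-in for a checkpoint: variable name -> value (all names hypothetical).
checkpoint = {
    "item_tower/w": [0.5, -1.0],
    "head/w": [2.0, 0.5],
    "head/b": 0.1,
}

def restore_scope(ckpt, scope):
    """Return only the variables under `scope`, with the scope prefix stripped."""
    prefix = scope + "/"
    return {name[len(prefix):]: value
            for name, value in ckpt.items() if name.startswith(prefix)}

def item_embedding(params, item_feature):
    """Offline step: embed one item with the restored item-tower weights."""
    return [w * item_feature for w in params["w"]]

def head(params, embedding):
    """Online step: consume a precomputed embedding instead of raw item inputs."""
    return sum(w * e for w, e in zip(params["w"], embedding)) + params["b"]

# Offline: restore just the item tower and precompute every embedding.
tower = restore_scope(checkpoint, "item_tower")
item_features = {"a": 1.0, "b": 2.0}
precomputed = {item_id: item_embedding(tower, feat)
               for item_id, feat in item_features.items()}

# Online: restore just the head and feed it the cached embedding.
head_params = restore_scope(checkpoint, "head")
print(head(head_params, precomputed["b"]))
```

The key design choice is that the inference-time head takes the embedding as an explicit input, so the original input placeholders are never part of the inference graph at all.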

Image Classification: Heavily unbalanced data over thousands of classes

I have a dataset consisting of around 5000 categories of images, but the number of images per category varies from 20 to 2000, which is quite unbalanced. Also, the number of images is far from enough to train a model from scratch. I decided to fine-tune a pretrained model, such as one of the Inception models.
But I am not sure about how to deal with unbalanced data. There are several possible approaches:
Oversampling: Oversample the minority categories. But even with aggressive image augmentation, we may not be able to avoid overfitting.
Also, how to generate balanced batches from unbalanced dataset over so many categories? Do you have some ideas about this pipeline mechanism with TensorFlow?
SMOTE: I think it is not so effective for high dimensional signals like images.
Weight the cross-entropy loss in every batch. This might be useful for a single batch, but cannot deal with the overall imbalance.
Any ideas about this? Any feedback will be appreciated.
Use tf.losses.softmax_cross_entropy and set weights for each class inversely proportional to their training frequency to "balance" the optimization.
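A minimal sketch of computing such weights, in plain Python so it does not depend on any TensorFlow version. The function names and the normalisation (total / (num_classes * count)) are my own choices, not part of the TensorFlow API. As far as I recall, the `weights` argument of `tf.losses.softmax_cross_entropy` is applied per example, so the per-class weights have to be looked up for each example's label before being passed in; check this against the documentation for your version.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by total / (num_classes * count): rare classes weigh more."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

def per_example_weights(labels, class_weights):
    """Look up the class weight for every example, in label order."""
    return [class_weights[y] for y in labels]

# Hypothetical toy labels: 8 examples of a common class, 2 of a rare one.
labels = ["cat"] * 8 + ["lynx"] * 2
class_weights = inverse_frequency_weights(labels)
example_weights = per_example_weights(labels, class_weights)
print(class_weights)
```

With this normalisation the example weights sum to the number of examples, so the overall scale of the loss stays roughly the same as in the unweighted case.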
Start with the pre-trained ImageNet layers and add your own final layers (with appropriate convolution, dropout and flatten layers as required). Freeze all but the last few of the ImageNet layers, then train on your dataset.
For unbalanced data (and in general small datasets), use data augmentation to create more training images. Keras has this functionality built-in: Building powerful image classification models using very little data

How filters are initialized in convnet

I have read a lot of papers on convnets, but there is one thing I don't understand: how are the filters in a convolutional layer initialized?
For example, in the first layer, filters should detect edges and so on.
But if they are randomly initialized, won't they be inaccurate? The same goes for the next layers and high-level features.
And another question: what is the range of the values in those filters?
Many thanks to you!
You can either initialize the filters randomly or pretrain them on some other data set.
Some references:
http://deeplearning.net/tutorial/lenet.html:
Notice that a randomly initialized filter acts very much like an edge
detector!
Note that we use the same weight initialization formula as with the
MLP. Weights are sampled randomly from a uniform distribution in the
range [-1/fan-in, 1/fan-in], where fan-in is the number of inputs to a
hidden unit. For MLPs, this was the number of units in the layer
below. For CNNs however, we have to take into account the number of
input feature maps and the size of the receptive fields.
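The initialization rule quoted above can be sketched in a few lines of plain Python. This follows the quoted text literally (uniform in [-1/fan-in, 1/fan-in]); the function names are made up, and real libraries often use other bounds, so treat it as an illustration of the idea rather than any library's default. It also answers the question about the value range: the weights start out small and centred on zero.

```python
import random

def conv_fan_in(num_input_feature_maps, filter_height, filter_width):
    """Number of inputs feeding one output unit of a convolutional filter."""
    return num_input_feature_maps * filter_height * filter_width

def init_filter(num_input_feature_maps, filter_height, filter_width, rng):
    """Sample one filter uniformly from [-1/fan_in, 1/fan_in]."""
    bound = 1.0 / conv_fan_in(num_input_feature_maps, filter_height, filter_width)
    return [[[rng.uniform(-bound, bound) for _ in range(filter_width)]
             for _ in range(filter_height)]
            for _ in range(num_input_feature_maps)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
filt = init_filter(3, 5, 5, rng)  # e.g. a 5x5 filter over 3 input feature maps
bound = 1.0 / conv_fan_in(3, 5, 5)
print(bound)
```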
http://cs231n.github.io/transfer-learning/ :
Transfer Learning
In practice, very few people train an entire Convolutional Network
from scratch (with random initialization), because it is relatively
rare to have a dataset of sufficient size. Instead, it is common to
pretrain a ConvNet on a very large dataset (e.g. ImageNet, which
contains 1.2 million images with 1000 categories), and then use the
ConvNet either as an initialization or a fixed feature extractor for
the task of interest. The three major Transfer Learning scenarios look
as follows:
ConvNet as fixed feature extractor. Take a ConvNet pretrained on ImageNet, remove the last fully-connected layer (this layer's outputs
are the 1000 class scores for a different task like ImageNet), then
treat the rest of the ConvNet as a fixed feature extractor for the new
dataset. In an AlexNet, this would compute a 4096-D vector for every
image that contains the activations of the hidden layer immediately
before the classifier. We call these features CNN codes. It is
important for performance that these codes are ReLUd (i.e. thresholded
at zero) if they were also thresholded during the training of the
ConvNet on ImageNet (as is usually the case). Once you extract the
4096-D codes for all images, train a linear classifier (e.g. Linear
SVM or Softmax classifier) for the new dataset.
Fine-tuning the ConvNet. The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new
dataset, but to also fine-tune the weights of the pretrained network
by continuing the backpropagation. It is possible to fine-tune all the
layers of the ConvNet, or it's possible to keep some of the earlier
layers fixed (due to overfitting concerns) and only fine-tune some
higher-level portion of the network. This is motivated by the
observation that the earlier features of a ConvNet contain more
generic features (e.g. edge detectors or color blob detectors) that
should be useful to many tasks, but later layers of the ConvNet
become progressively more specific to the details of the classes
contained in the original dataset. In case of ImageNet for example,
which contains many dog breeds, a significant portion of the
representational power of the ConvNet may be devoted to features that
are specific to differentiating between dog breeds.
Pretrained models. Since modern ConvNets take 2-3 weeks to train across multiple GPUs on ImageNet, it is common to see people release
their final ConvNet checkpoints for the benefit of others who can use
the networks for fine-tuning. For example, the Caffe library has a
Model Zoo where people
share their network weights.
When and how to fine-tune? How do you decide what type of transfer learning you should perform on a new dataset? This is a function of
several factors, but the two most important ones are the size of the
new dataset (small or big), and its similarity to the original dataset
(e.g. ImageNet-like in terms of the content of images and the classes,
or very different, such as microscope images). Keeping in mind that
ConvNet features are more generic in early layers and more
original-dataset-specific in later layers, here are some common rules
of thumb for navigating the 4 major scenarios:
New dataset is small and similar to original dataset. Since the data is small, it is not a good idea to fine-tune the ConvNet due to
overfitting concerns. Since the data is similar to the original data,
we expect higher-level features in the ConvNet to be relevant to this
dataset as well. Hence, the best idea might be to train a linear
classifier on the CNN codes.
New dataset is large and similar to the original dataset. Since we have more data, we can have more confidence that we won't overfit
if we were to try to fine-tune through the full network.
New dataset is small but very different from the original dataset. Since the data is small, it is likely best to only train a
linear classifier. Since the dataset is very different, it might not
be best to train the classifier from the top of the network, which
contains more dataset-specific features. Instead, it might work better
to train the SVM classifier from activations somewhere earlier in the
network.
New dataset is large and very different from the original dataset. Since the dataset is very large, we may expect that we can
afford to train a ConvNet from scratch. However, in practice it is
very often still beneficial to initialize with weights from a
pretrained model. In this case, we would have enough data and
confidence to fine-tune through the entire network.
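The four rules of thumb above can be condensed into a small lookup function. This is only a restatement of the quoted guidance in code; the function name and the returned strings are mine, and "large" and "similar" remain judgment calls for your own data.

```python
def transfer_strategy(dataset_is_large, similar_to_original):
    """Map the two factors from the rules of thumb to a suggested strategy."""
    if dataset_is_large and similar_to_original:
        return "fine-tune through the full network"
    if dataset_is_large:
        return "initialize from pretrained weights, then fine-tune the entire network"
    if similar_to_original:
        return "train a linear classifier on the CNN codes"
    return "train a linear classifier on activations from an earlier layer"

for large in (False, True):
    for similar in (True, False):
        print(large, similar, "->", transfer_strategy(large, similar))
```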
Practical advice. There are a few additional things to keep in mind when performing Transfer Learning:
Constraints from pretrained models. Note that if you wish to use a pretrained network, you may be slightly constrained in terms of the
architecture you can use for your new dataset. For example, you can't
arbitrarily take out Conv layers from the pretrained network. However,
some changes are straight-forward: Due to parameter sharing, you can
easily run a pretrained network on images of different spatial size.
This is clearly evident in the case of Conv/Pool layers because their
forward function is independent of the input volume spatial size (as
long as the strides "fit"). In case of FC layers, this still holds
true because FC layers can be converted to a Convolutional Layer: For
example, in an AlexNet, the final pooling volume before the first FC
layer is of size [6x6x512]. Therefore, the FC layer looking at this
volume is equivalent to having a Convolutional Layer that has
receptive field size 6x6, and is applied with padding of 0.
Learning rates. It's common to use a smaller learning rate for ConvNet weights that are being fine-tuned, in comparison to the
(randomly-initialized) weights for the new linear classifier that
computes the class scores of your new dataset. This is because we
expect that the ConvNet weights are relatively good, so we don't wish
to distort them too quickly and too much (especially while the new
Linear Classifier above them is being trained from random
initialization).
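The FC-to-convolution equivalence mentioned under "Constraints from pretrained models" can be checked numerically. Below is a minimal pure-Python sketch that uses a tiny 2x2x3 volume instead of AlexNet's pooling volume; the helper names and the numbers are made up for illustration.

```python
def fc_forward(weights, volume):
    """weights: K rows of length H*W*C; volume: H x W x C nested lists."""
    flat = [v for row in volume for pixel in row for v in pixel]
    return [sum(w * x for w, x in zip(row_w, flat)) for row_w in weights]

def fc_to_filters(weights, H, W, C):
    """Reshape each FC weight row (length H*W*C) into an H x W x C filter."""
    return [[[row[(a * W + b) * C:(a * W + b) * C + C] for b in range(W)]
             for a in range(H)]
            for row in weights]

def conv_forward(filters, volume):
    """Convolution with padding 0 and stride 1; returns one 2-D map per filter."""
    H, W = len(volume), len(volume[0])
    fh, fw, C = len(filters[0]), len(filters[0][0]), len(filters[0][0][0])
    return [[[sum(f[a][b][c] * volume[i + a][j + b][c]
                  for a in range(fh) for b in range(fw) for c in range(C))
              for j in range(W - fw + 1)]
             for i in range(H - fh + 1)]
            for f in filters]

volume = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]  # 2 x 2 x 3
weights = [list(range(12)), [1] * 12]                          # FC layer, K = 2
filters = fc_to_filters(weights, 2, 2, 3)

fc_out = fc_forward(weights, volume)
conv_out = conv_forward(filters, volume)  # each map is 1 x 1 here
print(fc_out, [m[0][0] for m in conv_out])

# The same filters slide over a larger 3 x 3 x 3 input, giving a 2 x 2 map each.
bigger = [[[i + j + c for c in range(3)] for j in range(3)] for i in range(3)]
bigger_out = conv_forward(filters, bigger)
```

On the original-sized volume the two layers agree exactly; on the larger input the converted layer simply produces a spatial map of scores, which is what makes running a pretrained network on bigger images possible.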
Additional References
CNN Features off-the-shelf: an Astounding Baseline for Recognition trains SVMs on features
from ImageNet-pretrained ConvNet and reports several state of the art
results.
DeCAF reported similar findings in 2013. The framework in this paper (DeCAF) was a Python-based precursor to the C++ Caffe library.
How transferable are features in deep neural networks? studies the transfer
learning performance in detail, including some unintuitive findings
about layer co-adaptations.

Can Tensorflow Wide and Deep model train to continuous values

I am working with the Tensorflow Wide and Deep model. It currently trains against a binary classification (>50K or not).
Can this model be coerced to train directly against numeric values to produce more precise (if less accurate) predictions?
I have seen an example of using LSTM RNNs to make such predictions using TensorFlowEstimator directly here, but DNNLinearCombinedClassifier will not accept n_classes=0.
I like the structure of the Wide and Deep model, especially the ability to run the linear regression and the DNN separately to determine how learnable the data is, but my application involves data that clusters, but in an overlapping, input-dependent fashion.
Use DNNLinearCombinedRegressor for regression problems.