Is it possible to reduce the NER labels when training on an existing spaCy model? - spacy

I have an existing spaCy model which I want to refine with additional training data at runtime.
For example, a training sample in my training data looks like this:
text="Anna lives in Munich and works at BMW"
entity: name=Anna
entity: city=Munich
entity: company=BMW
In my implementation I take the NER component from the existing model before I start my new training:
nlp = spacy.load(modelPath)
ner = nlp.get_pipe('ner')
and then I train my existing model with my new training data:
from spacy.util import minibatch, compounding

losses = {}
# batch up the examples using spaCy's minibatch, which is much faster
# than feeding them in one at a time
batches = minibatch(trainingData, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
    texts, annotations = zip(*batch)
    nlp.update(
        texts,          # batch of texts
        annotations,    # batch of annotations
        # drop=0.5,     # dropout - make it harder to memorise data
        losses=losses,
    )
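One detail worth adding (a sketch of a common spaCy v2 pattern, not shown in the question): disable the other pipeline components while updating, so only the NER weights change, and use resume_training() to keep the pre-trained weights:

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):    # update only the NER component
    optimizer = nlp.resume_training()    # keeps the existing weights
    # run the minibatch loop from above inside this block,
    # passing sgd=optimizer to nlp.update(...)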
Now I have the following question:
My existing NER model already contains the three entity labels
city, name, company
but my new training dataset contains only the entities 'city' and 'name' (not the entity 'company'), like:
text="Bob lives in London"
entity: name=Bob
entity: city=London
Only 'city' and 'name' are part of this sentence.
Now I have the impression that my model quality degrades if I retrain the model with training datasets that contain fewer entity types than the model currently knows.
Would it be clever to (re)set the NER component in my model with only the entity labels contained in my current training dataset before I start the training?
Something like this:
# remove the old NER component and add a fresh, empty one
nlp.remove_pipe('ner')
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
ner.add_label('city')
ner.add_label('name')
Or does this not make sense?

Now I have the impression that my model quality degrades if I retrain the model with training datasets that contain fewer entity types than the model currently knows.
Yes. This is called catastrophic forgetting.
Would it be clever to (re)set the NER component in my model with only the entity labels contained in my current training dataset before I start the training?
In my opinion, yes. If your current training data doesn't have company names in it, the model will become biased as you keep training it, and if in the future you decide to use the same model to detect company names, it may label cities or person names as companies, because it has forgotten what company names look like. If you want to keep the 'company' label instead, the usual alternative is to mix a few examples that still contain companies back into the new training data, as sketched below.
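A minimal sketch of that rehearsal idea, using the v2-style (text, annotations) training format from above; the revision sentence and its character offsets are made up for illustration:

import random

# hypothetical "revision" example that still exercises the 'company' label;
# mixing it into the new data keeps all three entity types in play
revisionData = [
    ("Clara works at Siemens",
     {"entities": [(0, 5, "name"), (15, 22, "company")]}),
]
newData = [
    ("Bob lives in London",
     {"entities": [(0, 3, "name"), (13, 19, "city")]}),
]
trainingData = newData + revisionData
random.shuffle(trainingData)  # so each minibatch mixes old and new labels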

Related

Multiple BERT binary classifications on a single graph to save on inference time

I have five classes, and I want to compare four of them against one and the same class. This isn't a one-vs-rest classifier, as for each output I want to score against the same base class.
The four outputs should be: base class vs classA, base class vs classB, etc.
I could do this by having multiple binary classification tasks, but that wastes computation time when the first layers are BERT preprocessing plus pretrained BERT layers, and the only differences between the four classifiers are the last few layers of BERT (the fine-tuned ones) and the Dense layer.
So why not merge the graphs for more performance?
My inputs are four different datasets, each annotated with true/false for each class.
As I understand it, I can re-use most of the pipeline (BERT preprocessing and the first layers of BERT), as those have shared weights. I should then be able to train the last few layers of BERT and the Dense layer on top differently depending on the branch of the classifier (maybe using something like keras.switch?).
I have tried many alternatives, including multi-class and multi-label classifiers, with actual and generated (e.g. machine-annotated) labels in the multi-label case, and different activation and loss functions, but none of the results were acceptable to me (none were as good as the four separate models).
Is there a solution for merging the four different models for more performance, or am I stuck with using 4x binary classifiers?
When you train a DNN for a specific task, it will (in the vast majority of cases) be better than a more general model that handles several tasks simultaneously. That said, in my experience a properly trained general model produces results very similar to the original binary ones. Anyway, here are a couple of suggestions for training strategies (assuming your training datasets for each task are completely different):
Weak supervision approach
Train your binary classifiers and use them to label the other datasets (i.e. label datasets [1, 3, 4] with the binary classifier trained on dataset 2). Then train your joint model as a multi-label task using all the newly labeled datasets (don't forget to shuffle the samples before feeding them to the trainer). You will need to experiment with whether to threshold the scores into 0/1 labels or to use the raw scores of the binary classifiers.
Custom loss approach
Create a custom loss function that does not penalize the model when no information is provided for a certain class. So when you introduce a sample from (say) dataset 2, the loss is calculated only for the 2nd class; a sketch is given below.
Of course you can apply both simultaneously. For example, if you know that a binary classifier produces polarized scores (most results near 0 or 1), you can use weak labels and automatically label your data with its scores. Then, during the second stage, weight each sample's loss contribution by w = 4(x - 0.5)^2, where x is the binary classifier's score (note that the model outputs logits, so you need to apply a sigmoid first). This way you increase the contribution of the samples the binary classifier is confident about and reduce that of the less certain ones.
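A minimal sketch of that masked loss in Keras, assuming labels arrive as a (batch, 4) vector where -1 marks "no information for this class" (the -1 convention and the four-output head are assumptions, and model stands for your shared-BERT, four-output Keras model):

import tensorflow as tf

def masked_binary_crossentropy(y_true, y_pred):
    # y_true: (batch, 4) with entries 0/1, or -1 where the class is unlabeled
    # y_pred: (batch, 4) logits, one per base-class-vs-X output
    mask = tf.cast(tf.not_equal(y_true, -1.0), tf.float32)
    # replace the -1 placeholders with 0 so the BCE term is well defined
    y_true_safe = tf.where(mask > 0, y_true, tf.zeros_like(y_true))
    bce = tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true_safe, logits=y_pred)
    # average only over the classes that actually carry a label
    return tf.reduce_sum(bce * mask, axis=-1) / tf.maximum(tf.reduce_sum(mask, axis=-1), 1.0)

model.compile(optimizer="adam", loss=masked_binary_crossentropy)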
As for unfreezing the last layers of BERT, usually unfreezing the upper 3-6 layers is enough. Unfreezing more layers improves results very little while increasing time and memory requirements.

Why does keras-tuner .get_best_models() only return an untrained model?

Hi, I'm using Keras Tuner to do grid searches over various hyperparameters.
Calling
tuner.results_summary(5)
returns hyperparameters and scores of the 5 best models.
We can see, then, that it is saving a record of the models. But is it only saving the record of hyperparameters and not the models themselves? When I call
model = tuner.get_best_models(num_models=1)[0]
the model returned is the untrained original with the default parameters used as a placeholder when setting up the model for the grid search. At least I can access the record of hyperparameters, but is it possible to get the best trained model?
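One common workaround (a sketch; x_train, y_train, the validation data and the epoch count are placeholders) is to rebuild the best configuration from the recorded hyperparameters and retrain it yourself:

# pull the best recorded hyperparameters and rebuild the model from them
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
model = tuner.hypermodel.build(best_hp)
# retrain the rebuilt model - it starts from fresh weights
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)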

Training with spacy on full dataset

When I train my spacy model as follows
spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
the model gets trained on train.spacy data file, and scored on dev.spacy. Then output_updated/model-best is the model with the highest score.
Is this best model finally trained on a combination of both the train and dev data? I understand that it makes sense to split those datasets to avoid overfitting, but given little training data, I would like the final model to be trained on all the data I have at hand.
No, spaCy does not automatically merge your datasets before training model-best. If you want to do that, you would need to manually create a new training dataset, e.g. by merging the two DocBin files as sketched below.
If you have so little data that seems like a good idea, you should probably prioritize getting more data.
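If you do want the merged run, a minimal sketch using spaCy's DocBin (the all.spacy filename is a placeholder; note that if you then still score against dev.spacy, the dev score is no longer an honest held-out estimate):

from spacy.tokens import DocBin

train = DocBin().from_disk("train.spacy")
dev = DocBin().from_disk("dev.spacy")
train.merge(dev)            # append the dev annotations to the train set
train.to_disk("all.spacy")  # then: spacy train config.cfg --paths.train ./all.spacy ...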

TensorFlow: Is it possible to identify the data that was used for training?

I have created a text classification model (.pb) using TensorFlow. Prediction works well.
Is it possible to check whether a sentence given for prediction was already used to train the model? I need to retrain the model when a new sentence is given to it to predict.
I did some research and couldn't find a way to get the training data with only the .pb file, because that file stores the learned parameters and not the actual training data (obviously). But if you still have the dataset, you can easily check it directly.
I don't think you can ever recover the exact training data from the trained model alone, because the model only contains the learned parameters, not the training examples.
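A minimal sketch of that direct check, assuming you saved the training sentences to a file alongside the model (train_sentences.txt is a placeholder name):

# load the saved training sentences once, then check membership per query
with open("train_sentences.txt", encoding="utf-8") as f:
    train_sentences = {line.strip() for line in f}

def seen_in_training(sentence: str) -> bool:
    # the .pb model itself cannot answer this; the saved dataset can
    return sentence.strip() in train_sentences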

Training trained seq2seq model on additional training data

I have trained a seq2seq model with 1M samples and saved the latest checkpoint. Now, I have some additional training data of 50K sentence pairs which has not been seen in previous training data. How can I adapt the current model to this new data without starting the training from scratch?
You do not have to re-run the whole network initialization. You can run incremental training instead.
Training from pre-trained parameters
Another use case is to take a base model and train it further with new training options (in particular the optimization method and the learning rate). Using -train_from without -continue will start a new training run with parameters initialized from a pre-trained model.
Remember to tokenize your 50K corpus the same way you tokenized the previous one.
Also, beginning with OpenNMT 0.9 you do not have to use the same vocabulary. See the Updating the vocabularies section and use the appropriate value with the -update_vocab option.
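For reference, a sketch of what such an incremental run might look like with OpenNMT's Lua tooling; the file names are placeholders, and the exact flags depend on your OpenNMT version, so check the training options in the docs first:

th train.lua -data newdata-train.t7 -train_from old_checkpoint.t7 -save_model model-incremental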