NER Incremental training with Spacy - spacy

I would like to incrementally train a NER Spacy Model.
By incrementally I mean send a first batch of N training samples, get a first model, then send a second batch of M training samples and get a model identical as if the N+M samples would have been sent in one batch and the model trained.
To be clear, this is not about adding samples after the model has been fully trained. Instead it is the ability to save intermediate states in the model so we can "resume" and add more training samples.
This is very useful if the number of samples is large or to create an "active learning" systems.
It seems doable with NLTK according to this article : and I was wondering if this can be done with Spacy.
So far I have trained my own custom NER model with Spacy using nlp.update but it does not seem to store any intermediate state that supports incremental training.

Yes, this is possible in spaCy. Your approach with nlp.update is correct; once you have added your second batch of training samples, you just need to make a call to nlp.to_disk("/path") (https://spacy.io/usage/saving-loading). Then you can continue this process by loading your saved model again.

Related

Issues in padding(pre-processing) of huggingface gpt2 transformer model and issues with very large dataset during model training

Objective: I am trying to train a Tensorflow Huggingface GPT2 model (language model training from scratch)
Model Description:
Huggingface GPT2 Tensorflow Model
Attached a pic of config. Model Config
Dataset Description:
I have a large dataset (~20GB),
the data is separated into multiple text files with each new line as a training example.
I am facing two issues.
The examples are of different length and I am not sure how to make all the example sizes of same length to feed to the model.
Solutions Tried: We can either pad them, but then I am not sure how to do that in batches in Tensorflow. I searched about data-collator
Doubt: Padding would have to be done to make all the examples of equal size in the batch or across the whole dataset. And would this be with tokens or some other information. (Different Data Collators for Language Modelling etc.)
Since the data is very large, it cannot be loaded in memory at once while training. (Doing model.fit). For that I am not sure how to proceed.
Solutions: I am thinking of training and saving the model on small files but that would require manual intervention or for looping and the model would not be trained on the whole dataset in one go, so if there are other alternatives. Help would be really appreciated.

Why does the loss explode during training from scratch? - Tensorflow Object Detection Models

First of all I want to state out that I am familiar with the benefits of transfer learning. Moreover I am able to train a pretrained model from 'modelzoo' on my dataset. But for research purposes I want to train my model from scratch without transferlearning.
I want to adopt the Faster-RCNN Resnet 101 implementation from tensorsflow's Object Detection API to my dataset. If I use one of the pretrained models the training goes as expected and the loss is always in 'normal' ranges (never above about 6). But if I do not use transferlearning the loss jumps very frequently to extrem high values (about 80,000,000), but between those values the loss is in normal ranges. In addition to this I do not see any predictions of the network on images in TensorBoard. It seems like the network does not make any predictions at all. The only thing which I change is to comment out those two lines in the model.config file:
# fine_tune_checkpoint: 'path'
# from_detection_checkpoint: true
I tried a lot of things to find the reason: Changed optimizer, changed the learning rate, used gradient clipping, changed the initializer used different machines to train on but nothing helps. Moreover I inspected my label_map as well as my record file. To ensure that those files are correct I redid the steps mentioned above by using the pascal voc dataset, the script to create records and the label map from the api, but even with this code from the Object Detection API without any code changes, the loss explodes (Tensorflow Object Detection API own inputs).

Pre Trained LeNet Model for License plate Recognition

I have implemented a form of the LeNet model via tensorflow and python for a Car number plate recognition system. My model was trained solely on my train data and tested on the test data. My dataset contains segmented images wherein every image has only one character in them. This is what my data looks like. My created model does not perform very well, so I'm now looking for models which I can use via Transfer Learning. Since most models, are already trained on a humongous dataset, I looked over a few like AlexNet, ResNet, GoogLeNet and Inception v2. Most of these models have not been trained on the type of data that I want which would be, Letters and digits.
Question: Should I still go forward with one of these models and train them on my dataset or are there any better models which would help ? For such models would keras be a better option since it is more high level than Tensorflow?
Question: I'd prefer to work with the LeNet model itself since training the other models would definitely take a long time due to the insufficient specs of my laptop. So is there any implementation of the model which uses machine printed character images to train the model which I could use to then train the final layers of the model on my data?
to get good results you should use a model explicitly designed for text recognition.
First, (roughly) crop the input image to the region around the text.
Then, feed the image of the text into a neural network (NN) to detect the text.
A typical NN for text recognition extracts relevant features (with convolutional NN), propagates those features through the image (with recurrent NN) and finally predicts a character score for each position in the image.
Usually, those networks are trained with the CTC loss.
As a starting point I would suggest looking at the CRNN implementation (they also provide a pre-trained model) [1] and the corresponding paper [2]. There is, as far as I remember, also a TensorFlow implementation on github.
You can use any framework (e.g TensorFlow or CNTK or ...) you like as long as it features convolutional and recurrent NN and the CTC loss.
I once attended a presentation about CNTK where they claimed that they have a very fast implementation of recurrent NN - so maybe CNTK would be a good choice for your slow computer?
[1] CRNN implementation: https://github.com/bgshih/crnn
[2] Shi - An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

finetuning tensorflow seq2seq model

I've trained a seq2seq model for machine translation (DE-EN). And I have saved the trained model checkpoint. Now, I'd like to fine-tune this model checkpoint to some specific domain data samples which have not been seen in previous training phase. Is there a way to achieve this in tensorflow? Like modifying the embedding matrix somehow.
I couldn't find any relevant papers or works addressing this issue.
Also, I'm aware of the fact that the vocabulary files needs to be updated according to new sentence pairs. But, then do we have to again start training from scratch? Isn't there an easy way to dynamically update the vocabulary files and embedding matrix according to the new samples and continue training from the latest checkpoint?

how can use torch model?

I have a Torch Model which is trained on a large scale dataset (Places Dataset) and it's authors uploaded it on github, i am working on a similar project and i want to make use of it and use it's trained weights instead of use the large dataset to train it and save time and efforts, it is possible ? how can i know the only the trained filters weights? i don't want to copy the code, i only want to make use of it and save time and efforts.
NOTE: I use Tensoflow in my implementation.