I am predicting similarities of documents using the pre trained spacy word embeddings. Because I have a lot of domain specific words, I want to fine tune my vectors on a rather small data set containing my domain specific vocabulary.
My idea was to just train the spacy model again with my data. But since the word vectors in spacy are built-in, I am not sure how to do that. Is there a way to train the spacy model again with my data?
After some research, I found out, that I can train my own vectors using Gensim. There I would have to download a pre trained model for example the Google News dataset model and afterwards I could train it again with my data set. Is this the only way? Or is there a way to proceed with my spacy model?
Any help is greatly appreciated.
update: the right term here was "incremental training" and thats not possible with the pre-trained spacy models.
It is however possible, to perform incremental training on a gensim model. I did that with the help of another pretrained vector set (i went with the fasttext model) and then I trained this gensim model trained with the fasttext vectors again with my own corpus. This worked pretty well
If you pre-trained word embeddings with fasttext in your domain and would like to use them with spaCy you can extend/replace the tokens from an existing spaCy model with your new fasttext vocabulary&vectors using something similar to this:
https://github.com/explosion/spaCy/issues/2538#issuecomment-404888091
or from scratch:
https://spacy.io/usage/vectors-similarity#converting
The advantage of this approach is that (1) you can keep using spacy and (2) if some tokens were present in the pre-trained spaCy but not in your corpus you will still be able to use them
Related
This question is of a more conceptual type.
I was using the pre-trained word-vectors of spacy (the de_core_news_md model).
The problem is that I have a lot of domain specific words which all get a 0-vector assignet and overall the results are in gerneral not too good.
I was wondering how one should proceed now.
should I try to fine tune the existing vectors? If so, how would one approach that?
Or, should I just not use the pre-trained word vectors of spacy and create my own?
Edit:
I want to fine tune the pre trained vectors. I've read, that I could train the already trained model again but on my data. Now my question is, how to do that. When I use spacy, i just load the model. Should I download the vectors of spacy and train a gensim model with them and afterwards again with my vectors? Or is there a better way?
Thank you in advance for any input!
I found it was a failure that I had used Gensim with GoogleNews pre-trained model to cluster phrases like:
knitting
knit loom
loom knitting
weaving loom
rainbow loom
home decoration accessories
loom knit/knitting loom
...
I am advised that GoogleNews model does't have the phrases in it. The phrases I have are a little specific to GoogleNews model while I don't have corpus to train a new model. I have only the phrases. And now I am considering to turn to BERT. But could BERT do that as I expected as above? Thank you.
You can feed a phrase into the pretrained BERT model and get an embedding, i.e. a fixed-dimension vector. So BERT can embed your phrases in a space. Then you can use a clustering algorithm (such as k-means) to cluster the phrases. The phrases do not need to occur in the training corpus of BERT, as long as the words they consist of are in the vocabulary. You will have to try to see if the embeddings give you relevant results.
I would like to incrementally train a NER Spacy Model.
By incrementally I mean send a first batch of N training samples, get a first model, then send a second batch of M training samples and get a model identical as if the N+M samples would have been sent in one batch and the model trained.
To be clear, this is not about adding samples after the model has been fully trained. Instead it is the ability to save intermediate states in the model so we can "resume" and add more training samples.
This is very useful if the number of samples is large or to create an "active learning" systems.
It seems doable with NLTK according to this article : and I was wondering if this can be done with Spacy.
So far I have trained my own custom NER model with Spacy using nlp.update but it does not seem to store any intermediate state that supports incremental training.
Yes, this is possible in spaCy. Your approach with nlp.update is correct; once you have added your second batch of training samples, you just need to make a call to nlp.to_disk("/path") (https://spacy.io/usage/saving-loading). Then you can continue this process by loading your saved model again.
I've trained a seq2seq model for machine translation (DE-EN). And I have saved the trained model checkpoint. Now, I'd like to fine-tune this model checkpoint to some specific domain data samples which have not been seen in previous training phase. Is there a way to achieve this in tensorflow? Like modifying the embedding matrix somehow.
I couldn't find any relevant papers or works addressing this issue.
Also, I'm aware of the fact that the vocabulary files needs to be updated according to new sentence pairs. But, then do we have to again start training from scratch? Isn't there an easy way to dynamically update the vocabulary files and embedding matrix according to the new samples and continue training from the latest checkpoint?
Can we use spacy with MXnet to build a deep neural network(NLP)
We are building an application using mxnet. How to use spacy with Mxnet
Spacy and MXNet serialize their models differently so they are not directly compatible.
You can leverage the pre-trained models of Spacy as part of a preprocessing step for your text data though, and then feed into an MXNet model. Aim to get your text data into an NDArray format (using mx.nd.array).
Also take a look at MXNet's Model Zoo (https://mxnet.apache.org/model_zoo/index.html) which contains a number of models for NLP tasks; Word2Vec embedding being one example.