This question may be a duplicate, but I could not find the answer on StackOverflow.
Is there a way to generate document vectors with a different number of dimensions, such as 25 instead of 300? I also checked the spaCy documentation but could not find the answer.
Thanks!
The document and word vectors in spaCy are not generated by spaCy itself; they're pre-trained embeddings built from a large corpus. For more details on the embeddings you can check Word Vectors and Semantic Similarity in the docs.
If you wanted to use your own embeddings that were 25-dimensional, you could follow the instructions here. spaCy won't train new embeddings for you; for that I'd recommend gensim.
Related
I am predicting similarities of documents using the pre-trained spaCy word embeddings. Because I have a lot of domain-specific words, I want to fine-tune my vectors on a rather small data set containing my domain-specific vocabulary.
My idea was to just train the spaCy model again with my data. But since the word vectors in spaCy are built in, I am not sure how to do that. Is there a way to train the spaCy model again with my data?
After some research, I found out that I can train my own vectors using Gensim. There I would have to download a pre-trained model, for example the Google News dataset model, and afterwards train it again with my data set. Is this the only way? Or is there a way to proceed with my spaCy model?
Any help is greatly appreciated.
Update: the right term here was "incremental training", and that's not possible with the pre-trained spaCy models.
It is, however, possible to perform incremental training on a gensim model. I did that starting from another pretrained vector set (I went with the fastText model) and then trained that gensim model again on my own corpus. This worked pretty well.
If you pre-trained word embeddings with fastText in your domain and would like to use them with spaCy, you can extend/replace the tokens of an existing spaCy model with your new fastText vocabulary and vectors using something similar to this:
https://github.com/explosion/spaCy/issues/2538#issuecomment-404888091
or from scratch:
https://spacy.io/usage/vectors-similarity#converting
The advantage of this approach is that (1) you can keep using spaCy and (2) if some tokens were present in the pre-trained spaCy model but not in your corpus, you will still be able to use them.
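Programmatically, adding vectors to a spaCy vocab boils down to `Vocab.set_vector`. A sketch with random toy vectors standing in for real fastText output (the words and the 25-dimensional size are illustrative):

```python
import numpy as np
import spacy

nlp = spacy.blank("en")  # or load an existing pipeline instead

# Toy stand-ins for word vectors read from a fastText .vec file.
rng = np.random.default_rng(0)
fasttext_vectors = {
    "loom": rng.random(25, dtype=np.float32),
    "knitting": rng.random(25, dtype=np.float32),
}

# Register each vector with the vocab; the vectors table grows as needed.
for word, vector in fasttext_vectors.items():
    nlp.vocab.set_vector(word, vector)

doc = nlp("loom knitting")
print(doc[0].has_vector)  # True once the vector is registered
```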
This question is of a more conceptual type.
I was using the pre-trained word vectors of spaCy (the de_core_news_md model).
The problem is that I have a lot of domain-specific words which all get a zero vector assigned, and overall the results are generally not too good.
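For context, this is roughly how I check which tokens fall back to the zero vector (sketched here with a blank pipeline and one toy vector in place of de_core_news_md, so it runs standalone):

```python
import numpy as np
import spacy

nlp = spacy.blank("de")  # stand-in for spacy.load("de_core_news_md")
nlp.vocab.set_vector("Haus", np.ones(25, dtype=np.float32))  # toy in-vocabulary word

doc = nlp("Haus Spezialbegriff")
# Tokens without an entry in the vectors table only get the zero vector.
oov = [token.text for token in doc if not token.has_vector]
print(oov)
```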
I was wondering how one should proceed now.
Should I try to fine-tune the existing vectors? If so, how would one approach that?
Or should I just not use the pre-trained word vectors of spaCy and create my own?
Edit:
I want to fine-tune the pre-trained vectors. I've read that I could train the already-trained model again on my own data. Now my question is how to do that. When I use spaCy, I just load the model. Should I download spaCy's vectors, train a gensim model with them, and afterwards train it again with my vectors? Or is there a better way?
Thank you in advance for any input!
I'd like to ask: is it practical to use embeddings and similarity metrics for any kind of identification task? If I had a neural network trained to find different objects in a photo, would extracting the fully-connected/Dense layers and clustering their activations be useful?
I've recently found that there is an embedding projector tool from TensorFlow that is very cool and useful. I know that there has been some work on word embeddings and how similar words cluster together. This is the case for faces as well.
Having said that, I want to follow the same methods into analyzing geological sites; can I train a model to create embeddings of the features of a site and use clustering methods to classify?
Yes, we can do that. We can use embeddings for images and visualize the embeddings in TensorBoard.
You can adapt the Fashion-MNIST embedding example found here to your use case.
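One common way to get such embeddings is to share the penultimate Dense layer of a trained classifier with a second, truncated model and run your data through it. A sketch with an untrained toy network and random data (in practice you would use your own trained model and features); the resulting vectors can then be clustered or exported for the TensorBoard projector:

```python
import numpy as np
import tensorflow as tf

# Toy classifier built functionally so we can reuse its intermediate layer.
inputs = tf.keras.Input(shape=(8,))
hidden = tf.keras.layers.Dense(16, activation="relu", name="embedding")(inputs)
outputs = tf.keras.layers.Dense(3, activation="softmax")(hidden)
model = tf.keras.Model(inputs, outputs)

# Embedding model: same weights, but stops at the penultimate layer.
embedder = tf.keras.Model(inputs, hidden)

features = np.random.rand(10, 8).astype("float32")
embeddings = embedder(features).numpy()
print(embeddings.shape)  # (10, 16): one 16-d embedding per example
```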
I tried using Gensim with the pre-trained GoogleNews model to cluster phrases like the following, and it was a failure:
knitting
knit loom
loom knitting
weaving loom
rainbow loom
home decoration accessories
loom knit/knitting loom
...
I was advised that the GoogleNews model doesn't have these phrases in it. My phrases are a bit too specific for the GoogleNews model, and I don't have a corpus to train a new model; I have only the phrases. So now I am considering turning to BERT. But could BERT do what I described above? Thank you.
You can feed a phrase into the pretrained BERT model and get an embedding, i.e. a fixed-dimension vector. So BERT can embed your phrases in a space. Then you can use a clustering algorithm (such as k-means) to cluster the phrases. The phrases do not need to occur in the training corpus of BERT, as long as the words they consist of are in the vocabulary. You will have to try to see if the embeddings give you relevant results.
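A sketch of that pipeline using the Hugging Face transformers library with bert-base-uncased and scikit-learn's k-means. Mean-pooling over the token vectors is one common way to get a fixed-size phrase vector, not the only one, and the cluster count here is arbitrary (the model is downloaded on first run):

```python
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

phrases = ["knitting", "knit loom", "loom knitting",
           "weaving loom", "rainbow loom", "home decoration accessories"]

inputs = tokenizer(phrases, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, 768)

# Mean-pool over real (non-padding) tokens to get one vector per phrase.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings.numpy())
print(labels)  # one cluster assignment per phrase
```

Whether the clusters are meaningful for your phrases is something you'll have to evaluate by inspection, as the answer says.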
I'm new to tensorflow and would like to know if there is any tutorial or example of a multi-label classification with multiple network outputs.
I'm asking this because I have a collection of articles in which each article can have several tags.
Out of the box, tensorflow supports binary multi-label classification via tf.nn.sigmoid_cross_entropy_with_logits loss function or the like (see the complete list in this question). If your tags are binary, in other words there's a predefined set of possible tags and each one can either be present or not, you can safely go with that. A single model to classify all labels at once. There are a lot of examples of such networks, e.g. one from this question.
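A minimal Keras sketch of that single-model approach: `BinaryCrossentropy(from_logits=True)` is the Keras counterpart of `tf.nn.sigmoid_cross_entropy_with_logits`, and the data here is random just to show the shapes:

```python
import numpy as np
import tensorflow as tf

num_tags = 5
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(num_tags),  # one independent logit per tag, no softmax
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))

X = np.random.rand(100, 20).astype("float32")
y = (np.random.rand(100, num_tags) > 0.7).astype("float32")  # multi-hot tag matrix
model.fit(X, y, epochs=1, verbose=0)

# Each tag gets its own probability; several can be "on" at once.
probs = tf.sigmoid(model(X[:3])).numpy()
print(probs.shape)  # (3, 5)
```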
Unfortunately, multinomial multi-label classification is not supported in tensorflow. If this is your case, you'd have to build a separate classifier for each label, each using tf.nn.softmax_cross_entropy_with_logits or a similar loss.